LogoCND
← Back to Schedule

It's 3 AM and Your Kubernetes Cluster is Down: Why AI Should Be On-Call Instead of You

by Viktor Farcic, Upbound

📍 Atlas Keynote

10:15 – 10:45

It's 3 AM. Your phone buzzes. Production is down. A Pod won't start. You stumble out of bed, run kubectl events, wade through hundreds of normal events to find the one warning that matters, describe the Pod, trace back through the ReplicaSet to the Deployment, realize a PersistentVolumeClaim is missing, write the YAML, apply it, validate the fix. Thirty minutes later, you're back in bed, wondering if there's a better way.

There is. What if AI could detect the issue, analyze the root cause, suggest a fix, and validate that it worked? What if all four phases happened automatically, or at least with your approval, while you stayed in bed?

This talk demonstrates a complete AI-powered Kubernetes remediation system built on the Model Context Protocol (MCP) and agentic AI. We'll walk through the traditional manual troubleshooting process so you understand exactly what we're automating, then show how an AI agent can handle detection, analysis, remediation, and validation. You'll see a Kubernetes controller that monitors events and triggers automated responses, an MCP server that orchestrates LLM-powered root cause analysis, and a security model that keeps write operations firmly under your control.