
MLOps for GenAI: Enabling Autoscaling for GPU-Intensive Workloads

by Andrei Mihai & Flavius Petruca, Adobe

📍 Atlas 2 · AI / ML · Intermediate

14:45 – 15:15

As Generative AI systems grow in scale and complexity, traditional DevOps and CI/CD patterns fall short in handling GPU-intensive, latency-sensitive workloads. Supporting large language and image models in production requires rethinking MLOps around Kubernetes primitives, GPU scheduling, and dynamic autoscaling strategies.

This talk explores how MLOps must evolve to enable efficient autoscaling for GenAI workloads: from model lifecycle differences and inference bottlenecks, to Kubernetes-based scaling mechanisms (HPA, GPU rebalancing, queue-based autoscaling), GPU utilization challenges, and trade-offs between cost, performance, and responsiveness.
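As a point of reference for the HPA mechanism mentioned above, the Kubernetes Horizontal Pod Autoscaler derives its target replica count from the documented formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). The sketch below illustrates that formula applied to an assumed GPU-utilization metric; driving HPA from GPU utilization requires exposing it as a custom metric (e.g. via DCGM and a metrics adapter), which is an assumption here, not something the talk abstract specifies.

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Core HPA scaling rule from the Kubernetes docs:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric).

    Metric semantics (here: average GPU utilization %) are an
    illustrative assumption; HPA itself is metric-agnostic.
    """
    return math.ceil(current_replicas * (current_metric / target_metric))

# 4 GPU pods averaging 90% utilization against a 60% target:
print(hpa_desired_replicas(4, 90.0, 60.0))  # → 6 (scale out)
# 2 pods averaging 30% against the same target:
print(hpa_desired_replicas(2, 30.0, 60.0))  # → 1 (scale in)
```

In practice the real controller also applies a tolerance band and stabilization windows before acting, which matters for expensive, slow-starting GPU pods.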