Let there be chaos. Practical chaos engineering for resilient Kubernetes clusters

by Bogdan Stan, ING Hubs

📍 Atlas 1 · Platform Engineering · Beginner

17:00 – 17:30

This session introduces a shift from reactive firefighting to proactive resilience: instead of responding to outages, you learn to prevent them.

We build on Kubernetes for scale, but how can we be sure our clusters withstand a network partition, a “noisy neighbour,” or a pod that simply refuses to stay alive?

We’ll start with the “Theory of chaos,” outlining steady states and hypotheses, before moving into the “controlled destruction” phase. From there, we’ll run through three live experiments designed to test common architectural weaknesses:

Network chaos: Simulating latency and packet loss to test timeout configurations.
Pod faults: Testing service discovery and auto‑scaling under sudden pod failure.
Stress scenarios: Forcing CPU/Memory exhaustion to validate resource limits and monitoring setups.

Finally, we’ll use Chaos Mesh to keep the blast radius of each experiment small, learning how to uncover vulnerabilities on our own terms.