Probes: Liveness, Readiness & Startup
Kubernetes cannot read your application's mind. It knows a container is "running" the moment the process starts — but "running" and "healthy" are very different things. A Java service might take 45 seconds to warm its caches. A Go binary might deadlock without crashing. A web server might start accepting traffic before the database connection pool is ready. Probes are the mechanism Kubernetes uses to distinguish a healthy, ready pod from one that is broken, initialising, or temporarily overloaded.
Getting probes wrong is one of the most common causes of production incidents in Kubernetes. Either they are too aggressive (causing healthy pods to be killed in a traffic spike) or too lenient (routing traffic to a pod that cannot serve requests). This lesson covers the three probe types, their failure modes, and what big-tech SRE teams actually configure.
The Three Probe Types
- Liveness probe: answers "Is this container still alive?" If it fails, kubelet kills the container and restarts it according to the pod's restartPolicy.
- Readiness probe: answers "Is this container ready to receive traffic?" If it fails, the pod is removed from the endpoints of every matching Service (traffic stops flowing to it), but the container is not restarted.
- Startup probe: answers "Has this container finished its slow startup?" Until it succeeds, liveness and readiness probes are suspended. It runs only during startup: kubelet repeats it every periodSeconds until it succeeds, and once it succeeds it never runs again. If it exhausts its failureThreshold first, the container is killed and restarted.
Probe Mechanisms
All three probe types support four check mechanisms:
- httpGet: kubelet sends an HTTP GET. Any 2xx or 3xx status is success. Use this for HTTP services; it also tests the HTTP stack itself.
- tcpSocket: kubelet opens a TCP connection. Success = port open. Use for non-HTTP protocols (gRPC, Redis, Postgres).
- exec: kubelet runs a command inside the container. Exit code 0 = success. Use when you need application-level checks (e.g. a Redis PING via redis-cli). Avoid for high-frequency probes: exec forks a new process every tick.
- grpc (k8s ≥ 1.24): kubelet calls the gRPC health checking protocol. Ideal for gRPC-native services.
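As an illustration, here is a minimal sketch of a pod spec fragment with one container per mechanism. All container names, images, paths, and ports are hypothetical placeholders, not values from a real service.

```yaml
# Fragment of a pod spec (not a complete manifest).
# Names, images, paths, and ports are hypothetical.
containers:
  - name: web
    image: registry.example.com/web:1.0.0
    livenessProbe:
      httpGet:               # success = HTTP status 200-399
        path: /healthz
        port: 8080
  - name: tcp-backend
    image: registry.example.com/tcp-backend:1.0.0
    livenessProbe:
      tcpSocket:             # success = TCP connection can be opened
        port: 5432
  - name: cache
    image: redis:7
    livenessProbe:
      exec:                  # success = command exits with code 0
        command: ["redis-cli", "ping"]
  - name: rpc
    image: registry.example.com/rpc:1.0.0
    livenessProbe:
      grpc:                  # requires the gRPC health checking service
        port: 9090
```

The same four stanzas can be used under readinessProbe and startupProbe.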
Tuning Parameters That Matter
- initialDelaySeconds: how long to wait after container start before the first check. Default 0. Needed for slow-starting containers when you have no startup probe.
- periodSeconds: how often to check. Default 10s.
- timeoutSeconds: how long kubelet waits for a response. Default 1s, which is dangerously short for anything hitting a database.
- failureThreshold: consecutive failures before the probe is considered failed. Default 3.
- successThreshold: consecutive successes needed to flip from failed to success. Default 1. Must be 1 for liveness and startup.
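To make the defaults concrete, the sketch below spells out every tuning parameter explicitly on a readiness probe. The endpoint path and port are hypothetical; the parameter values are the Kubernetes defaults listed above.

```yaml
# Readiness probe with every tuning parameter written out at its default.
# Path and port are hypothetical.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 0   # default: first check may run right after start
  periodSeconds: 10        # default: one check every 10s
  timeoutSeconds: 1        # default: a response slower than 1s is a failure
  failureThreshold: 3      # default: 3 consecutive failures mark the probe failed
  successThreshold: 1      # default: 1 success marks the probe passing again
```

With these values, a pod that stops responding is marked unready after roughly failureThreshold × periodSeconds = 30s, so raising either number trades slower detection for more tolerance of transient slowness.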
A Production-Grade Manifest
Below is a realistic Deployment manifest for a Spring Boot service with a 40-second warm-up. It uses all three probe types correctly:
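A minimal sketch of such a manifest follows. The Deployment name, image, port, and the liveness/readiness timing values are hypothetical. The endpoint paths assume Spring Boot Actuator's health probe groups (/actuator/health/liveness and /actuator/health/readiness) are enabled. The startup probe uses failureThreshold: 6 and periodSeconds: 10, matching the 60s startup window discussed in the next section.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-service
  template:
    metadata:
      labels:
        app: orders-service
    spec:
      containers:
        - name: app
          image: registry.example.com/orders-service:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080
          # Startup: up to 6 x 10s = 60s for the 40s warm-up.
          # Liveness and readiness are suspended until this succeeds.
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            failureThreshold: 6
          # Liveness: checks only the process, never external dependencies.
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          # Readiness: controls whether Services route traffic to the pod.
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
```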
The Startup Probe: Why It Exists
Before startup probes existed (k8s < 1.16), teams used a large initialDelaySeconds on the liveness probe to cover slow startup. The problem: if a pod deadlocked during startup, kubelet would not detect it until the delay expired. Startup probes solve this cleanly — they grant a generous startup window while still detecting post-startup deadlocks promptly.
The formula is: max startup time = failureThreshold × periodSeconds. Set this to 150–200% of your measured P99 cold-start time. For the manifest above: 6 × 10s = 60s.
Common Production Failure Modes
- Liveness probe hitting the database: if the DB is slow, the probe times out, kubelet kills healthy pods, and the resulting crash loop amplifies the DB load. Liveness must check only the process itself.
- Default timeoutSeconds of 1 on a readiness probe: an endpoint with a 200ms P99 can occasionally take 2s under GC pressure, and every timeout counts as a failure. With failureThreshold: 3, three consecutive timed-out checks (only 3s of cumulative probe latency) are enough to start shedding the pod. Set timeoutSeconds to at least your P99 × 3.
- No startup probe plus a small initialDelaySeconds on a slow JVM: kubelet fires liveness before the JVM finishes loading, sees a failure, and kills the container. The pod enters a crash loop and never starts. Adding a startup probe with an adequate failureThreshold is the fix.
- Readiness endpoint that never fails: a readiness probe that always returns 200, even under overload, provides no protection; traffic continues routing to a saturated pod. Implement back-pressure logic (e.g. return 503 when the request queue depth exceeds a threshold).
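If the liveness probe depends on the database, a database outage makes the probe fail, and after failureThreshold failures kubelet kills a pod that is perfectly healthy and just waiting for a downstream dependency. The database outage now also causes a pod restart storm.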
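The first two failure modes can be contrasted in a short sketch. The endpoint paths (/health/db, /livez, /readyz), the port, and the timeout values are hypothetical and should be derived from your own measured latencies.

```yaml
# Anti-pattern, shown commented out: liveness depends on the database,
# so a slow or unavailable DB gets healthy containers killed.
# livenessProbe:
#   httpGet:
#     path: /health/db       # hypothetical endpoint that runs a DB query
#     port: 8080
#   timeoutSeconds: 1

# Safer split: liveness checks only the process; readiness may reflect
# dependency state, because a readiness failure never restarts the container.
livenessProbe:
  httpGet:
    path: /livez             # hypothetical: returns 200 if the process can respond
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3          # headroom above the 1s default
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz            # hypothetical: may return 503 under overload
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```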
Inspecting Probe Status
Use kubectl describe pod <name> and look at the Events section and the Containers block. Probe failures appear as Warning events with reason Unhealthy. The kubectl get pod READY column (e.g. 0/1) reflects readiness probe state.
Exec Probe for Non-HTTP Services
For a Redis sidecar or a database pod, an exec probe is idiomatic:
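A minimal sketch of such a probe for a Redis container, assuming redis-cli is available inside the image and Redis listens on its default port. The container name, image tag, and timing values are illustrative.

```yaml
# Fragment of a pod spec: exec probes for a Redis container.
# Container name, image tag, and timings are illustrative.
containers:
  - name: redis
    image: redis:7
    livenessProbe:
      exec:
        command: ["redis-cli", "ping"]   # exit code 0 = success
      periodSeconds: 10    # keep exec probes infrequent: each check forks a process
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      exec:
        command: ["redis-cli", "ping"]
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
```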
timeoutSeconds < 2, or liveness probes that query external dependencies. Probes are treated as a reliability contract, not an afterthought.