Debugging Pods & Workloads
When something breaks in a Kubernetes cluster, the blast radius is often invisible at first. A Pod shows CrashLoopBackOff on the dashboard, traffic to a Service silently drops, or a Deployment stalls at 2 of 3 desired replicas and never converges. Unlike a process running on a server you can SSH into and observe directly, Kubernetes abstracts the runtime behind several layers — the API server, the kubelet, the container runtime, the overlay network. Effective debugging means knowing exactly which layer to interrogate and which kubectl command surfaces that layer.
This lesson gives you the full systematic toolkit used by SREs at companies operating large Kubernetes fleets: the four canonical debugging commands, how to read each output, the three most common pod failure states and their root causes, and the structured mental model that gets you from alert to fix in minutes rather than hours.
The Debugging Hierarchy
Always work from the broadest view down to the narrowest: cluster events → object description → container logs → live shell. Jumping straight to logs before checking events is the most common mistake — you miss the scheduler decision, the image pull failure, or the probe rejection that explains everything.
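The broad-to-narrow order above can be written down as a small helper that emits the kubectl invocations in sequence. This is only a sketch: the function name `triage_commands` and the namespace and Pod names in the usage lines are hypothetical, and it assumes the container image has `sh` available for the final step.

```python
def triage_commands(namespace, pod, crashed=False):
    """Return kubectl invocations in broad-to-narrow order as argv lists."""
    logs = ["kubectl", "logs", pod, "-n", namespace]
    if crashed:
        # --previous reads the last terminated container instead of the current one.
        logs.append("--previous")
    return [
        # 1. Cluster events for the namespace, oldest first.
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
        # 2. Full object description, including events scoped to the Pod.
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # 3. Container logs.
        logs,
        # 4. Live shell (assumes the image ships `sh`).
        ["kubectl", "exec", "-it", pod, "-n", namespace, "--", "sh"],
    ]


# Hypothetical namespace and Pod name, for illustration only.
for argv in triage_commands("shop", "checkout-7d9f", crashed=True):
    print(" ".join(argv))
```

Returning argv lists rather than one shell string keeps the commands safe to pass to `subprocess.run` without quoting problems.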
kubectl get events — The Cluster Audit Trail
Events are Kubernetes' own structured record of what happened to each object in a namespace. Unlike container logs, which live on the node and can vanish along with the container, events are written by control-plane and node components (the scheduler, the kubelet, the deployment controller) and are retained for one hour by default. When a Pod fails before you can read its logs, events are often the only record of why.
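As a rough illustration of using events for triage, the sketch below takes the parsed output of `kubectl get events -o json`, keeps only Warning events, and orders them oldest to newest. The function name `warning_events` and the sample data are hypothetical (the messages are placeholders, not captured from a real cluster). The field names `type`, `reason`, `message`, `lastTimestamp`, and `involvedObject` are those of the core v1 Event object, and the sketch assumes `lastTimestamp` is populated.

```python
def warning_events(event_list):
    """Return Warning events from a parsed `kubectl get events -o json`
    document, sorted oldest to newest by lastTimestamp."""
    items = event_list.get("items", [])
    warnings = [e for e in items if e.get("type") == "Warning"]
    # UTC timestamps in the same RFC 3339 format sort correctly as strings.
    return sorted(warnings, key=lambda e: e.get("lastTimestamp") or "")


# Hypothetical sample shaped like the API output; messages are placeholders.
SAMPLE = {
    "items": [
        {"type": "Normal", "reason": "Pulled", "message": "placeholder",
         "lastTimestamp": "2024-01-01T10:00:05Z",
         "involvedObject": {"name": "web-1"}},
        {"type": "Warning", "reason": "BackOff", "message": "placeholder",
         "lastTimestamp": "2024-01-01T10:02:00Z",
         "involvedObject": {"name": "web-1"}},
        {"type": "Warning", "reason": "FailedScheduling", "message": "placeholder",
         "lastTimestamp": "2024-01-01T10:00:00Z",
         "involvedObject": {"name": "web-2"}},
    ]
}

for e in warning_events(SAMPLE):
    print(e["lastTimestamp"], e["involvedObject"]["name"], e["reason"])
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the output of `kubectl get events -n <ns> -o json` piped in.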
Each event carries a Reason field (e.g. FailedScheduling, BackOff, Pulled, Started, Killing) and a Message field with human-readable detail. The Reason alone often tells you which layer is broken: scheduler, image pull, probe, or OOM killer.
kubectl describe: Full Object State + Event History
kubectl describe pod <name> is the single richest debugging command. It renders the full object spec merged with live status fields and appends the event stream scoped to that object. Learn to read it in sections:
- Status / Phase: Pending, Running, Succeeded, Failed, Unknown. This is the coarse signal.
- Conditions: Boolean flags like PodScheduled, Initialized, ContainersReady, Ready. If PodScheduled=False, the problem is in scheduling, not in your image.
- Containers → State: Waiting (with a Reason), Running, or Terminated (with ExitCode). Exit code 137 means the process received SIGKILL (128 + 9), most often from the OOM killer, in which case the Reason reads OOMKilled; 1 is a generic application error; 0 is a clean exit, which a long-running service should not produce.
- Containers → Last State: The previous container run. This is critical for CrashLoopBackOff because it shows the exit code from the last crash.
- Events: Scoped to this Pod. Shows image pull progress, probe failures, and scheduling decisions.
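The exit codes in the State and Last State fields follow the usual convention that a value above 128 means "terminated by signal (code minus 128)". A minimal sketch of that decoding is shown below; the function name `explain_exit_code` and the small signal table are illustrative choices, and the result is a first guess rather than a diagnosis.

```python
# A few common signal numbers on Linux; extend as needed.
_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code):
    """Give a first-guess reading of a container exit code."""
    if code == 0:
        return "clean exit"
    if code > 128:
        # Convention: 128 + N means the process was ended by signal N.
        signum = code - 128
        return "killed by " + _SIGNALS.get(signum, f"signal {signum}")
    return "application error"


for code in (0, 1, 137, 143):
    print(code, explain_exit_code(code))
```

A 137 tells you only that SIGKILL was delivered. Whether the OOM killer sent it is confirmed by the Reason shown next to the exit code in the describe output.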
kubectl logs — Reading Container Output
kubectl logs fetches stdout and stderr from the container runtime. When a Pod has crashed, pass --previous to read the logs of the dead container rather than the current (empty) one.
kubectl logs reads from the node's local container log files (/var/log/pods/). When a Pod is evicted or the node is drained, those files can disappear. In production, always ship logs to a centralised store (Loki, OpenSearch, Datadog); kubectl logs is for fast triage, not the primary log archive.
kubectl exec: Live Shell Inside a Running Container
kubectl exec opens a process inside an already-running container. Use it to inspect the filesystem, test network reachability from inside the Pod's network namespace, or validate environment variables and mounted secrets that your app will actually see.
If the image ships without a shell (distroless and scratch-based images, for example), kubectl exec -it <pod-name> -- bash will fail with "OCI runtime exec failed." Use an ephemeral debug container instead: kubectl debug -it <pod-name> --image=busybox --target=<container-name>. This adds an ephemeral container to the running Pod that shares the target container's process namespace, without restarting the Pod or changing its existing containers.
Common Pod Failure States
ImagePullBackOff (and ErrImagePull)
The kubelet cannot pull the container image from the registry. ErrImagePull is reported when a pull attempt fails; the kubelet then retries with an exponentially increasing delay, capped at 5 minutes, and the state shows ImagePullBackOff while it waits. Root causes, in order of frequency:
- Tag does not exist: a typo in the image tag, or a CI pipeline that failed to push the new tag before the Deployment was updated.
- Wrong or missing imagePullSecret: the registry requires authentication (ECR, GCR, GHCR, private Docker Hub) and the Pod spec does not reference a valid secret, or the secret is in a different namespace from the Pod.
- Registry rate limit: Docker Hub caps anonymous pulls per source IP address (historically 100 per 6 hours), and a cluster with many nodes pulling the same image without credentials can exhaust that quickly.
- Network policy or firewall: the node cannot reach the registry endpoint (common in air-gapped or VPC-restricted environments).
CrashLoopBackOff
The container starts, runs briefly, then exits with a non-zero code (or sometimes zero). Kubernetes restarts it. The restart loop continues with exponential back-off (10 s, 20 s, 40 s … capped at 5 min). After enough crashes the state stabilises as CrashLoopBackOff, which is Kubernetes communicating: "I keep trying, and it keeps failing." Root causes:
- Application startup error: the process cannot connect to a database, reads a required environment variable that is missing or has the wrong value, or fails a startup validation check.
- Liveness probe misconfiguration: the probe is too aggressive (e.g. initialDelaySeconds: 0 on a slow-starting Java service). Kubernetes kills the container before it finishes starting, triggering a loop.
- OOM kill on startup: the memory limit is too low for the process to initialise. Check for exit code 137 in Last State.
- Missing configuration: a ConfigMap or Secret the app depends on is not mounted, or was mounted at a path the app does not expect.
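The restart back-off quoted for CrashLoopBackOff (10 s, 20 s, 40 s, capped at 5 min) is a doubling sequence with a ceiling. The sketch below reproduces that arithmetic so you can estimate how long a Pod has been looping from its restart count. The function name `backoff_delays` is hypothetical, and the code models the schedule as described here rather than the kubelet's actual implementation.

```python
def backoff_delays(restarts, base=10, cap=300):
    """Delay in seconds before each of the first `restarts` restarts:
    doubles from `base` and is clamped at `cap`."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


delays = backoff_delays(7)
print(delays)       # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))  # 910 seconds spent waiting across seven restarts
```

Once the cap is reached, every further restart costs a full five minutes, which is why a fix can appear to "not work" for several minutes after you apply it.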
Pending — The Pod That Never Starts
A Pod stuck in Pending has been accepted by the API server but the scheduler has not placed it. The event reason is almost always FailedScheduling. The message will tell you exactly which constraint failed:
- Insufficient CPU / memory: no node has enough unallocated resources. Solution: scale the node group, reduce requests, or check for forgotten Pods consuming resources.
- No nodes match node selector or affinity: a nodeSelector requires a label that no node has.
- PersistentVolumeClaim not bound: the Pod requires a PVC that is in Pending state (no matching PV, or the StorageClass is wrong).
- Taint not tolerated: all available nodes have a taint the Pod does not tolerate.
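For the "Insufficient CPU / memory" case it helps to remember that the scheduler compares resource requests against what is still unallocated on a node, not against live usage. The sketch below shows that comparison in simplified form. The helper names are hypothetical, only the `m` CPU suffix and the `Ki`, `Mi`, `Gi` memory suffixes are handled, and the real scheduler considers many more factors.

```python
def cpu_millicores(q):
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)


_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_bytes(q):
    """Convert a memory quantity such as '512Mi' or '1Gi' to bytes.
    Only the binary suffixes Ki, Mi and Gi are handled in this sketch."""
    for suffix, factor in _MEM_UNITS.items():
        if q.endswith(suffix):
            return int(q[:-2]) * factor
    return int(q)  # plain byte count


def fits(pod_request, node_allocatable, already_requested):
    """True if the Pod's requests fit into what is left on the node."""
    free_cpu = (cpu_millicores(node_allocatable["cpu"])
                - cpu_millicores(already_requested["cpu"]))
    free_mem = (memory_bytes(node_allocatable["memory"])
                - memory_bytes(already_requested["memory"]))
    return (cpu_millicores(pod_request["cpu"]) <= free_cpu
            and memory_bytes(pod_request["memory"]) <= free_mem)


# Hypothetical node: 4 CPUs and 8Gi allocatable, mostly requested already.
node = {"cpu": "4", "memory": "8Gi"}
used = {"cpu": "3600m", "memory": "6Gi"}
print(fits({"cpu": "500m", "memory": "1Gi"}, node, used))  # False: 400m CPU left
print(fits({"cpu": "250m", "memory": "1Gi"}, node, used))  # True
```

The first Pod is rejected even if the node is idle in practice, because the 3600m already requested by other Pods counts as allocated.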
kubectl get pod -o wide: Always pass -o wide when reviewing Pods in bulk. It adds the Node column, the Pod IP, and the nominated node for Pending Pods. In a multi-node cluster, seeing all replicas of a service land on the same node is a strong hint that an anti-affinity rule or topology spread constraint is missing.
Reading the Full Picture: a Structured Runbook
When an alert fires or a user reports a 503, use this sequence every time — no improvising:
- Run kubectl get pods -n <ns> to identify which Pods are not Running with all containers ready (for example 1/1).
- Run kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30 for the cluster-level signal.
- Run kubectl describe pod <name> -n <ns> and read Conditions, Container State, Last State, Events.
- Run kubectl logs <name> -n <ns> --previous if the container crashed; kubectl logs -f <name> -n <ns> if running but misbehaving.
- Run kubectl exec -it <name> -n <ns> -- sh to validate connectivity and config from inside the Pod's network namespace.
- If the Pod is healthy but the Service is dropping traffic, inspect the Endpoints: kubectl get endpoints <service-name> -n <ns>. Empty endpoints mean the label selector does not match any Pod, or that no matching Pod is Ready.
Every step surfaces a different layer. Skipping any one of them risks chasing the wrong hypothesis for an hour.
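The last runbook step, empty Endpoints, usually comes down to a label mismatch or a Pod that is not Ready. The sketch below mimics that selection logic for simple equality-based selectors. The function name `backing_pods`, the dictionary shape, and the sample Pods are hypothetical and only illustrate the rule.

```python
def backing_pods(selector, pods):
    """Names of Pods that a Service with this equality-based selector
    would route to: every selector pair must appear in the Pod's labels,
    and the Pod must be Ready."""
    if not selector:
        # A Service without a selector gets no Endpoints created for it.
        return []
    return [
        p["name"] for p in pods
        if p["ready"]
        and all(p["labels"].get(k) == v for k, v in selector.items())
    ]


# Hypothetical Pods for illustration.
pods = [
    {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}, "ready": True},
    {"name": "web-2", "labels": {"app": "webb"}, "ready": True},   # label typo
    {"name": "web-3", "labels": {"app": "web"}, "ready": False},   # failing probe
]
print(backing_pods({"app": "web"}, pods))  # ['web-1']
```

Extra labels on a Pod do not prevent a match, but a single mistyped value or a failing readiness probe removes the Pod from the Service.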