Advanced Kubernetes Operations

etcd & Control Plane Operations

etcd is the single source of truth for every Kubernetes cluster. Every object you create — Pod, Deployment, Secret, ConfigMap — is persisted as a key/value entry in etcd before any component acts on it. If etcd is unhealthy, the entire control plane stalls: the API server refuses writes, the scheduler cannot place pods, and the controller manager cannot reconcile state. Understanding how to monitor, back up, and recover etcd — and how the API server is configured to talk to it — separates engineers who truly operate Kubernetes in production from those who only consume it.

How etcd Fits Into the Control Plane

The Kubernetes control plane has four main components: kube-apiserver, kube-controller-manager, kube-scheduler, and etcd. Only the API server talks directly to etcd — all other components communicate through the API server. This is a critical design point: you never query etcd directly for operational data; you use kubectl or the Kubernetes API.

The API server is the sole etcd client. All other control-plane components and external clients go through the API server.

etcd Health Monitoring

The etcdctl CLI is the standard tool for querying etcd health. You must set the correct endpoints and TLS certificate paths — these are found in the etcd pod manifest (/etc/kubernetes/manifests/etcd.yaml on kubeadm clusters) or as environment variables in the etcd container.

# Export common etcdctl env vars for a kubeadm cluster
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Check overall cluster health
etcdctl endpoint health --cluster

# Check per-member status (shows leader, raft term, DB size)
etcdctl endpoint status --cluster -w table

# List all members
etcdctl member list -w table

# Watch key activity in real time (useful during incident investigation)
etcdctl watch --prefix /registry/pods/default

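The endpoint status output exposes two size figures: DB SIZE, the physical size of the database file on disk, and DB SIZE IN USE, the space occupied by live data (dbSizeInUse in the JSON output; depending on the etcdctl version, the table view may omit this column). A large gap between them means compaction has freed pages that defragmentation has not yet returned to the filesystem. Left alone, the file keeps growing until it reaches the backend quota (--quota-backend-bytes, 2 GiB by default), at which point etcd raises a NOSPACE alarm and rejects writes.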
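The gap can be turned into a number per member. The following is a minimal sketch, assuming python3 is available on the host and that the JSON output carries dbSize and dbSizeInUse under Status, as recent etcd releases do; db_reclaimable is a hypothetical helper name, not part of etcdctl.

```shell
# Read the JSON from `etcdctl endpoint status --cluster -w json` on stdin and
# print, per endpoint: file size, size in use, and the share defrag could reclaim.
db_reclaimable() {
  python3 -c '
import json, sys
for ep in json.load(sys.stdin):
    st = ep["Status"]
    size = st["dbSize"]
    # Assumption: if the field is missing, treat the whole file as in use.
    in_use = st.get("dbSizeInUse", size)
    pct = 100 * (size - in_use) // size if size else 0
    print(ep["Endpoint"], size, in_use, str(pct) + "%")
'
}

# Usage on a live cluster (needs the ETCDCTL_* variables exported above):
#   etcdctl endpoint status --cluster -w json | db_reclaimable
```

A member that reports a high percentage here is a candidate for defragmentation; the threshold that justifies a defrag is a judgment call for your cluster.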

Compact and defragment on a schedule. Compaction discards old MVCC revisions; defragmentation returns the freed space to the filesystem. The API server already requests compaction on its own: --etcd-compaction-interval defaults to 5m. etcd's built-in auto-compaction is a separate mechanism and stays off unless --auto-compaction-retention is set on the etcd process (--auto-compaction-mode selects periodic or revision-based behavior). Neither mechanism defragments, so run etcdctl defrag explicitly, for example weekly during a low-traffic window, to reclaim disk space.

# Compact to the current revision (reclaim MVCC history)
REV=$(etcdctl endpoint status --cluster -w json \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(max(s['Status']['header']['revision'] for s in d))")
etcdctl compact $REV

# Defragment every member with one command (each member blocks while it defragments)
etcdctl defrag --cluster
# Or target one endpoint at a time, leaving the leader for last:
etcdctl defrag --endpoints=https://192.168.1.11:2379
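To defragment one member at a time with the leader last, the endpoint order can be derived from the status output. This is a sketch under the assumption that a member is the leader when the leader ID it reports equals its own member ID in the response header; followers_first is a hypothetical helper name.

```shell
# Read the JSON from `etcdctl endpoint status --cluster -w json` on stdin and
# print one endpoint per line: followers first, the leader last.
followers_first() {
  python3 -c '
import json, sys
members = json.load(sys.stdin)
# True for the leader, False for followers; False sorts first and the sort is stable.
is_leader = lambda m: m["Status"]["leader"] == m["Status"]["header"]["member_id"]
for m in sorted(members, key=is_leader):
    print(m["Endpoint"])
'
}

# Usage on a live cluster (not run here):
#   etcdctl endpoint status --cluster -w json | followers_first | while read -r ep; do
#     etcdctl defrag --endpoints="$ep" || break
#     etcdctl endpoint health --endpoints="$ep" || break
#   done
```

Stopping at the first failure keeps a problem on one member from spreading to the rest of the cluster.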

etcd Backups — The Most Critical Operational Task

A snapshot is a point-in-time backup of the entire etcd database. Without a recent snapshot, a complete etcd data loss means the cluster cannot be recovered without redeploying every workload from scratch. All production clusters must have automated snapshots.

Snapshot from a non-leader if possible. Taking a snapshot from the leader under write load can cause a brief latency spike for API server requests, so prefer pointing etcdctl snapshot save at a single follower endpoint. The result is a consistent point-in-time copy of that member's database; a follower may trail the leader by a few revisions, which is acceptable for a backup.

# Save a snapshot to a local file
etcdctl snapshot save "/backup/etcd/etcd-snapshot-$(date +%Y-%m-%dT%H%M%S).db"

# Verify the snapshot (prints hash, revision, total keys)
etcdctl snapshot status /backup/etcd/etcd-snapshot-2025-06-11T020000.db -w table

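# --- Restore from snapshot (run on every control-plane node, using the same snapshot file) ---
# On etcd 3.5+, etcdctl snapshot status/restore are deprecated; etcdutl accepts the same flags.
# Change --name and --initial-advertise-peer-urls below to match the node you are on.
# Stop the API server and controller-manager first by moving their static pod manifests away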
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/

etcdctl snapshot restore /backup/etcd/etcd-snapshot-2025-06-11T020000.db \
  --name etcd-0 \
  --initial-cluster "etcd-0=https://192.168.1.10:2380,etcd-1=https://192.168.1.11:2380,etcd-2=https://192.168.1.12:2380" \
  --initial-cluster-token etcd-cluster-prod \
  --initial-advertise-peer-urls https://192.168.1.10:2380 \
  --data-dir /var/lib/etcd-restored

# Point the etcd manifest at the new data dir, then restore control-plane manifests
sed -i 's|/var/lib/etcd|/var/lib/etcd-restored|g' /etc/kubernetes/manifests/etcd.yaml
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/

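Automate snapshots. On self-managed clusters, use a cron job or systemd timer on a control-plane host, or a Kubernetes CronJob pinned to control-plane nodes with the etcd certificates mounted. On managed clusters you cannot reach etcd at all, so etcd backups are the provider's responsibility. Store snapshots in an object store (S3, GCS) with versioning enabled and a 30-day retention policy. Never store backups only on the same host as etcd.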
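A host-side backup job could look roughly like the sketch below. The function names are hypothetical, the 30-day local pruning simply mirrors the retention figure above, and the upload step is left as a comment because the bucket and tooling depend on your environment.

```shell
# Delete local snapshot files older than N days. Other files in the directory are left alone.
prune_snapshots() {
  local dir=$1 days=$2
  find "$dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' -mtime +"$days" -delete
}

# Save a snapshot, verify it, then prune old local copies.
# Not invoked here: it needs a live etcd and the ETCDCTL_* variables.
backup_etcd() {
  local dir=${1:-/backup/etcd}
  local file="$dir/etcd-snapshot-$(date +%Y-%m-%dT%H%M%S).db"
  etcdctl snapshot save "$file" || return 1
  etcdctl snapshot status "$file" -w table || return 1
  # Upload "$file" to your object store here before pruning (tooling is site-specific).
  prune_snapshots "$dir" 30
}
```

Calling backup_etcd from cron or a systemd timer on a control-plane host gives the scheduled snapshot; a non-zero exit status is the signal to alert on.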

Key API Server Flags

The API server's startup flags govern security, performance, and etcd connectivity. On kubeadm clusters these live in /etc/kubernetes/manifests/kube-apiserver.yaml. On managed clusters (EKS, GKE) the provider manages them and only exposes a subset for override.

--etcd-servers: comma-separated list of etcd endpoints. List every member so the API server can fail over when one becomes unreachable.
--etcd-cafile, --etcd-certfile, --etcd-keyfile: the CA bundle and the client certificate and key the API server uses to authenticate to etcd. Rotate before expiry.
--etcd-compaction-interval: how often the API server requests compaction (default 5m). Leave this enabled.
--request-timeout: global request timeout for non-watch operations (default 1m0s). Heavy LIST calls on large clusters may need this increased.
--max-requests-inflight / --max-mutating-requests-inflight: throttle concurrency to protect etcd. The defaults (400 / 200) are appropriate for clusters below ~1,000 nodes; scale proportionally above that.
--audit-log-path / --audit-policy-file: structured audit logging, commonly required for SOC 2 / PCI compliance. Log to a file and ship with Fluentd or a sidecar.
--feature-gates: enable alpha/beta features. Treat these as unstable; never enable in production without testing in staging for a full release cycle.
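To see which of these flags a kubeadm control plane sets explicitly, the static pod manifest can be read directly. This is a small sketch; apiserver_flag is a hypothetical helper, and it assumes each flag appears on its own line in the form "- --name=value", as kubeadm writes them.

```shell
# Print the value of one kube-apiserver flag from a static pod manifest.
# Prints nothing when the flag is not set explicitly (the built-in default applies).
apiserver_flag() {
  local flag=$1 manifest=${2:-/etc/kubernetes/manifests/kube-apiserver.yaml}
  sed -n "s/^[[:space:]]*-[[:space:]]*--${flag}=//p" "$manifest"
}

# Usage on a control-plane node (not run here):
#   apiserver_flag etcd-servers
#   apiserver_flag request-timeout
```

An empty result for --request-timeout or the inflight limits means the cluster is running with the defaults listed above.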

Managed vs. self-hosted control planes. On EKS, GKE, and AKS, AWS/Google/Azure run the API server and etcd — you do not have SSH access to control-plane nodes and cannot edit manifests directly. The cloud provider handles HA, backups, and patches. On self-managed clusters (kubeadm, Talos, Rancher RKE2) you own every flag and every backup. This is a major operational difference: managed clusters trade control for operational simplicity, while self-managed clusters give full power at the cost of full responsibility.

Production etcd Topology

A production etcd cluster should always have an odd number of members: 3 for most clusters, 5 for very large or high-traffic clusters. A 3-member cluster tolerates one member failure; a 5-member cluster tolerates two. An even member count raises the quorum size without adding fault tolerance. Going to 7 members rarely pays off: it tolerates three failures, but every write must be replicated to and acknowledged by a majority (four members), which adds write latency.
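The arithmetic behind these numbers is short enough to write out. In the sketch below, quorum and fault_tolerance are illustrative helper names; the formulas are the standard majority rule, quorum = floor(n/2) + 1.

```shell
# Majority needed to commit a write in an n-member cluster.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority still remains.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5 6 7; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
# members=3 quorum=2 tolerates=1
# members=4 quorum=3 tolerates=1
# members=5 quorum=3 tolerates=2
# members=6 quorum=4 tolerates=2
# members=7 quorum=4 tolerates=3
```

Moving from 3 to 4 or from 5 to 6 members adds a node that must be kept healthy without tolerating any additional failure.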

Place etcd members on dedicated nodes with fast local NVMe SSDs, because etcd is extremely sensitive to disk write latency. As a rule of thumb, a p99 WAL fsync time above 10ms puts the cluster at risk of missed heartbeats and leader elections, and values approaching 100ms tend to cause widespread API server errors. Monitor the etcd_disk_wal_fsync_duration_seconds Prometheus histogram and alert when its 99th percentile exceeds 10ms.
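One way to express that alert is a Prometheus rule file. The sketch below writes such a file from the shell; the alert name, the 5m rate window, the 10m "for" duration, the severity label, and the write_fsync_alert helper are all assumptions to adapt, while the metric and the 10ms threshold come from the text above.

```shell
# Write a Prometheus rule file that fires when p99 WAL fsync latency exceeds 10 ms.
write_fsync_alert() {
  cat > "$1" <<'EOF'
groups:
- name: etcd-disk
  rules:
  - alert: EtcdWalFsyncSlow
    # The histogram is in seconds, so 0.01 is 10 ms.
    expr: |
      histogram_quantile(0.99,
        sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
      ) > 0.01
    for: 10m
    labels:
      severity: warning
EOF
}

# Usage (not run here):
#   write_fsync_alert etcd-fsync.rules.yml
#   promtool check rules etcd-fsync.rules.yml
```

Grouping by instance keeps one slow disk from being averaged away by healthy members.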

Separate etcd disks from the OS disk. A noisy process saturating or filling the OS disk will starve etcd WAL writes and trigger elections. On cloud VMs, attach a dedicated volume mounted at /var/lib/etcd; an EBS gp3 volume, for example, provides a 3,000 IOPS baseline and can be provisioned with more IOPS independently of its size (gp3 does not burst the way gp2 does). On bare metal, use a dedicated NVMe partition.