etcd & Control Plane Operations
etcd is the single source of truth for every Kubernetes cluster. Every object you create — Pod, Deployment, Secret, ConfigMap — is persisted as a key/value entry in etcd before any component acts on it. If etcd is unhealthy, the entire control plane stalls: the API server refuses writes, the scheduler cannot place pods, and the controller manager cannot reconcile state. Understanding how to monitor, back up, and recover etcd — and how the API server is configured to talk to it — separates engineers who truly operate Kubernetes in production from those who only consume it.
How etcd Fits Into the Control Plane
The Kubernetes control plane has four main components: kube-apiserver, kube-controller-manager, kube-scheduler, and etcd. Only the API server talks directly to etcd — all other components communicate through the API server. This is a critical design point: you never query etcd directly for operational data; you use kubectl or the Kubernetes API.
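As a small illustration of going through the API server rather than etcd itself, the sketch below asks the API server for its verbose readiness report and keeps only the etcd-related checks. The function name is hypothetical, and it assumes a kubectl context with permission to read the `/readyz` endpoint.

```shell
# Ask the API server (not etcd) whether its etcd checks pass.
# Assumes the current kubectl context may read /readyz.
etcd_check_via_apiserver() {
  kubectl get --raw='/readyz?verbose' | grep -i 'etcd'
}
```

Calling `etcd_check_via_apiserver` prints only the etcd lines of the readiness report.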
etcd Health Monitoring
The etcdctl CLI is the standard tool for querying etcd health. You must set the correct endpoints and TLS certificate paths — these are found in the etcd pod manifest (/etc/kubernetes/manifests/etcd.yaml on kubeadm clusters) or as environment variables in the etcd container.
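A minimal sketch of such a health check follows. The certificate paths and the endpoint are kubeadm defaults and are assumptions here, so confirm them against your own etcd manifest. The function names `etcd` and `etcd_health_report` are hypothetical helpers, not part of etcdctl.

```shell
# Wrapper that pins the endpoint and TLS material in one place.
# Paths below are kubeadm defaults (an assumption); check them against
# /etc/kubernetes/manifests/etcd.yaml on your own cluster.
etcd() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Health of every member, then a per-member status table
# (leader flag, raft term, database size).
etcd_health_report() {
  etcd endpoint health --cluster &&
  etcd endpoint status --cluster --write-out=table
}
```

Run `etcd_health_report` on a control-plane host. Any other etcdctl subcommand can be passed through the wrapper, for example `etcd member list`.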
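The endpoint status output shows two important size fields: DB SIZE (the physical size of the database file on disk) and DB SIZE IN USE (the portion that holds live data). A large gap between them means space freed by compaction has not yet been reclaimed by defragmentation, which wastes disk and brings the database closer to its size quota. Uncompacted MVCC revisions, by contrast, count as in-use data and inflate both figures; they are a common source of slow writes and eventual disk exhaustion.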
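To put a number on that gap, the sketch below reads the two sizes from the JSON form of the status output (`dbSize` and `dbSizeInUse`) and expresses the difference as a percentage. The helper names are hypothetical and the sample sizes are made up for illustration.

```shell
# Extract one numeric field from `etcdctl endpoint status --write-out=json`
# (read on stdin). Uses grep/cut so no JSON tooling is required.
json_field() {
  grep -o "\"$1\":[0-9]*" | head -n 1 | cut -d: -f2
}

# Gap between physical and in-use size, as a whole percentage of the
# physical size. Arguments: dbSize dbSizeInUse (both in bytes).
db_gap_pct() {
  [ "$1" -gt 0 ] || { echo 0; return 0; }
  echo $(( ($1 - $2) * 100 / $1 ))
}

# Example with made-up sizes: 2 GiB on disk, 512 MiB in use.
status='[{"Status":{"dbSize":2147483648,"dbSizeInUse":536870912}}]'
size=$(echo "$status" | json_field dbSize)
used=$(echo "$status" | json_field dbSizeInUse)
echo "gap: $(db_gap_pct "$size" "$used")%"   # prints "gap: 75%"
```

On a live member, replace the sample string with the output of `etcdctl endpoint status --write-out=json`, using the same endpoint and TLS flags as for any other etcdctl call.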
etcd does not compact its own history unless auto-compaction is configured on the etcd process (--auto-compaction-mode together with --auto-compaction-retention). In production, compact every hour and defrag weekly during a low-traffic window. The API server flag --etcd-compaction-interval (default 5m) triggers compaction from the API server side, but an explicit etcdctl defrag is still needed to reclaim disk space.
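The compact-then-defrag sequence could be scripted roughly as follows. This is a sketch under assumptions: the TLS paths are kubeadm defaults, the helper names are hypothetical, and the revision is scraped from the JSON status output with grep rather than a JSON parser.

```shell
# etcdctl with kubeadm-default TLS paths (assumption; adjust as needed).
etcd() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Compact history up to the current revision, then defragment the local
# member so the freed space is returned to the filesystem.
compact_and_defrag() {
  rev=$(etcd endpoint status --write-out=json |
        grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2)
  [ -n "$rev" ] || { echo "could not read current revision" >&2; return 1; }
  etcd compact "$rev" &&
  etcd defrag
}
```

Defragmentation blocks the member while it runs, so it is usually done one member at a time. With the single local endpoint used here, `compact_and_defrag` only touches the member on the current host.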
etcd Backups — The Most Critical Operational Task
A snapshot is a point-in-time backup of the entire etcd database. Without a recent snapshot, a complete etcd data loss means the cluster cannot be recovered without redeploying every workload from scratch. All production clusters must have automated snapshots.
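A minimal sketch of taking one snapshot is shown below. The backup directory, the file naming scheme, and the helper names are assumptions for illustration; the TLS paths are kubeadm defaults.

```shell
# etcdctl with kubeadm-default TLS paths (assumption; adjust as needed).
etcd() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Save a timestamped snapshot into the given directory (hypothetical
# default: /var/backups/etcd) and print the resulting file path.
take_snapshot() {
  dir=${1:-/var/backups/etcd}
  file="$dir/etcd-$(date -u +%Y%m%dT%H%M%SZ).db"
  mkdir -p "$dir" &&
  etcd snapshot save "$file" >&2 &&   # etcdctl chatter goes to stderr
  [ -s "$file" ] &&                   # refuse to report an empty file
  echo "$file"
}
```

`take_snapshot /var/backups/etcd` prints the path of the new snapshot on success. On etcd 3.5 and later, `etcdutl snapshot status <file>` can additionally report the snapshot's hash, revision, and key count.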
etcdctl snapshot save can be run against a follower endpoint. The snapshot is still fully consistent, because etcd linearizes reads across the cluster before serving them.
Automate snapshots with a CronJob in the cluster itself (for managed clusters) or a cron script on the control-plane host (for self-managed). Store snapshots in an object store (S3, GCS) with versioning enabled and a 30-day retention policy. Never store backups only on the same host as etcd.
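For the self-managed case, the off-host copy might look like the sketch below. It assumes the AWS CLI is installed and configured, and the bucket name, key prefix, and function name are hypothetical. Versioning and the 30-day retention rule are configured on the bucket itself, not in this script.

```shell
# Copy a finished snapshot to S3. Arguments: snapshot file, bucket name.
# Refuses to upload a missing or empty file.
ship_snapshot() {
  file=$1
  bucket=$2
  [ -s "$file" ] || { echo "snapshot missing or empty: $file" >&2; return 1; }
  aws s3 cp "$file" "s3://$bucket/etcd/$(basename "$file")"
}

# Hypothetical crontab entry on the control-plane host, hourly:
#   0 * * * * /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1
```

A wrapper script would first write the snapshot with `etcdctl snapshot save` and then pass the resulting file to `ship_snapshot`.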
Key API Server Flags
The API server's startup flags govern security, performance, and etcd connectivity. On kubeadm clusters these live in /etc/kubernetes/manifests/kube-apiserver.yaml. On managed clusters (EKS, GKE) the provider manages them and only exposes a subset for override.
--etcd-servers: comma-separated list of etcd endpoints. Keep all three members listed so the API server can reconnect on leader election.
--etcd-cafile, --etcd-certfile, --etcd-keyfile: TLS certificates for the API server's identity toward etcd. Rotate before expiry.
--etcd-compaction-interval: how often the API server triggers compaction (default 5m). Leave this enabled.
--request-timeout: global request timeout for non-watch operations (default 1m0s). Heavy LIST calls on large clusters may need this increased.
--max-requests-inflight / --max-mutating-requests-inflight: throttle concurrency to protect etcd. The defaults (400 / 200) are appropriate for clusters below ~1,000 nodes; scale proportionally above that.
--audit-log-path / --audit-policy-file: structured audit logging; required for SOC 2 / PCI compliance. Log to a file and ship with Fluentd or a sidecar.
--feature-gates: enable alpha/beta features. Treat these as unstable; never enable in production without testing in staging for a full release cycle.
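To see which of these flags are actually set on a kubeadm control-plane node, a small grep over the static pod manifest is usually enough. The function name below is hypothetical; the default path is the kubeadm location mentioned above.

```shell
# Print the lines of the API server manifest that set the flags
# discussed in this section. Argument: manifest path (optional).
apiserver_flags() {
  manifest=${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}
  grep -E -- '--(etcd-|request-timeout|max-(mutating-)?requests-inflight|audit-(log-path|policy-file)|feature-gates)' "$manifest"
}
```

A flag that does not appear in the output is running with its built-in default.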
Production etcd Topology
A production etcd cluster should always have an odd number of members: 3 for most clusters, 5 for very large or high-traffic clusters. A 3-member cluster tolerates one member failure; a 5-member cluster tolerates two. Going to 7 members rarely pays off: it tolerates three failures, but every write must be acknowledged by a majority, so each added member increases write latency.
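The arithmetic behind those numbers is short enough to check directly. The sketch below computes the majority (quorum) and the number of member failures a cluster of a given size survives; the function names are hypothetical.

```shell
# Majority needed to commit a write in a cluster of $1 members.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority is still available.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5 7; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
# members=3 quorum=2 tolerates=1
# members=4 quorum=3 tolerates=1
# members=5 quorum=3 tolerates=2
# members=7 quorum=4 tolerates=3
```

The 4-member row shows why even sizes are avoided: the extra member raises the quorum without raising the number of failures the cluster can absorb.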
Place etcd members on dedicated nodes with fast local NVMe SSDs, because etcd is extremely sensitive to disk write latency. A p99 fsync time above 10ms can trigger leader elections, and above 100ms it typically causes widespread API server errors. Monitor the etcd_disk_wal_fsync_duration_seconds Prometheus metric and alert when the 99th percentile exceeds 10ms.
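Before wiring up an alert, it can help to look at the raw histogram on one member. The sketch below scrapes etcd's metrics endpoint with curl and keeps only the WAL fsync series. The endpoint and certificate paths are kubeadm defaults and therefore assumptions, and the function name is hypothetical. The alert expression in the comment is one common way to express the 10ms threshold in PromQL.

```shell
# Print the WAL fsync histogram series from the local etcd member.
# Endpoint and certificate paths are kubeadm defaults (assumption).
wal_fsync_series() {
  curl --silent --fail \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/server.crt \
    --key /etc/kubernetes/pki/etcd/server.key \
    https://127.0.0.1:2379/metrics |
  grep '^etcd_disk_wal_fsync_duration_seconds'
}

# A possible Prometheus alert expression for the 10ms threshold:
#   histogram_quantile(0.99,
#     rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
```

The `_bucket` lines are cumulative counts per latency bound, so a healthy disk shows almost all observations already counted in the lowest few buckets.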
On AWS, use a gp3 volume (3,000 IOPS baseline, with up to 16,000 IOPS provisionable if needed) mounted at /var/lib/etcd. On bare metal, use a dedicated NVMe partition.