Kubernetes Fundamentals

Cluster Architecture

A Kubernetes cluster is split into two planes that serve completely different roles: the control plane, which is the brain, and the data plane (worker nodes), which is the muscle. Every production outage, every performance problem, and every scaling decision ultimately traces back to one of these components. Understanding what each piece does — and why it was designed that way — is what separates engineers who operate Kubernetes from engineers who merely use it.

[Figure: Kubernetes cluster architecture. Control plane: API Server (kube-apiserver, single entry point), etcd (cluster state store), Scheduler (kube-scheduler), Controller Manager (kube-controller-manager, reconciliation loops), Cloud Controller Manager (cloud-specific resources: LB, volumes, nodes). Worker Node 1 and Worker Node 2 each run kubelet (node agent), kube-proxy (iptables / eBPF), containerd (container runtime), and a Pod whose containers (app + sidecar on node 1, app + worker on node 2) share network and storage. Dashed lines: API Server watches/pushes to kubelets.]
Kubernetes cluster: the control plane manages cluster state; worker nodes run workloads.

The Control Plane

The control plane is a set of processes that implement the Kubernetes API and maintain the desired state of the cluster. In managed offerings (GKE, EKS, AKS) you never see these VMs — the cloud vendor runs them for you and gives you an SLA. In self-managed clusters (kubeadm, Talos, k3s) you own every one of them.

API Server (kube-apiserver)

The API server is the only component that reads from and writes to etcd. Everything else (kubelets, controllers, your kubectl) talks to the API server over HTTPS. For every incoming request it authenticates and authorizes the caller, runs admission control (built-in plugins such as LimitRanger and ResourceQuota, plus webhooks such as OPA Gatekeeper), validates the object against its schema, and only then persists it.

This single-entry-point design makes the API server your cluster's most critical component. Because it is stateless, a high-availability control plane runs several replicas behind a load balancer, typically three, often co-located with the etcd members. Very large clusters go further and split storage by resource type, for example by moving high-churn Event objects to a dedicated etcd cluster with --etcd-servers-overrides, but you are unlikely to need that until you operate clusters with thousands of nodes.

Why everything goes through the API server: Any component that bypassed the API server and wrote directly to etcd would skip admission control, RBAC, and audit logging. The API server is the enforcement boundary for all cluster policy.
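The enforcement boundary can be pictured as a fixed pipeline that every request walks through before anything is stored. The following is a minimal sketch, not the real kube-apiserver: ToyApiServer, Request, and Rejected are hypothetical names, and the sets and dict stand in for the credential store, RBAC rules, and etcd.

```python
from dataclasses import dataclass, field


class Rejected(Exception):
    """Raised when any stage refuses the request."""


@dataclass
class Request:
    user: str
    verb: str
    kind: str
    obj: dict


@dataclass
class ToyApiServer:
    # Hypothetical in-memory stand-ins for credentials, RBAC rules, and etcd.
    known_users: set
    allowed: set                      # (user, verb, kind) tuples
    store: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def handle(self, req: Request) -> None:
        # Every request is audited, including the ones that get rejected.
        self.audit_log.append((req.user, req.verb, req.kind))
        if req.user not in self.known_users:                      # 1. authentication
            raise Rejected("unauthenticated")
        if (req.user, req.verb, req.kind) not in self.allowed:    # 2. authorization
            raise Rejected("forbidden")
        obj = dict(req.obj)
        obj.setdefault("namespace", "default")                    # 3. admission (defaulting)
        if "name" not in obj:                                     # 4. validation
            raise Rejected("invalid: name is required")
        key = (req.kind, obj["namespace"], obj["name"])
        self.store[key] = obj                                     # 5. persist


server = ToyApiServer(
    known_users={"alice", "bob"},
    allowed={("alice", "create", "Pod")},
)
server.handle(Request("alice", "create", "Pod", {"name": "web"}))
print(sorted(server.store))
```

A client that wrote to the store dict directly would skip all four checks and leave no audit entry, which is the failure mode the paragraph above describes.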

etcd

etcd is a strongly consistent, distributed key-value store based on the Raft consensus algorithm. It holds the entire cluster state: every Pod spec, every Service definition, every Secret, every ConfigMap. A quorum of etcd members must agree before any write is acknowledged, which is why you always run an odd number of members (3 or 5) in production — a cluster of 3 can tolerate 1 failure; a cluster of 5 can tolerate 2.
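The quorum arithmetic is easy to check by hand. A short sketch (the function names are illustrative, not part of any etcd API) shows why an even member count buys nothing: going from 3 to 4 members raises the quorum without raising the number of failures the cluster survives.

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)}")
```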

etcd is not a database for your application data. It is sized for metadata, not large blobs. By default, etcd rejects any request larger than 1.5 MiB (the --max-request-bytes limit), and the backend database has a storage quota (--quota-backend-bytes, 2 GiB by default). Operators routinely compact and defragment etcd on a schedule to keep its database file from growing without bound.

Production pitfall (etcd disk latency): etcd fsyncs its write-ahead log before acknowledging a write. On a cloud VM with a slow disk, slow fsyncs delay heartbeats and can trigger repeated leader elections, which surface as API server timeouts and then as failures across the cluster. Always run etcd on a volume with low write latency (AWS io2, GCP pd-ssd, or a dedicated local NVMe disk). Monitor the etcd_disk_wal_fsync_duration_seconds Prometheus metric: a p99 above 10 ms is a red flag.

Scheduler (kube-scheduler)

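The scheduler watches the API server for Pods that have no nodeName assigned and selects the best node for each one. Such Pods show up as Pending, although Pending also covers Pods that are already bound to a node but whose containers have not started yet. The selection is a two-phase process: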

  1. Filtering — eliminates nodes that cannot satisfy the Pod's hard constraints: CPU/memory requests, node selector, taints, affinity/anti-affinity rules, topology spread.
  2. Scoring — ranks the remaining nodes using a weighted set of functions (LeastAllocated, BalancedResourceAllocation, ImageLocality, etc.) and binds the Pod to the highest-scoring node.

The scheduler writes the node name back to the API server as a Binding object; it never directly communicates with the kubelet.
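The filter-then-score flow can be sketched in a few lines. This is a simplified model under stated assumptions: the Node and Pod classes are hypothetical, taints and tolerations are reduced to plain sets, and the score is a naive "most resources left over" heuristic rather than the real plugin math.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int           # unreserved CPU in millicores
    free_mem_mi: int          # unreserved memory in MiB
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering: every hard constraint must hold."""
    return (
        pod.cpu_m <= node.free_cpu_m
        and pod.mem_mi <= node.free_mem_mi
        and all(node.labels.get(k) == v for k, v in pod.node_selector.items())
        and node.taints <= pod.tolerations   # every taint must be tolerated
    )


def score(pod: Pod, node: Node) -> int:
    """Scoring: naive heuristic that prefers the node with the most headroom."""
    return (node.free_cpu_m - pod.cpu_m) + (node.free_mem_mi - pod.mem_mi)


def schedule(pod: Pod, nodes: list):
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None                          # the Pod stays unscheduled
    return max(candidates, key=lambda n: score(pod, n)).name


nodes = [
    Node("node-a", 500, 1024, {"disk": "ssd"}),                  # too little CPU
    Node("node-b", 4000, 8192, {"disk": "ssd"}, {"dedicated"}),  # tainted
    Node("node-c", 2000, 4096, {"disk": "ssd"}),
    Node("node-d", 8000, 16384, {"disk": "hdd"}),                # wrong label
]
pod = Pod("web", 1000, 512, {"disk": "ssd"})
print(schedule(pod, nodes))
```

Note that node-d has the most free capacity but never reaches the scoring phase: filtering is a hard gate, and scoring only ranks what survives it.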

Controller Manager (kube-controller-manager)

This is a single binary that runs dozens of control loops (controllers) in goroutines. Each controller watches one or more resource types and reconciles actual state toward desired state:

  • ReplicaSet controller: ensures the right number of Pod replicas exist.
  • Deployment controller: manages rolling updates and rollbacks via ReplicaSets.
  • Node controller: marks a node NotReady once its heartbeats have been missing for --node-monitor-grace-period (40 s by default in most releases) and taints it; Pods are then evicted when their toleration for that taint expires (300 s by default).
  • Job controller, CronJob controller, Namespace controller, ServiceAccount controller, and many more.
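All of these controllers share the same shape: compare desired state with actual state, act on the difference, repeat. A minimal sketch of one pass for a ReplicaSet-like resource follows; reconcile_replicaset and new_pod_name are hypothetical helpers, and Pods are reduced to names in a list.

```python
import itertools

_ids = itertools.count(1)


def new_pod_name() -> str:
    """Generate a unique Pod name (stand-in for the random suffix Kubernetes adds)."""
    return f"web-{next(_ids)}"


def reconcile_replicaset(desired: int, pods: list, make_name) -> list:
    """One pass of a level-triggered loop: look at what exists now, then converge."""
    pods = list(pods)
    while len(pods) < desired:
        pods.append(make_name())   # too few replicas: create
    while len(pods) > desired:
        pods.pop()                 # too many replicas: delete
    return pods


pods = reconcile_replicaset(3, [], new_pod_name)      # initial rollout
pods.remove("web-2")                                  # a Pod disappears, e.g. its node died
pods = reconcile_replicaset(3, pods, new_pod_name)    # the next pass repairs the gap
print(pods)
```

The loop never needs to know why a Pod went missing. It only looks at the current count, which is why a controller that restarts mid-operation can simply run the same pass again.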

Cloud Controller Manager

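Cloud-specific logic was extracted from kube-controller-manager into its own binary so cloud providers can ship their own implementation independently of the Kubernetes release cycle. It handles provisioning cloud load balancers for Services of type LoadBalancer, configuring network routes, and syncing node metadata (instance type, zone, and region labels), including removing Node objects whose cloud instance no longer exists. Cloud volumes for PersistentVolumeClaims are provisioned and attached by CSI drivers in current releases rather than by this component.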

Worker Nodes

Every worker node runs three components that together form the execution environment for Pods.

kubelet

The kubelet is a long-running agent that registers the node with the API server and reconciles the set of Pods that the scheduler has assigned to it. For each assigned Pod, the kubelet instructs the container runtime to pull images and start containers, then monitors health via liveness and readiness probes. It reports node capacity, allocatable resources, and Pod status back to the API server: node status is computed roughly every 10 seconds (configurable via --node-status-update-frequency), and a lightweight Lease object is renewed at a similar interval as the node's heartbeat.

The kubelet is also the only component that talks directly to the container runtime, via the Container Runtime Interface (CRI), a gRPC API. This abstraction lets you swap containerd for another CRI implementation such as CRI-O, or plug in a sandboxed runtime such as Kata Containers underneath, without changing anything else.

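Pro tip (kubelet logs are an early debugging stop): When a Pod is stuck in ContainerCreating, or in Pending after it has been assigned to a node, check the kubelet log on that node: journalctl -u kubelet -f. Image pull failures, runtime errors, and cgroup issues surface here, often with more detail than kubectl describe pod shows. If the Pod has no node yet, the kubelet has nothing to report; read the scheduler's events in kubectl describe pod instead.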

kube-proxy

kube-proxy watches Services and their endpoints (EndpointSlices in current releases) and programs the node's networking rules to implement virtual IP routing. In most clusters it uses iptables mode: each Service IP gets a chain of DNAT rules that load-balance across healthy Pod IPs. In high-throughput clusters, an eBPF data plane such as Cilium's kube-proxy replacement takes over this job and bypasses iptables entirely for lower latency and better scalability.
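In iptables mode the rules in a Service's chain are evaluated top to bottom, so an even split needs a different match probability on each rule. The sketch below works through that arithmetic, assuming the commonly described scheme in which rule i of n matches with probability 1/(n - i) and the last rule always matches; the function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(endpoints: int) -> list:
    """Match probability for each rule in the chain, evaluated top to bottom."""
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_share(endpoints: int) -> list:
    """Overall fraction of new connections that each endpoint receives."""
    shares, remaining = [], Fraction(1)
    for p in rule_probabilities(endpoints):
        shares.append(remaining * p)   # traffic that reached this rule and matched
        remaining *= 1 - p             # traffic that falls through to the next rule
    return shares


print([str(p) for p in rule_probabilities(3)])   # ['1/3', '1/2', '1']
print([str(s) for s in effective_share(3)])      # ['1/3', '1/3', '1/3']
```

The per-rule probabilities rise toward 1 precisely so that the effective share stays uniform. It also hints at the scalability cost: a connection to the last endpoint has to traverse every rule before it.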

Container Runtime

The container runtime is responsible for image management and the actual lifecycle of container processes. The de facto standard is containerd (a CNCF graduated project, extracted from Docker). It pulls OCI-compliant images from registries, manages the on-disk layer cache, and calls the lower-level runc (or a sandboxed runtime like gVisor) to create isolated namespaced processes.

# Inspect control-plane component health
# (componentstatuses is deprecated since v1.19; the readyz endpoint is the supported check)
kubectl get componentstatuses
kubectl get --raw='/readyz?verbose'

# View all nodes and their roles
kubectl get nodes -o wide

# Describe a node to see capacity, allocatable resources, taints, and conditions
kubectl describe node <node-name>

# Check kubelet status on a node (run this on the node via SSH or via kubectl debug)
systemctl status kubelet
journalctl -u kubelet --since "5 min ago" | tail -40

# List running system Pods in the control-plane namespace (kubeadm clusters)
kubectl get pods -n kube-system -o wide

# kubeadm: check etcd health from inside its Pod
kubectl -n kube-system exec -it etcd-<control-plane-node> -- \
  etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health

# View etcd member list (expect odd count: 3 or 5 in production)
kubectl -n kube-system exec -it etcd-<control-plane-node> -- \
  etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member list -w table

How the Components Talk at Startup

When you submit kubectl apply -f deployment.yaml, here is the exact sequence:

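  1. kubectl serializes your manifest and, for an object that does not exist yet, sends a POST /apis/apps/v1/namespaces/.../deployments to the API server (an existing object is updated with a PATCH instead).
  2. API server authenticates the caller (client certificate, bearer token, or OIDC), authorizes the request (RBAC), runs mutating admission, validates the schema, runs validating admission, and writes the Deployment object to etcd.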
  3. Deployment controller (inside controller-manager) watches for new Deployments, creates a ReplicaSet object, writes it to the API server.
  4. ReplicaSet controller creates the required number of Pod objects (with no nodeName) — writes them to the API server.
  5. Scheduler detects the unbound Pods, runs filtering and scoring, writes the node binding.
  6. kubelet on the chosen node detects the Pod is assigned to it, instructs containerd to pull the image and start containers.
  7. kubelet reports Pod status back; the API server updates the object in etcd. kubectl get pods now shows Running.

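Every step is an asynchronous watch loop: control-plane components and kubelets never call each other to coordinate, they only read and write objects through the API server. (The API server does open direct connections to kubelets for logs, exec, and port-forward, but no scheduling or reconciliation depends on them.) This is the architecture that lets Kubernetes self-heal: because all state lives in etcd, any component can crash and restart, and its reconciliation loop resumes from the current state.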
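The whole chain can be imitated with a shared store and a handful of independent loops. The sketch below is a toy under loud assumptions: the store dict stands in for the API server plus etcd, asynchronous watches are replaced by repeated sequential passes, the scheduler uses round-robin instead of filtering and scoring, and all function names are hypothetical.

```python
import copy
import itertools

# Stand-in for the API server + etcd: the only thing the loops share.
store = {"Deployment": {}, "ReplicaSet": {}, "Pod": {}}
_suffix = itertools.count(1)


def deployment_controller():
    for name, dep in store["Deployment"].items():
        store["ReplicaSet"].setdefault(f"{name}-rs", {"replicas": dep["replicas"]})


def replicaset_controller():
    for rs_name, rs in store["ReplicaSet"].items():
        owned = [p for p in store["Pod"].values() if p["owner"] == rs_name]
        for _ in range(rs["replicas"] - len(owned)):
            pod_name = f"{rs_name}-{next(_suffix)}"
            store["Pod"][pod_name] = {"owner": rs_name, "nodeName": None, "phase": "Pending"}


def scheduler(nodes=("node-1", "node-2")):
    unbound = [p for p in store["Pod"].values() if p["nodeName"] is None]
    for i, pod in enumerate(unbound):
        pod["nodeName"] = nodes[i % len(nodes)]   # round-robin, not real scoring


def kubelet(node):
    for pod in store["Pod"].values():
        if pod["nodeName"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"


def run_until_settled():
    """Run every loop repeatedly until a full pass changes nothing."""
    while True:
        before = copy.deepcopy(store)
        deployment_controller()
        replicaset_controller()
        scheduler()
        kubelet("node-1")
        kubelet("node-2")
        if store == before:
            return


store["Deployment"]["web"] = {"replicas": 3}      # the effect of `kubectl apply`
run_until_settled()
print(sorted((name, p["nodeName"], p["phase"]) for name, p in store["Pod"].items()))

# Simulate a lost Pod: the next pass recreates it with no special recovery code.
del store["Pod"]["web-rs-1"]
run_until_settled()
print(sorted(store["Pod"]))
```

No function here calls another component's function; each one only reads and writes the store. That is the property that makes the real system restartable, since a loop that crashes loses nothing the store does not already hold.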