Kubernetes & GitOps in DR
When a region goes dark, the question is not whether your cluster can self-heal — Kubernetes has been doing that for years. The question is whether you can recreate the entire control plane, all its Operators, all its namespaced workloads, all its stateful data, and all its network policy in a different region inside your RTO window, with confidence that the result is identical to what just failed. That question is answered before the incident, not during it — by the quality of your GitOps practice and your backup strategy for stateful workloads.
If the only record of what runs in your cluster is kubectl apply history or Helm release state stored in-cluster, you cannot reliably recreate it elsewhere. The Git repository is the single source of truth; the cluster is a runtime reflection of that truth. Treat the repo as infrastructure, not documentation.
Rebuilding a Cluster from Git
A well-structured GitOps repository contains everything needed to bootstrap a cluster from zero: Flux or ArgoCD bootstrap manifests, Helm releases for platform components (cert-manager, external-dns, ingress-nginx, Prometheus stack, Velero), Kustomize overlays per environment, and all application workloads. The recovery procedure is deterministic and testable.
At Google and Stripe, the GitOps repo is the authoritative cluster definition, and any drift the reconciler detects between Git state and live state is treated as a production incident. This discipline is what makes DR viable: you can provision a blank cluster in the DR region and let the GitOps controller converge it to the desired state without any manual steps beyond bootstrapping the controller itself.
Two details matter at production scale. First, the path in the repo should encode the cluster identity (clusters/prod-us-east-1) so that one repo can hold multiple cluster definitions without collision. Second, secrets must not be in Git in plaintext: use Sealed Secrets or SOPS with age/GPG. In a DR scenario, the sealed-secrets controller key or the SOPS key must be available in the new cluster before Flux can decrypt application secrets. Keep that key in a cloud secret store encrypted by your cloud KMS (AWS KMS, GCP KMS), or have SOPS encrypt against a KMS key directly, and have the bootstrap process retrieve it via an IAM role. It should never live in a human's head or a spreadsheet.
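As a minimal sketch of the path convention, the Python below derives the repo path from a cluster identity and assembles a `flux bootstrap github` command line for it. The `ClusterIdentity` type, the owner `example-org`, and the repository `fleet-infra` are hypothetical names chosen for illustration; the `--owner`, `--repository`, `--branch`, and `--path` flags are the ones the Flux CLI documents for GitHub bootstrap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterIdentity:
    """Hypothetical identity record: one environment in one region."""
    environment: str  # e.g. "prod"
    region: str       # e.g. "us-east-1"

    @property
    def repo_path(self) -> str:
        # Follows the clusters/prod-us-east-1 convention described above.
        return f"clusters/{self.environment}-{self.region}"


def flux_bootstrap_argv(cluster: ClusterIdentity, owner: str,
                        repository: str, branch: str = "main") -> list[str]:
    """Build the `flux bootstrap github` argument list for one cluster."""
    return [
        "flux", "bootstrap", "github",
        f"--owner={owner}",
        f"--repository={repository}",
        f"--branch={branch}",
        f"--path={cluster.repo_path}",
    ]


if __name__ == "__main__":
    # Assumed DR target; owner and repository names are placeholders.
    dr_cluster = ClusterIdentity("prod", "us-west-2")
    print(" ".join(flux_bootstrap_argv(dr_cluster, "example-org", "fleet-infra")))
```

Because the path is computed from the identity, the primary and DR clusters can share one repository while reconciling from separate directories.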
Rehearse the rebuild on a throwaway cluster: kind or a real cloud cluster provisioned by Terraform, bootstrapped with Flux, checked for convergence, then destroyed. The script that does this is your DR runbook for the cluster layer. If it takes more than 15 minutes to converge, identify the bottleneck; it is usually a slow Helm chart pull or an image pull from a registry that requires credentials not yet present.
Velero: Backup and Restore for Kubernetes Resources and Volumes
Velero (formerly Heptio Ark) is the de facto standard for Kubernetes backup and restore. It serializes cluster objects (Deployments, Services, ConfigMaps, PVCs, CRDs, RBAC) to object storage (S3, GCS, Azure Blob), and optionally snapshots persistent volumes via CSI snapshots or provider volume plugins. In clusters with hundreds of namespaces, Velero's schedules and include/exclude filters are what keep backup size and restore time manageable.
In large production clusters at companies like Shopify or Cloudflare, Velero backups run against namespaces grouped by criticality tier. Tier-1 namespaces (payments, auth, core APIs) back up hourly; Tier-2 (analytics, batch jobs) back up daily. The restore SLO is verified quarterly via actual restore drills into a staging cluster — not by trusting that the backup file exists.
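One way to keep tiered schedules consistent is to generate them from a single table. The sketch below renders one Velero `Schedule` object per tier as a JSON `List` that `kubectl apply -f -` can consume. The namespace names, cron expressions, and TTL values are illustrative assumptions, not recommendations; the fields used (`spec.schedule`, `spec.template.includedNamespaces`, `spec.template.ttl`) are part of Velero's `Schedule` resource.

```python
import json

# Hypothetical tier table. The hourly/daily cadence mirrors the text;
# namespace names and retention (ttl) values are assumptions.
TIERS = {
    "tier1": {"cron": "0 * * * *", "ttl": "72h0m0s",
              "namespaces": ["payments", "auth", "core-api"]},
    "tier2": {"cron": "0 2 * * *", "ttl": "720h0m0s",
              "namespaces": ["analytics", "batch"]},
}


def velero_schedule(tier: str, cron: str, ttl: str, namespaces: list[str]) -> dict:
    """Return a Velero Schedule manifest (as a dict) for one criticality tier."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": f"{tier}-backup", "namespace": "velero"},
        "spec": {
            "schedule": cron,  # standard cron expression
            "template": {
                "includedNamespaces": list(namespaces),
                "ttl": ttl,  # how long each backup is retained
            },
        },
    }


def render_schedules(tiers: dict = TIERS) -> str:
    """Render all tiers as one JSON List document."""
    items = [velero_schedule(name, **cfg) for name, cfg in tiers.items()]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)


if __name__ == "__main__":
    print(render_schedules())
```

Committing the generator's output to the GitOps repo keeps the backup policy under the same review and reconciliation as everything else in the cluster.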
Stateful Recovery: The Hard Part
Stateless Kubernetes workloads are trivially recoverable via GitOps: pod spec, ConfigMap, Service, HPA, all of it is in Git. The challenge is stateful workloads: databases, message queues, and any volume holding data that the data plane itself does not already replicate to the DR region.
The industry has three patterns for this, and which you choose depends on your RPO:
- Active-passive with Velero volume backup. Velero snapshots PVCs hourly to S3 with cross-region replication. On failover, Velero restores the latest snapshot into the DR cluster. RPO is tied to snapshot frequency, typically 1 hour. Works well for workloads where an hour of data loss is acceptable (batch processing, analytics, non-financial state). Restore time for a 500 GB volume via --default-volumes-to-fs-backup (file-system backup via restic/kopia) is 20-40 minutes depending on throughput.
- Application-level replication with Velero for config-only restore. Postgres logical replication, MySQL GTID replication, or Kafka MirrorMaker 2 keeps data continuously synchronized to the DR region. Velero only backs up Kubernetes objects (the Deployment, Service, Secret, ConfigMap), not the volume. On failover, restore Kubernetes objects from Velero, then promote the already-replicated standby. RPO approaches zero; restore time is minutes. This is what Stripe and similar companies do for financial data.
- Operator-managed HA (Patroni, Vitess, CockroachDB). The database operator handles replication, quorum, and leader election across regions natively. No Velero volume backup needed — the operator promotes the nearest healthy replica. Velero still backs up the CRDs and Operator state so the operator configuration itself can be recreated in a cold-start scenario.
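The restore-time and RPO figures in the first pattern are simple arithmetic, and it can help to run the numbers for your own volumes. The helpers below are a rough sketch: they assume decimal units (1 GB = 1000 MB), a sustained throughput with no per-file overhead, and a worst case in which the failure hits just before the next backup finishes. All function names are hypothetical.

```python
def implied_throughput_mb_s(size_gb: float, minutes: float) -> float:
    """Sustained MB/s needed to move size_gb within the given minutes."""
    return size_gb * 1000 / (minutes * 60)


def restore_minutes(size_gb: float, throughput_mb_s: float) -> float:
    """Estimated minutes to restore size_gb at a sustained throughput."""
    return size_gb * 1000 / throughput_mb_s / 60


def worst_case_rpo_minutes(backup_interval_min: float,
                           backup_duration_min: float = 0.0) -> float:
    """Worst-case data loss window for periodic backups.

    If the region fails just before the next backup completes, the last
    usable backup reflects state from one full interval plus one backup
    duration earlier.
    """
    return backup_interval_min + backup_duration_min


if __name__ == "__main__":
    for minutes in (20, 40):
        rate = implied_throughput_mb_s(500, minutes)
        print(f"500 GB in {minutes} min needs about {rate:.0f} MB/s")
    print(f"worst-case RPO: {worst_case_rpo_minutes(60, 10):.0f} min")
```

Under these assumptions, restoring 500 GB in 20 to 40 minutes implies a sustained rate of roughly 210 to 420 MB/s, which is worth checking against what your object store and node network actually deliver.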
Velero supports volume exclusion via backup.velero.io/backup-volumes-excludes and pre/post hooks via pod annotations; use them.
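To make the annotations concrete, here is a small sketch that builds the annotation map for a pod template. The annotation keys are the ones Velero documents for volume exclusion and backup hooks; the container name, volume name, and the `fsfreeze` commands with their mount path are hypothetical placeholders that would need to match your workload.

```python
import json


def velero_pod_annotations(container: str, exclude_volumes: list[str],
                           pre_cmd: list[str], post_cmd: list[str]) -> dict:
    """Build Velero pod annotations for volume exclusion and pre/post hooks."""
    return {
        # Comma-separated volume names to skip during file-system backup.
        "backup.velero.io/backup-volumes-excludes": ",".join(exclude_volumes),
        # Hook commands are passed as JSON arrays encoded in a string.
        "pre.hook.backup.velero.io/container": container,
        "pre.hook.backup.velero.io/command": json.dumps(pre_cmd),
        "post.hook.backup.velero.io/container": container,
        "post.hook.backup.velero.io/command": json.dumps(post_cmd),
    }


if __name__ == "__main__":
    # Assumed example: freeze a data mount before backup, thaw it afterwards,
    # and skip a scratch volume entirely.
    annotations = velero_pod_annotations(
        container="db",
        exclude_volumes=["scratch"],
        pre_cmd=["/sbin/fsfreeze", "--freeze", "/var/lib/data"],
        post_cmd=["/sbin/fsfreeze", "--unfreeze", "/var/lib/data"],
    )
    print(json.dumps(annotations, indent=2))
```

The resulting map belongs under `metadata.annotations` of the pod template, so it is versioned in Git alongside the workload it protects.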
Validating DR Readiness
A GitOps repo and Velero schedules are necessary but not sufficient. The only thing that validates DR readiness is a successful restore. Netflix runs quarterly "Chaos Kong" exercises that simulate the loss of an entire AWS Region (its Availability Zone counterpart is Chaos Gorilla). At Google, DiRT tests cover complete region-level failures. For most organizations, a monthly restore drill into a throwaway cluster, verifying that all Deployments converge, all PVCs restore, and application health checks pass, is the minimum acceptable practice. Automate it with a script that runs velero restore create, polls until completion, runs smoke tests, then destroys the cluster. Store the pass/fail in your incident management system as evidence of DR readiness.
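A drill script along these lines might look like the sketch below. It shells out to `velero restore create` and `velero restore get -o json`, polls `status.phase`, and leaves smoke tests to a caller-supplied function. The set of terminal phases is an assumption to verify against your Velero version, the function names are hypothetical, and cluster teardown and result reporting are intentionally left out.

```python
import json
import subprocess
import time

# Phases treated as terminal here; check them against your Velero version.
TERMINAL_PHASES = {"Completed", "PartiallyFailed", "Failed", "FailedValidation"}


def restore_phase(name: str) -> str:
    """Read status.phase of a restore via the Velero CLI."""
    result = subprocess.run(
        ["velero", "restore", "get", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout).get("status", {}).get("phase", "")


def wait_for_restore(name, get_phase=restore_phase, timeout_s=3600.0,
                     interval_s=15.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the restore reaches a terminal phase or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        phase = get_phase(name)
        if phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"restore {name} still in phase {phase!r}")
        sleep(interval_s)


def run_drill(backup: str, restore_name: str, smoke_test) -> bool:
    """Create a restore, wait for it, then run caller-supplied smoke tests."""
    subprocess.run(
        ["velero", "restore", "create", restore_name, "--from-backup", backup],
        check=True,
    )
    # Only a fully Completed restore followed by passing smoke tests counts.
    return wait_for_restore(restore_name) == "Completed" and bool(smoke_test())
```

Injecting `get_phase`, `sleep`, and `clock` keeps the polling logic testable without a cluster, and treating `PartiallyFailed` as a drill failure avoids recording a restore with missing resources as a pass.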
Measure bootstrap time in every drill: run flux get all -A and note when the last HelmRelease transitions to Ready=True. If that time exceeds 15 minutes, your platform Helm charts are too heavy or your image pull is too slow. Pre-pull critical images into a regional ECR/Artifact Registry cache. Many teams also maintain a "warm DR cluster" in a hibernated node group to eliminate the cold-start penalty entirely.
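The convergence time can also be computed rather than eyeballed. This sketch assumes you capture the output of `kubectl get helmreleases -A -o json` and the time at which bootstrap started, then reads each HelmRelease's `Ready` condition and its `lastTransitionTime`. The function names and the 15-minute budget constant are illustrative; the timestamp format assumed is the second-precision UTC form Kubernetes uses for condition timestamps.

```python
from datetime import datetime, timezone

CONVERGENCE_BUDGET_S = 15 * 60  # the 15-minute target discussed above


def _parse_k8s_time(ts: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2024-05-01T12:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def convergence_seconds(helmreleases: dict, bootstrap_started: datetime):
    """Seconds from bootstrap start until the last HelmRelease became Ready.

    Returns None if there are no HelmReleases or any is not yet Ready=True.
    """
    latest = None
    for item in helmreleases.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            return None  # still converging
        became_ready = _parse_k8s_time(ready["lastTransitionTime"])
        if latest is None or became_ready > latest:
            latest = became_ready
    if latest is None:
        return None
    return (latest - bootstrap_started).total_seconds()


def within_budget(seconds, budget_s: float = CONVERGENCE_BUDGET_S) -> bool:
    """True only when convergence finished and stayed inside the budget."""
    return seconds is not None and seconds <= budget_s
```

Recording this number for every drill turns a vague sense that the rebuild is slow into a trend you can alert on.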