Disaster Recovery & Multi-Region

Kubernetes & GitOps in DR

When a region goes dark, the question is not whether your cluster can self-heal — Kubernetes has been doing that for years. The question is whether you can recreate the entire control plane, all its Operators, all its namespaced workloads, all its stateful data, and all its network policy in a different region inside your RTO window, with confidence that the result is identical to what just failed. That question is answered before the incident, not during it — by the quality of your GitOps practice and your backup strategy for stateful workloads.

GitOps is not optional for DR — it is the DR plan for the cluster layer. If your cluster configuration lives only in kubectl apply history or Helm release state stored in-cluster, you cannot reliably recreate it elsewhere. The Git repository is the single source of truth; the cluster is a runtime reflection of that truth. Treat the repo as infrastructure, not documentation.

Rebuilding a Cluster from Git

A well-structured GitOps repository contains everything needed to bootstrap a cluster from zero: Flux or ArgoCD bootstrap manifests, Helm releases for platform components (cert-manager, external-dns, ingress-nginx, Prometheus stack, Velero), Kustomize overlays per environment, and all application workloads. The recovery procedure is deterministic and testable.

In organizations that practice GitOps rigorously, the repo is the authoritative cluster definition. Drift between Git state and live state is reverted by the reconciler on its next sync, and drift that cannot be reconciled is treated as a production incident. This discipline is what makes DR viable: you can provision a blank cluster in the DR region and let the GitOps controller converge it to the desired state without any manual steps beyond bootstrapping the controller itself.

# Bootstrap Flux v2 on a new DR cluster from an existing GitOps repo
# Pre-requisites: KUBECONFIG points to the NEW empty cluster

export GITHUB_TOKEN=ghp_...        # fine-grained PAT, repo read+write
export GITHUB_USER=my-org
export FLUX_REPO=cluster-gitops

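# 1. Install Flux controllers
# --path: reuse the primary's path, or point at a dedicated DR path such as
# clusters/dr-us-west-2. Comments must stay off the continuation lines:
# anything after a trailing "\" breaks the command.
# --personal is omitted because the owner here is an organization; add it
# only when --owner is a personal GitHub account.
flux bootstrap github \
  --owner="${GITHUB_USER}" \
  --repository="${FLUX_REPO}" \
  --branch=main \
  --path=clusters/prod-us-east-1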

# Flux creates the flux-system namespace, installs source-controller,
# kustomize-controller, helm-controller, notification-controller, and
# commits their manifests back to the repo. From this point, every
# Kustomization and HelmRelease in that path will be applied automatically.

# 2. Monitor reconciliation — watch the cluster converge
flux get all -A --watch

# 3. Check specific kustomizations for errors
flux get kustomizations -A
flux get helmreleases -A

# Expected: within 5-10 min all Helm releases are "Ready True"
# and all application pods are Running.
# If a HelmRelease fails, inspect:
flux logs --kind=HelmRelease --name=my-app --namespace=default

Two details matter at production scale. First, the path in the repo should encode the cluster identity (clusters/prod-us-east-1) so that one repo can hold multiple cluster definitions without collision. Second, secrets must not be in Git in plaintext: use Sealed Secrets or SOPS with age/GPG. In a DR scenario, the sealed-secrets controller key or the SOPS key must be available in the new cluster before Flux can decrypt application secrets. Keep that key in a cloud secret store encrypted with your cloud KMS (for example AWS Secrets Manager or GCP Secret Manager), or have SOPS encrypt against a KMS key directly, and have the bootstrap process retrieve it via an IAM role. Make sure the key is replicated to, or reachable from, the DR region. It must never live only in a human's head or a spreadsheet.
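The key-retrieval step can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the SOPS age private key is stored as a plain string in AWS Secrets Manager under the hypothetical secret ID dr/flux/sops-age-key, and that the bootstrap role is allowed to read it. The Kubernetes secret name sops-age and the key age.agekey follow a common Flux convention; whatever names you choose must match the decryption secretRef in your Kustomizations.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical secret ID; replace with wherever your team stores the key.
SOPS_KEY_SECRET_ID="${SOPS_KEY_SECRET_ID:-dr/flux/sops-age-key}"

# Return 0 if stdin contains a line that looks like an age private key.
looks_like_age_key() {
  grep -q '^AGE-SECRET-KEY-1'
}

# Fetch the key and load it into flux-system BEFORE running flux bootstrap,
# so Flux can decrypt secrets on its first reconciliation.
install_sops_key() {
  local key
  key="$(aws secretsmanager get-secret-value \
    --secret-id "${SOPS_KEY_SECRET_ID}" \
    --query SecretString --output text)"

  if ! printf '%s\n' "${key}" | looks_like_age_key; then
    echo "retrieved secret does not look like an age key" >&2
    return 1
  fi

  # Idempotent namespace and secret creation (safe to re-run).
  kubectl create namespace flux-system --dry-run=client -o yaml \
    | kubectl apply -f -
  printf '%s\n' "${key}" | kubectl create secret generic sops-age \
    --namespace=flux-system \
    --from-file=age.agekey=/dev/stdin \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Call install_sops_key once, with KUBECONFIG pointing at the new DR cluster, immediately before flux bootstrap. The format check is deliberately shallow: it catches an empty or wrong secret early, not a stale key.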

Test your bootstrap procedure monthly in a throwaway cluster. Provision a kind cluster or a real cloud cluster with Terraform, bootstrap it with Flux, check it for convergence, then destroy it. The script that does this is your DR runbook for the cluster layer. If it takes more than 15 minutes to converge, identify the bottleneck: it is usually a slow Helm chart pull or an image pull from a registry that requires credentials not yet present.
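A drill along those lines might look like the sketch below. It assumes kind, flux, and kubectl are installed, that GITHUB_TOKEN, GITHUB_USER, and FLUX_REPO are exported as in the bootstrap example, and that the repo has a dedicated drill path (clusters/dr-drill is a hypothetical name). The cluster name and the 900-second budget are also illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

DRILL_CLUSTER="${DRILL_CLUSTER:-dr-drill}"   # hypothetical kind cluster name
BUDGET_SECONDS="${BUDGET_SECONDS:-900}"      # 15-minute convergence budget

# Return 0 if (end - start) is within the budget; all values in seconds.
within_budget() {
  local start="$1" end="$2" budget="$3"
  [ $(( end - start )) -le "${budget}" ]
}

run_drill() {
  kind create cluster --name "${DRILL_CLUSTER}"
  # Always destroy the throwaway cluster, even if a later step fails.
  trap 'kind delete cluster --name "${DRILL_CLUSTER}"' EXIT

  local start end
  start="$(date +%s)"

  flux bootstrap github \
    --owner="${GITHUB_USER}" \
    --repository="${FLUX_REPO}" \
    --branch=main \
    --path=clusters/dr-drill

  # Wait for Kustomizations first: HelmReleases only exist once the
  # Kustomizations that define them have been applied. kubectl wait
  # errors out if no matching resources exist yet.
  kubectl wait kustomizations.kustomize.toolkit.fluxcd.io --all -A \
    --for=condition=Ready --timeout="${BUDGET_SECONDS}s"
  kubectl wait helmreleases.helm.toolkit.fluxcd.io --all -A \
    --for=condition=Ready --timeout="${BUDGET_SECONDS}s"

  end="$(date +%s)"
  echo "converged in $(( end - start ))s"
  within_budget "${start}" "${end}" "${BUDGET_SECONDS}"
}
```

Running run_drill from a scheduled CI job gives a monthly pass/fail signal plus a convergence time you can trend. Nested Kustomizations that create further Kustomizations may need the wait repeated.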

Velero: Backup and Restore for Kubernetes Resources and Volumes

Velero (formerly Heptio Ark) is the de facto standard for Kubernetes backup and restore. It serializes cluster objects (Deployments, Services, ConfigMaps, PVCs, CRDs, RBAC) to object storage (S3, GCS, Azure Blob), and optionally backs up persistent volumes via CSI snapshots, provider volume plugins, or file-system backup. On clusters with hundreds of namespaces, Velero's schedules and include/exclude filters are what keep backup size and restore time manageable.

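# Install Velero with an AWS S3 backend. Check the plugin's compatibility
# matrix so the plugin version matches your Velero version.
# CSI snapshot support does not come from the AWS plugin: it is enabled
# separately with --features=EnableCSI.
# Replace the bucket, region, and credentials file for your environment.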

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --use-node-agent \
  --default-volumes-to-fs-backup \
  --secret-file ./credentials-velero   # AWS creds for the Velero SA

# credentials-velero format:
# [default]
# aws_access_key_id=AKIA...
# aws_secret_access_key=...

# Create a scheduled backup of everything (daily, 30-day retention)
velero schedule create full-cluster \
  --schedule="0 2 * * *" \
  --ttl 720h \
  --include-namespaces "*"

# Per-namespace schedule for critical workloads (hourly, 48-hour retention)
velero schedule create payments-hourly \
  --schedule="0 * * * *" \
  --ttl 48h \
  --include-namespaces payments,fraud-detection

# Trigger an on-demand backup before a risky migration
velero backup create pre-migration-$(date +%Y%m%d-%H%M) \
  --include-namespaces payments \
  --wait

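# Restore into a NEW cluster (cross-region DR). Velero in the DR cluster must
# point at the same bucket or its cross-region replica.
# Name the restore explicitly so it is easy to reference afterwards;
# otherwise Velero generates <backup-name>-<timestamp>.
velero restore create pre-migration-restore-1 \
  --from-backup pre-migration-20250610-1400 \
  --namespace-mappings payments:payments-restored \
  --wait

# Check restore status
velero restore describe pre-migration-restore-1 --details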

In large production clusters, a common pattern is to run Velero backups against namespaces grouped by criticality tier. Tier-1 namespaces (payments, auth, core APIs) back up hourly; Tier-2 (analytics, batch jobs) back up daily. The restore SLO is verified quarterly via actual restore drills into a staging cluster, not by trusting that the backup file exists.
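The tiering can be expressed as a small lookup that drives schedule creation. This is a sketch only: the tier names tier1 and tier2 and the schedule name suffix are hypothetical, and the cron and TTL values simply mirror the schedules shown above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map a criticality tier to "cron|ttl".
tier_policy() {
  case "$1" in
    tier1) echo "0 * * * *|48h" ;;    # hourly, 48-hour retention
    tier2) echo "0 2 * * *|720h" ;;   # daily at 02:00, 30-day retention
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Create one Velero schedule for a tier and a comma-separated namespace list.
create_tier_schedule() {
  local tier="$1" namespaces="$2" policy cron ttl
  policy="$(tier_policy "${tier}")"
  cron="${policy%%|*}"   # text before the "|"
  ttl="${policy##*|}"    # text after the "|"
  velero schedule create "${tier}-backup" \
    --schedule="${cron}" \
    --ttl "${ttl}" \
    --include-namespaces "${namespaces}"
}
```

For example, create_tier_schedule tier1 payments,auth would create an hourly schedule named tier1-backup. Keeping the policy in one function means a retention change is a one-line diff that can be reviewed in Git.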

GitOps + Velero DR architecture: Git drives cluster configuration convergence; Velero restores stateful data from cross-region S3. Both paths must succeed within the RTO window.

Stateful Recovery: The Hard Part

Stateless Kubernetes workloads are trivially recoverable via GitOps: pod spec, ConfigMap, Service, HPA are all in Git. The challenge is stateful workloads: databases, message queues, and any volume holding data that the data plane does not already replicate to the DR region.

The industry has three patterns for this, and which you choose depends on your RPO:

1. Active-passive with Velero volume backup. Velero backs up PVCs hourly to S3 with cross-region replication. On failover, Velero restores the latest backup into the DR cluster. RPO is tied to backup frequency, typically 1 hour. This works well for workloads where an hour of data loss is acceptable (batch processing, analytics, non-financial state). Restore time for a 500 GB volume via --default-volumes-to-fs-backup (file-system backup via restic/kopia) is roughly 20-40 minutes depending on throughput.

2. Application-level replication with Velero for config-only restore. Postgres streaming replication, MySQL GTID replication, or Kafka MirrorMaker 2 keeps data continuously synchronized to the DR region. Velero only backs up Kubernetes objects (the Deployment, Service, Secret, ConfigMap), not the volume. On failover, restore the Kubernetes objects from Velero, then promote the already-replicated standby. RPO approaches zero, bounded by replication lag; restore time is minutes. This is the usual choice for financial data.

3. Operator-managed HA (Patroni, Vitess, CockroachDB). The database layer handles replication, quorum, and leader election across regions natively. No Velero volume backup is needed because the nearest healthy replica is promoted automatically. Velero still backs up the CRDs and operator state so that the operator configuration itself can be recreated in a cold-start scenario.
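For the first pattern, it helps to check the arithmetic before committing to an RTO. The sketch below estimates volume restore time from size and sustained throughput, then checks it against an RTO budget. It assumes GitOps convergence and volume restore happen one after the other, which is pessimistic, and that throughput is constant. By this arithmetic, the 20-40 minute figure for 500 GB corresponds to roughly 210-430 MiB/s.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Estimated minutes to restore a volume, rounded up.
# size_gb: volume size in GiB; mbps: sustained restore throughput in MiB/s.
estimate_restore_minutes() {
  local size_gb="$1" mbps="$2"
  local seconds=$(( size_gb * 1024 / mbps ))
  echo $(( (seconds + 59) / 60 ))
}

# Return 0 if convergence plus volume restore fits inside the RTO (minutes).
fits_rto() {
  local converge_min="$1" restore_min="$2" rto_min="$3"
  [ $(( converge_min + restore_min )) -le "${rto_min}" ]
}
```

For instance, estimate_restore_minutes 500 300 prints 29, and fits_rto 15 29 60 succeeds: 15 minutes of convergence plus 29 minutes of restore fits a one-hour RTO. If the check fails for your largest volume, that is a signal to move that workload to the second or third pattern.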

Velero file-system backup (restic/kopia) does not guarantee consistency for running databases. If you snapshot a Postgres PVC while Postgres is writing, you may get a torn page. For databases, always use a pre-backup hook to quiesce writes first, or rely on application-level replication and only use Velero for the Kubernetes object layer. Velero supports backup.velero.io/backup-volumes-excludes and pre/post hooks via pod annotations — use them.

# Velero pre/post backup hooks for Postgres
# Add these annotations to the pod template (spec.template.metadata) of the
# Postgres Deployment/StatefulSet.

# Pre-backup hook: CHECKPOINT flushes dirty pages to disk but does NOT block
# writes, so the result is crash-consistent at best. For a true quiesce,
# freeze the filesystem or rely on replication instead.
metadata:
  annotations:
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -U postgres -c \"CHECKPOINT;\""]'
    pre.hook.backup.velero.io/timeout: "60s"
    post.hook.backup.velero.io/command: '["/bin/bash", "-c", "echo backup done"]'
    post.hook.backup.velero.io/timeout: "30s"
    # Exclude the WAL volume only if WAL is archived or replicated elsewhere:
    # a data volume restored without its matching WAL may not start cleanly.
    backup.velero.io/backup-volumes-excludes: pg-wal

---
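# For a StatefulSet using CSI snapshots (faster, storage-level):
# 1. Install the snapshot CRDs (all three are required). The snapshot-controller
#    from the same external-snapshotter release must also be running.
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v7.0.2/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v7.0.2/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v7.0.2/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml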

# 2. Create a VolumeSnapshotClass matching your CSI driver
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-ebs-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Retain

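# 3. With --features=EnableCSI on the Velero server, and without
#    --default-volumes-to-fs-backup, Velero uses CSI snapshots for PVCs
#    provisioned by that CSI driver.
# CSI snapshots are point-in-time and crash-consistent, not application-consistent.
# They are usually preferable to file-system backup for databases.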

Validating DR Readiness

A GitOps repo and Velero schedules are necessary but not sufficient. The only thing that validates DR readiness is a successful restore. Netflix's "Chaos Kong" exercises simulate the loss of an entire AWS Region (its sibling, Chaos Gorilla, takes out a single Availability Zone). At Google, DiRT tests cover complete region-level failures. For most organizations, a monthly restore drill into a throwaway cluster is the minimum acceptable practice: verify that all Deployments converge, all PVCs restore, and application health checks pass. Automate it with a script that runs velero restore create, polls until completion, runs smoke tests, then destroys the cluster. Store the pass/fail result in your incident management system as evidence of DR readiness.
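The polling step of such a drill could be sketched as below. It assumes Velero runs in the default velero namespace and reads the restore phase from the Restore custom resource with kubectl. The 15-second poll interval, the one-hour timeout, and the restore name used in the usage note are illustrative choices.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map a Velero restore phase to pass, fail, or wait.
classify_phase() {
  case "$1" in
    Completed) echo pass ;;
    PartiallyFailed|Failed|FailedValidation) echo fail ;;
    *) echo wait ;;   # New, InProgress, or not yet populated
  esac
}

# Start a restore and poll until it passes, fails, or times out.
restore_and_wait() {
  local backup="$1" restore="$2" timeout_s="${3:-3600}"
  local waited=0 phase
  velero restore create "${restore}" --from-backup "${backup}"
  while [ "${waited}" -lt "${timeout_s}" ]; do
    phase="$(kubectl -n velero get restores.velero.io "${restore}" \
      -o jsonpath='{.status.phase}')"
    case "$(classify_phase "${phase}")" in
      pass) echo pass; return 0 ;;
      fail) echo fail; return 1 ;;
    esac
    sleep 15
    waited=$(( waited + 15 ))
  done
  echo fail   # timed out; counts as a failed drill
  return 1
}
```

A drill would then call restore_and_wait pre-migration-20250610-1400 drill-restore-1, run its smoke tests on success, and record the result. Treating PartiallyFailed as a failure is a deliberately strict choice: a drill that restores most objects is not evidence that the restore SLO can be met.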

GitOps convergence time is your cluster-layer RTO contribution. Measure it. After bootstrapping Flux on a blank cluster, run flux get all -A and note when the last HelmRelease transitions to Ready=True. If that time exceeds 15 minutes, your platform Helm charts are too heavy or your image pull is too slow. Pre-pull critical images into a regional ECR/Artifact Registry cache. Many teams go further and maintain a "warm DR cluster" whose node groups are scaled down until needed, which removes most of the cold-start penalty.
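One way to find the release that gates convergence is to list each HelmRelease's Ready-condition transition time and take the latest. The sketch below assumes a freshly bootstrapped cluster, where lastTransitionTime reflects the first time each release became Ready. It does not check that the condition status is True, so confirm that with flux get helmreleases -A first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read "timestamp namespace/name" lines on stdin and print the line with the
# latest timestamp. RFC 3339 UTC timestamps sort correctly as plain text.
last_to_become_ready() {
  LC_ALL=C sort | tail -n 1
}

# Emit one line per HelmRelease: Ready-condition transition time, then name.
list_ready_transitions() {
  kubectl get helmreleases.helm.toolkit.fluxcd.io -A \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{" "}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
}
```

Running list_ready_transitions | last_to_become_ready prints the slowest release and the moment it became Ready. Comparing that timestamp with the bootstrap start time gives the convergence figure to track across drills.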