GitOps with ArgoCD & Flux

Drift, Rollbacks & Disaster Recovery

The most underestimated promise of GitOps is not faster deployments — it is the ability to answer two questions that terrify on-call engineers: "How did the cluster get into this state?" and "How do we get it back to a known good state fast?" This lesson is the operational core of GitOps: drift detection, Git-revert rollbacks, and rebuilding an entire cluster from Git after a catastrophic failure.

What Is Configuration Drift?

Drift is any divergence between the desired state declared in Git and the actual live state of a cluster. Drift happens constantly in real organizations, and its causes fall into a predictable set:

Manual kubectl edits: an engineer patches a Deployment directly during an incident. Fast in the moment, invisible in Git.
Admission controller mutations: a Mutating Admission Webhook injects sidecars, resource limits, or labels at admission time. The cluster holds more than Git declared.
Autoscalers: HPA or KEDA changes replica counts. This is intentional drift and must be explicitly excluded from reconciliation.
Helm hook side effects: Jobs and ConfigMaps created by Helm hooks may persist beyond their intended lifetime and diverge from the chart.
CRD version skew: a cluster or operator upgrade promotes a new CRD version, leaving existing resources on a deprecated API version.
Expired or rotated Secrets: an external system pushes a rotated credential directly into a Kubernetes Secret without updating the GitOps repo.

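Drift is not always accidental. Intentional live mutations, such as HPA-managed replicas or node-level kernel parameters set by a DaemonSet, must be explicitly excluded from reconciliation. In ArgoCD, use ignoreDifferences rules. In Flux, leave the field out of the manifests in Git (for example, omit spec.replicas when an HPA owns it), annotate the object with kustomize.toolkit.fluxcd.io/reconcile: disabled, or use spec.driftDetection.ignore on a HelmRelease. Failing to do this causes perpetual "OutOfSync" noise that masks real drift.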
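For Helm-managed workloads in Flux, the exclusion can be expressed in the HelmRelease drift-detection block. The following is a minimal sketch, assuming a Flux version that serves the helm.toolkit.fluxcd.io/v2 API. The chart name, HelmRepository name, and output file name are hypothetical.

```shell
# Write a hypothetical HelmRelease that enables drift detection but ignores
# the replica count, so HPA scaling is neither reported nor reverted as drift.
# Chart, HelmRepository, and file names are illustrative assumptions.
cat > helmrelease-my-api.yaml <<'EOF'
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-api
  namespace: production
spec:
  interval: 10m
  chart:
    spec:
      chart: my-api
      sourceRef:
        kind: HelmRepository
        name: myorg-charts
        namespace: flux-system
  driftDetection:
    mode: enabled              # correct drift in everything not ignored below
    ignore:
      - paths: ["/spec/replicas"]
        target:
          kind: Deployment
EOF
```

Committing a manifest like this to the GitOps repo keeps the exclusion itself under version control, which is the same idea as the ArgoCD ignoreDifferences approach.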

Detecting Drift: ArgoCD

ArgoCD computes drift by comparing the live Kubernetes resource manifests (fetched from the API server) against the rendered manifests from Git (after Helm or Kustomize templating). The result is an Application sync status of either Synced or OutOfSync.

Check drift on the CLI:

# View all apps and their sync status
argocd app list

# Inspect the diff for a specific app (what Git has vs. what the cluster has)
argocd app diff my-api

# Bypass the manifest cache and re-render from Git before diffing
argocd app diff my-api --hard-refresh

# Example output:
# ===== apps/Deployment my-api/api-service ======
# 104c104
# <   replicas: 3
# ---
# >   replicas: 7
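The diff command can also drive a scripted drift check. The sketch below assumes a recent argocd CLI in which `argocd app diff` exits 0 when there is no diff, 1 when a diff is found, and a higher code on errors. `drift_verdict` and `check_drift` are hypothetical helper names.

```shell
# Translate an `argocd app diff` exit code into a verdict.
# Assumes a recent CLI: 0 = no diff, 1 = diff found, anything else = error.
drift_verdict() {
  case "$1" in
    0) echo "in-sync" ;;
    1) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run the diff for one app and print "<app>: <verdict>".
# Requires a logged-in argocd CLI.
check_drift() {
  local app="$1" rc=0
  argocd app diff "$app" >/dev/null 2>&1 || rc=$?
  echo "$app: $(drift_verdict "$rc")"
}

# Usage: check_drift my-api
```

Separating the exit-code mapping from the CLI call keeps the verdict logic easy to check without a cluster.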

To suppress intentional drift (HPA-managed replicas) so it does not pollute the sync status, add an ignoreDifferences block to the Application manifest:

# argocd-app.yaml — suppress HPA-managed replica drift
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-api
  namespace: argocd
spec:
  project: default                 # required field; "default" is an assumed project name
  source:
    repoURL: https://github.com/myorg/gitops-config
    targetRevision: main
    path: apps/overlays/production/my-api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas           # managed by HPA — ignore replica count drift
    - group: ""
      kind: Secret
      jsonPointers:
        - /data                    # externally rotated secrets — ignore data diff
  syncPolicy:
    automated:
      prune: true
      selfHeal: true              # auto-correct un-ignored drift
    syncOptions:
      - RespectIgnoreDifferences=true

Detecting Drift: Flux

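Flux surfaces reconciliation state through its Kustomization and HelmRelease objects. The kustomize-controller re-applies the Git state on every interval, so drift in fields it manages is corrected rather than reported as a separate sync status. A Ready condition of False (for example, with reason ReconciliationFailed) means Flux could not bring the cluster back to the Git state. To preview drift without applying anything, run flux diff kustomization against a local checkout of the repo. For HelmRelease objects, drift detection is opt-in via spec.driftDetection.mode.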

# Check Flux kustomization status
flux get kustomizations --all-namespaces

# Detailed status including the latest condition message
# (flux get has no YAML output option, so read the object with kubectl)
kubectl get kustomization my-api -n flux-system -o yaml | grep -A10 conditions

# Force an immediate reconciliation cycle
flux reconcile kustomization my-api --with-source

# Suspend reconciliation to safely perform manual emergency ops
flux suspend kustomization my-api

# Resume reconciliation after emergency is resolved
flux resume kustomization my-api
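A suspended Kustomization that nobody resumes is drift waiting to happen. One way to guard against that is to wrap the emergency command so that resume always follows it. This is a sketch only: `with_flux_suspended` is a hypothetical helper, not a Flux command.

```shell
# Run a command while a Flux Kustomization is suspended, then resume it
# whether or not the command succeeded. Returns the command's exit code.
with_flux_suspended() {
  local ks="$1" rc=0
  shift
  flux suspend kustomization "$ks" || return 1
  "$@" || rc=$?
  flux resume kustomization "$ks"
  return "$rc"
}

# Usage:
# with_flux_suspended my-api kubectl -n production rollout restart deployment/api-service
```

The helper does not cover the case where the shell itself is killed mid-operation, so a periodic check for suspended objects is still worthwhile.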

The drift detection and reconciliation lifecycle: the agent continuously compares Git state against live state and self-heals any un-ignored divergence.

Git-Revert Rollbacks: The Right Way

In GitOps, a rollback is not an argocd app rollback command (though that exists). The canonical rollback is a Git revert — because any durable change to the cluster must exist as a commit in Git. A rollback that bypasses Git is just more drift.

The standard rollback workflow:

# 1. Identify the bad commit (the one that introduced the broken change)
git log --oneline --graph origin/main

# Example output:
# * f3a9c21 (HEAD, origin/main) chore: bump api-service to v2.3.1
# * 8d4b10e feat: increase replicas to 5 in production overlay
# * c1e77f3 fix: update HPA maxReplicas to 20
# * a9f2b8c feat: bump api-service to v2.3.0

# 2. Revert the bad commit — creates a NEW commit, preserves history
git revert f3a9c21 --no-edit
git push origin main

# 3. Force an immediate sync (instead of waiting for the poll interval)
# ArgoCD:
argocd app sync my-api

# Flux:
flux reconcile kustomization my-api --with-source

# 4. Monitor the rollout
kubectl rollout status deployment/api-service -n production
kubectl rollout history deployment/api-service -n production
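To see why a revert preserves the audit trail, the workflow can be rehearsed in a throwaway repository. The sketch below uses only git; the file name, tags, and commit identity are illustrative assumptions.

```shell
# Create a scratch repo with a good commit followed by a bad one.
repo="$(mktemp -d)"
git -C "$repo" init -q
git -C "$repo" config user.email "dev@example.com"
git -C "$repo" config user.name "Example Dev"
git -C "$repo" config commit.gpgsign false

printf 'newTag: v2.3.0\n' > "$repo/kustomization.yaml"
git -C "$repo" add kustomization.yaml
git -C "$repo" commit -q -m "feat: bump api-service to v2.3.0"

printf 'newTag: v2.3.1\n' > "$repo/kustomization.yaml"
git -C "$repo" commit -q -am "chore: bump api-service to v2.3.1"
bad_commit="$(git -C "$repo" rev-parse HEAD)"

# Revert the bad commit: this adds a third commit instead of rewriting history.
git -C "$repo" revert --no-edit "$bad_commit" >/dev/null

cat "$repo/kustomization.yaml"      # prints: newTag: v2.3.0
git -C "$repo" log --oneline        # three commits: both bumps plus the revert
```

The file content is back to the known good tag, while the log still records both the bad change and the rollback.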

Never use git reset --hard plus a force push to roll back on main. Force-pushing rewrites history: it breaks in-flight PRs, can leave the GitOps agent tracking a commit that no longer exists on the branch, and destroys the audit trail. Always use git revert. The extra revert commit is the audit trail: it shows exactly when the rollback happened and who triggered it. Many organizations block force pushes to production config branches with branch protection rules at the repository level.

For multi-service rollbacks where a release touches several services, keep the affected image tags (or Flux HelmRelease chart versions) pinned in one repo so that a single coordinated revert commit rolls all of them back:

# Multi-service coordinated rollback via Kustomize image patch revert
# In your overlay kustomization.yaml, revert the image tags for all affected services:

# Preview what the revert will change (diff from HEAD back to its parent):
# git diff HEAD HEAD~1 -- apps/overlays/production/kustomization.yaml
#    - name: ghcr.io/myorg/api-service
# -    newTag: v2.3.1
# +    newTag: v2.3.0
#    - name: ghcr.io/myorg/worker-service
# -    newTag: v1.9.0
# +    newTag: v1.8.5

# One git revert commit rolls back both services together
git revert HEAD --no-edit
git push origin main

# Watch ArgoCD pick it up and sync both apps simultaneously
argocd app list --output wide

ArgoCD Application Rollback (Break-Glass)

ArgoCD records each sync in the Application's history (up to spec.revisionHistoryLimit entries, default 10). In a true emergency where you need to roll back faster than a Git round trip (revert, review, merge) allows, you can roll back to a history entry. ArgoCD still needs the manifests for that revision, so do not rely on this path when the Git repository is unreachable:

# List revision history for an app
argocd app history my-api

# ID  DATE                           REVISION
# 0   2025-11-01 03:12:11 +0000 UTC  a9f2b8c
# 1   2025-11-02 11:44:05 +0000 UTC  8d4b10e
# 2   2025-11-03 14:22:33 +0000 UTC  f3a9c21  <-- broken

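# Disable auto-sync first: ArgoCD rejects a rollback while automated sync is enabled
argocd app set my-api --sync-policy none

# Roll back to history ID 1 (bypasses Git for speed, break-glass only)
argocd app rollback my-api 1

# CRITICAL: immediately follow up with a git revert so Git matches the cluster again
# Otherwise ArgoCD will re-sync to the broken commit as soon as auto-sync is re-enabled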

argocd app rollback creates intentional drift. The cluster is now running the state from revision 1, but Git still has the broken commit as HEAD, so re-enabling auto-sync would re-apply the broken state within minutes. ArgoCD also refuses to roll back an Application while automated sync is enabled. The safe emergency protocol is therefore: disable auto-sync (argocd app set my-api --sync-policy none), run the rollback, push a Git revert to align the repo, and only then re-enable auto-sync.
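The break-glass sequence is easy to get wrong under pressure, so it can help to script the ordering. The following is a sketch under the assumption that ArgoCD rejects rollbacks while automated sync is enabled; `break_glass_rollback` is a hypothetical helper name.

```shell
# Break-glass rollback: disable auto-sync, then roll back to a history ID.
# Stops at the first failing step so a rollback is never attempted against
# an app that still has automated sync enabled.
break_glass_rollback() {
  local app="$1" history_id="$2"
  argocd app set "$app" --sync-policy none || return 1
  argocd app rollback "$app" "$history_id" || return 1
  echo "Rolled back $app to history ID $history_id." >&2
  echo "Next: push a git revert, then re-enable auto-sync." >&2
}

# Usage: break_glass_rollback my-api 1
# After the git revert has landed on main:
# argocd app set my-api --sync-policy automated --self-heal --auto-prune
```

The helper deliberately does not re-enable auto-sync, because that step is only safe once Git matches the cluster again.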

Disaster Recovery: Rebuilding a Cluster from Git

The ultimate GitOps promise: a cluster is destroyed (bad cloud region, catastrophic upgrade, ransomware). You have Git and a fresh cluster. Recovery time is determined by how well you designed your GitOps repo, not by who remembers the kubectl commands they ran six months ago.

A production-grade DR procedure using ArgoCD:

# --- STEP 1: Bootstrap ArgoCD onto the new cluster ---
# Install ArgoCD into the fresh cluster
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for ArgoCD to be ready
kubectl wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s

# --- STEP 2: Restore ArgoCD secrets (repository credentials, cluster credentials)
# These are stored outside Git (e.g., Vault, AWS Secrets Manager, sealed-secrets backup)
# Example: restore from a Vault-backed External Secrets Operator
kubectl apply -f bootstrap/argocd-repo-secret.yaml   # pre-created from Vault

# --- STEP 3: Apply the root App-of-Apps manifest
# This single manifest tells ArgoCD about all other Applications
kubectl apply -f clusters/production/argocd-apps.yaml -n argocd

# --- STEP 4: Watch ArgoCD reconcile the entire cluster from Git
# (requires `argocd login` against the new cluster's ArgoCD first)
argocd app list -o name | xargs argocd app wait --health --timeout 600

# ArgoCD will now:
# 1. Read every Application declared in argocd-apps.yaml
# 2. Render each app's Helm/Kustomize manifests from Git
# 3. Apply them to the cluster, ordered by sync wave where annotated
# 4. Report health for each app

# --- STEP 5: Verify cluster state matches Git
argocd app list
# NAME              CLUSTER     NAMESPACE   SYNC-STATUS  HEALTH-STATUS
# my-api            in-cluster  production  Synced       Healthy
# payment-service   in-cluster  production  Synced       Healthy
# ...

The equivalent for Flux-managed clusters:

# --- Flux DR: bootstrap onto a fresh cluster ---

# Step 1: Install the Flux CLI
curl -s https://fluxcd.io/install.sh | sudo bash

# Step 2: Bootstrap Flux, pointing it at the same GitOps repo
# With --token-auth, export a GitHub personal access token as GITHUB_TOKEN first
# Flux installs itself and creates its own GitRepository + Kustomization objects
flux bootstrap github \
  --owner=myorg \
  --repository=gitops-config \
  --branch=main \
  --path=clusters/production \
  --personal=false \
  --token-auth

# Flux will now self-install, then pull the clusters/production path from Git,
# which in turn references all app Kustomizations & HelmReleases.
# The entire cluster rebuilds automatically from Git state.

# Step 3: Monitor progress
flux get all --all-namespaces

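# Step 4: Restore secrets from Vault / AWS SM / SOPS backup (see lesson 8)
# If using SOPS with age: recreate the age private key Secret in flux-system;
#   Flux cannot decrypt anything in Git until that Secret exists
# If using SOPS with AWS KMS: access comes from the controller's IAM role, so there is no key to restore
# If using ESO (External Secrets Operator): it bootstraps via its own HelmRelease in Git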
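For the SOPS with age case, the decryption key has to be put back by hand or by automation outside Git. The sketch below assumes the Kustomizations in Git reference a Secret named sops-age under spec.decryption.secretRef. The helper name and key file path are hypothetical.

```shell
# Recreate the SOPS age key Secret that the kustomize-controller uses for
# decryption. The Secret key must end in ".agekey" for Flux to pick it up.
restore_sops_age_key() {
  local key_file="$1"
  if [ ! -r "$key_file" ]; then
    echo "age key not readable: $key_file" >&2
    return 1
  fi
  kubectl create secret generic sops-age \
    --namespace=flux-system \
    --from-file=age.agekey="$key_file"
}

# Usage (key fetched from your offline backup or secret store):
# restore_sops_age_key ./age.agekey
```

Until this Secret exists, Kustomizations that contain encrypted files will fail to reconcile, so this step sits on the critical path of the RTO.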

RTO (Recovery Time Objective) in GitOps DR depends on three factors:

Secrets bootstrap time: if secrets require manual intervention to restore, your RTO is hours. Automate this with a Vault+ESO pattern or SOPS with a KMS-backed key.
Image pull time: pulling hundreds of container images on a fresh cluster node pool takes time. A regional ECR/GCR mirror in your DR region cuts this dramatically.
Dependency ordering: if CRDs are not applied before the resources that use them, ArgoCD and Flux reconciliation fails until the CRDs exist. Use Sync Waves (ArgoCD) or Kustomization dependsOn (Flux) to enforce ordering in Git.
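Ordering in Flux can be sketched with two Kustomization objects, where the application waits for the CRDs. The object names, paths, and output file name below are assumptions for illustration.

```shell
# Write two hypothetical Flux Kustomizations: "my-api" is reconciled only
# after "crds" reports Ready. Names and paths are illustrative.
cat > ordering-example.yaml <<'EOF'
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: crds
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/crds
  prune: false               # avoid deleting CRDs (and their resources) by accident
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-api
  namespace: flux-system
spec:
  dependsOn:
    - name: crds             # wait until the "crds" Kustomization is Ready
  interval: 10m
  path: ./apps/overlays/production/my-api
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
EOF
```

In ArgoCD, the comparable mechanism is the argocd.argoproj.io/sync-wave annotation on resources, where lower-numbered waves are synced first.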

Continuous Drift Monitoring in Production

At big-tech scale, you do not wait for an engineer to notice an OutOfSync badge in the ArgoCD UI. You export drift metrics and alert on them:

# ArgoCD exports Prometheus metrics at :8082/metrics
# Key metric: argocd_app_info{sync_status="OutOfSync"} > 0

# Example PrometheusRule to alert on persistent drift
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gitops-drift-alerts
  namespace: monitoring
spec:
  groups:
    - name: gitops.drift
      rules:
        - alert: AppOutOfSyncFor5Minutes
          expr: |
            argocd_app_info{sync_status="OutOfSync"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "ArgoCD app {{ $labels.name }} has been OutOfSync for 5 minutes"
            description: "Check for manual drift or a failing sync. Run: argocd app diff {{ $labels.name }}"
        - alert: AppDegradedHealth
          expr: |
            argocd_app_info{health_status=~"Degraded|Unknown"} == 1
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "ArgoCD app {{ $labels.name }} health is {{ $labels.health_status }}"

Production alert thresholds: alert on OutOfSync after 5 minutes, not immediately, because brief transient drift during rolling updates is normal. Alert on Degraded health after 3 minutes. Page on-call if an app is OutOfSync for more than 15 minutes: that almost certainly means a broken sync that self-heal is failing to fix, and a human needs to look at the diff.
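The 15-minute paging threshold can be expressed as one more rule in the same group. This is a sketch: the alert name and the severity label value are assumptions, and the label only pages someone if your Alertmanager routing maps it to an on-call receiver.

```shell
# Write a hypothetical extra rule for the gitops.drift group that escalates
# long-lived drift. Append it under spec.groups[0].rules in the PrometheusRule.
cat > drift-page-rule.yaml <<'EOF'
- alert: AppOutOfSyncFor15Minutes
  expr: |
    argocd_app_info{sync_status="OutOfSync"} == 1
  for: 15m
  labels:
    severity: page           # assumed label; must match your Alertmanager routes
  annotations:
    summary: "ArgoCD app {{ $labels.name }} has been OutOfSync for 15 minutes"
    description: "Self-heal is not converging. Run: argocd app diff {{ $labels.name }}"
EOF
```

Using the same expression with a longer `for` duration gives a warning first and a page only if the drift persists.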

Drift detection, clean rollbacks via Git revert, and a tested DR runbook that rebuilds from Git are the three operational pillars that separate a mature GitOps platform from a demo. The next lesson closes the tutorial with a capstone project: designing and implementing a complete GitOps delivery pipeline end-to-end.