Disaster Recovery & Multi-Region

Failover Mechanics

Failover is the operational act of redirecting traffic — or promoting a standby system — when a primary site becomes unhealthy. Done badly, it extends your outage. Done well, it is invisible to customers. This lesson dissects the three interlocking components that make failover reliable at scale: health checks, DNS, and runbooks (or the automated equivalents).

Health Checks: The Signal That Drives Everything

A failover decision is only as good as the signal that triggers it. Every layer of the stack has its own health-check mechanism, and they must be layered — no single check is sufficient in production.

Load balancer checks — ALB/NLB target health checks, HTTP 200 on a shallow endpoint (e.g. /healthz) every 10 s, 3-miss threshold. These gate per-instance traffic, not regional traffic.
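Route 53 health checks — health checkers in multiple AWS regions send requests every 30 s (standard) or 10 s (fast interval). The default failure threshold is 3 consecutive failures, and Route 53 treats the endpoint as healthy while more than 18% of checkers report it healthy. Type can be HTTP, HTTPS, or TCP; note that HTTPS checks do not validate the TLS certificate, so an expired or mismatched certificate will not fail the check. You can also chain checks: a "calculated health check" is healthy only when at least N of M child checks are healthy, which is useful for "region is healthy if at least 2 of 3 AZs are serving."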
Synthetic monitors — CloudWatch Synthetics or Datadog Browser Tests run real browser flows every minute from multiple regions. A failed canary can drive a Route 53 health check either directly (a health check that monitors a CloudWatch alarm) or through a CloudWatch alarm → EventBridge rule → Lambda that updates Route 53.
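
The consecutive-failure threshold and the N-of-M "calculated" check described above can be modelled in a few lines. This is a simplified local sketch, not Route 53's implementation: the class and function names are hypothetical, and real health checks also aggregate results across many checkers.

```python
from typing import List


class EndpointCheck:
    """Tracks one endpoint; unhealthy after `failure_threshold` consecutive misses."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # A success resets the streak; a failure extends it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def calculated_health(children: List[EndpointCheck], min_healthy: int) -> bool:
    """Healthy only when at least `min_healthy` child checks are healthy."""
    return sum(child.healthy for child in children) >= min_healthy


if __name__ == "__main__":
    azs = [EndpointCheck() for _ in range(3)]
    for _ in range(3):                 # AZ 0 misses three probes in a row
        azs[0].record(False)
    print(calculated_health(azs, min_healthy=2))   # two of three AZs still healthy
    for _ in range(3):                 # AZ 1 now fails as well
        azs[1].record(False)
    print(calculated_health(azs, min_healthy=2))   # only one AZ left
```

With a "2 of 3" rule, losing a single AZ leaves the region healthy, and losing a second one marks it unhealthy.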

Shallow vs. deep health checks: A /healthz that returns 200 because the HTTP server is alive but the DB connection pool is exhausted will mark the instance healthy while it is actually unable to serve requests. Deep checks that exercise the full request path catch this, but they add latency and can themselves become a source of load. The standard compromise: a shallow check for the load balancer (fast, cheap) and a deep check for the regional health check that drives DNS failover (slower, but consequential decisions deserve better data).
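
As an illustration of the shallow/deep split, here is a minimal sketch of the two handlers, kept framework-free. The function names and the idea of passing dependency probes as callables are assumptions made for the example, not part of any particular web framework.

```python
from typing import Callable, Dict, Tuple


def shallow_check() -> Tuple[int, str]:
    """Liveness only: proves the process can answer HTTP, nothing more."""
    return 200, "ok"


def deep_check(dependencies: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Readiness: probe every dependency; return 503 if any fails or raises."""
    failed = []
    for name, probe in dependencies.items():
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that blows up counts as a failed dependency
        if not ok:
            failed.append(name)
    if failed:
        return 503, "failing: " + ",".join(sorted(failed))
    return 200, "ok"


if __name__ == "__main__":
    # Hypothetical probes: the HTTP server is up, but the DB pool is exhausted.
    probes = {"db": lambda: False, "cache": lambda: True}
    print(shallow_check())     # the load balancer still sees a live instance
    print(deep_check(probes))  # the regional check sees the real problem
```

In practice each probe should carry a short timeout so that the deep check cannot itself hang or pile load onto an already struggling dependency.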

DNS Failover Patterns

DNS is the canonical entry point for regional failover because it sits in front of every region and is usually the last shared dependency between them. The two primary Route 53 routing policies used in DR are failover routing and weighted routing with health checks attached.

Route 53 active-passive failover: the primary region serves all traffic while healthy; when health checks fail, Route 53 automatically routes to the secondary.

Key DNS parameters engineers must know in production:

TTL — the single biggest lever. For disaster-recovery records, set TTL to 60 s (some teams go as low as 30 s). At TTL=300, resolvers and clients can keep serving the old A record for up to five minutes after Route 53 flips it, which means up to five minutes of hard outage for every request that hits a stale record. The cost of a low TTL is increased DNS query volume (and thus Route 53 per-query charges), which is negligible compared to the RTO impact.
Evaluate Target Health — for alias records pointing at ALBs, enabling this propagates ALB health into Route 53 without a separate health check resource.
Failback TTL — when the primary recovers, Route 53 will start returning its record. Set a separate, higher TTL for the secondary record so that failback drains traffic gradually rather than swinging back all at once.
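
To make the TTL trade-off concrete, the arithmetic can be sketched as follows. This is a deliberately simple model that assumes detection takes `request_interval × failure_threshold` seconds and that every resolver honours the TTL exactly; the function name is hypothetical.

```python
def worst_case_failover_seconds(request_interval: int,
                                failure_threshold: int,
                                ttl: int) -> int:
    """Upper bound on time from primary failure to the last stale answer expiring."""
    detection = request_interval * failure_threshold  # consecutive missed probes
    return detection + ttl                            # plus one full cache lifetime


if __name__ == "__main__":
    for ttl in (30, 60, 300):
        total = worst_case_failover_seconds(10, 3, ttl)
        print(f"TTL={ttl:>3}s -> worst case {total}s")
```

With a 10 s interval and a threshold of 3, a 60 s TTL gives a worst case of 90 s, while a 300 s TTL gives 330 s. Resolvers that ignore TTLs can stretch this further, which is one reason to measure propagation during drills rather than rely on the estimate.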

# Terraform: Route 53 active-passive failover with health check
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-primary.internal.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 10

  tags = { Name = "primary-region-healthcheck" }
}

resource "aws_route53_record" "api_primary" {
  zone_id = var.hosted_zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60

  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

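  health_check_id = aws_route53_health_check.primary.id
  # Assumes a static IP (e.g. an NLB Elastic IP). ALB IPs change over time;
  # for an ALB, use an alias block instead of ttl/records.
  records         = [var.primary_alb_ip]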
}

resource "aws_route53_record" "api_secondary" {
  zone_id = var.hosted_zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60

  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  # No health_check_id — secondary is always returned when primary fails
  records = [var.secondary_alb_ip]
}

Runbook-Driven Failover

Even if you plan to automate failover eventually, every team should first write — and rehearse — a runbook. Runbooks are the ground truth for what "failover" actually means in your system. They expose hidden dependencies (the Redis sentinel leader, the Kafka MirrorMaker consumer group, the Elasticsearch cross-cluster replication index) that automation would need to handle anyway.

A production-grade DR runbook has five sections:

Decision criteria — explicit conditions that trigger the runbook. Not "the site is slow" but "Route 53 health check has been in UNHEALTHY state for ≥2 minutes AND synthetic monitor from eu-west-1 confirms failure." Vague triggers cause premature or missed failovers.
Pre-flight checks — confirm the secondary is actually ready (replication lag <30 s, warm pool autoscaling in ACTIVE state, secrets rotated). Failing to check this converts one outage into two.
Ordered steps with rollback — each step has an owner, an expected outcome, and a rollback action. Example: "Promote RDS replica → verify writes succeed → update app config → proceed; if promote fails, page DBA on-call and hold."
Communication template — a pre-written status-page update and internal Slack message to copy-paste under pressure.
Post-failover validation — a checklist of smoke tests (synthetic transactions, key business metrics, queue depth) that confirm the secondary is actually serving correctly before the incident is downgraded.
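
The "ordered steps with rollback" section maps naturally onto a small data structure. The sketch below is one possible shape, assuming each step's action returns True when its expected outcome is observed; `Step` and `run_runbook` are hypothetical names, and the lambdas stand in for real operations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    owner: str
    action: Callable[[], bool]     # True when the expected outcome is observed
    rollback: Callable[[], None]   # undo or escalate if the action fails


def run_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; on the first failure, roll that step back and hold."""
    log = []
    for step in steps:
        try:
            ok = step.action()
        except Exception:
            ok = False
        if ok:
            log.append(f"OK {step.name}")
            continue
        step.rollback()
        log.append(f"FAILED {step.name}: rolled back, holding for {step.owner}")
        break  # never continue past a failed step
    return log


if __name__ == "__main__":
    steps = [
        Step("promote replica", "dba-oncall", lambda: True, lambda: None),
        Step("verify writes", "app-oncall", lambda: False, lambda: None),
        Step("update app config", "app-oncall", lambda: True, lambda: None),
    ]
    for line in run_runbook(steps):
        print(line)
```

Writing the runbook in this form forces every step to name an owner and a rollback, which is exactly the information that tends to be missing under pressure.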

Automated Failover

Automated failover reduces human reaction time from 5–30 minutes to 30–120 seconds, but it introduces the risk of a split-brain or a false-positive trigger that causes unnecessary failover. The mitigations are quorum-based decisions and circuit breakers.
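
One way to combine the two mitigations is a guard that only approves failover when enough independent observers agree and no failover happened recently. This is a minimal sketch under those assumptions; `FailoverGuard` and its parameters are hypothetical, and the cooldown here plays the role of a simple circuit breaker against flapping.

```python
import time
from typing import Dict, Optional


class FailoverGuard:
    """Approve failover only on quorum, and at most once per cooldown window."""

    def __init__(self, quorum: int, cooldown_seconds: float) -> None:
        self.quorum = quorum
        self.cooldown_seconds = cooldown_seconds
        self.last_failover_at: Optional[float] = None

    def should_fail_over(self, votes: Dict[str, bool],
                         now: Optional[float] = None) -> bool:
        """`votes` maps observer name -> True if it sees the primary as unhealthy."""
        now = time.monotonic() if now is None else now
        unhealthy_votes = sum(1 for unhealthy in votes.values() if unhealthy)
        if unhealthy_votes < self.quorum:
            return False  # not enough independent agreement
        if (self.last_failover_at is not None
                and now - self.last_failover_at < self.cooldown_seconds):
            return False  # breaker open: a failover just happened
        self.last_failover_at = now
        return True


if __name__ == "__main__":
    guard = FailoverGuard(quorum=2, cooldown_seconds=600)
    votes = {"route53": True, "canary-eu-west-1": True, "canary-us-east-1": False}
    print(guard.should_fail_over(votes, now=0.0))    # quorum reached
    print(guard.should_fail_over(votes, now=120.0))  # suppressed by cooldown
```

The observers should fail independently (different regions, different vantage points); three checks that all depend on the same network path do not form a meaningful quorum.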

# AWS Lambda: automated Route 53 failover triggered by a CloudWatch alarm
# (Simplified: production versions also need idempotency checks and a failback path)
import os

import boto3

r53 = boto3.client('route53')
sns = boto3.client('sns')

# ID of the Route 53 health check attached to the PRIMARY failover record
PRIMARY_HEALTH_CHECK_ID = os.environ['PRIMARY_HEALTH_CHECK_ID']
TOPIC_ARN               = os.environ['ALERT_TOPIC_ARN']

def handler(event, context):
    # EventBridge "CloudWatch Alarm State Change" events carry the new state here
    alarm_state = event['detail']['state']['value']

    if alarm_state != 'ALARM':
        return  # Only act on transition to ALARM

    # Force the primary health check to report unhealthy so Route 53 serves the
    # secondary. Disabled=True alone makes Route 53 treat the check as always
    # healthy; combining it with Inverted=True flips that to always unhealthy.
    # To fail back, set both flags to False again.
    r53.update_health_check(
        HealthCheckId=PRIMARY_HEALTH_CHECK_ID,
        Disabled=True,
        Inverted=True,
    )

    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject='DR FAILOVER TRIGGERED',
        Message=(
            'Automated failover initiated. Primary health check forced unhealthy.\n'
            'Validate secondary region before closing incident.'
        ),
    )

The golden rule of automated failover: automate the detection and the DNS switch, but require a human to approve the database promotion. Automating DNS is low-risk because it can be reversed within roughly one TTL. Automating RDS replica promotion or Kafka MirrorMaker cutover is high-risk because a promoted replica stops replicating from the old primary, so the change is difficult to reverse cleanly if the "failure" was a transient network blip.

RDS Failover: Aurora vs. Standard Multi-AZ

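Aurora Global Database replicates at the storage layer and exposes DNS-based cluster endpoints. Promoting a secondary region to writer takes on the order of 1–2 minutes. For a managed planned switchover the RPO is zero; for an unplanned failover, data loss is bounded by the replication lag at the time of failure, which AWS describes as typically under one second. These are typical figures, not SLA guarantees. Standard RDS Multi-AZ failover is within-region only; cross-region requires a read replica and a manual or scripted promotion. The difference matters: Aurora Global is suitable for RTO <2 min; standard RDS cross-region is suitable for RTO 5–30 min depending on automation maturity.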

Failover for Stateless vs. Stateful Components

Stateless services (API servers, workers) — DNS flip is sufficient. New region accepts traffic immediately once health checks pass. Scale-out warm pools (EC2 Auto Scaling predictive scaling, EKS Karpenter) should be pre-provisioned so the secondary does not cold-start under full load.
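Stateful services (databases, caches, message queues) — require explicit promotion and lag validation. For Redis, ElastiCache Global Datastore replicates across regions with typical lag under 1 s, but promoting a secondary cluster is an operator-initiated (or scripted) action, not an automatic one. For Kafka, MirrorMaker 2 in active-passive mode requires consumer group offsets to be translated to the target cluster after promotion (manually, or through MM2's checkpoint-based offset sync), because offsets differ between the source and mirrored topics.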

Pre-warm the secondary, always. Large multi-region operators (Netflix is the usual example) are reported to keep the secondary region at roughly 20–25 % of production capacity at all times, with auto-scaling triggered to reach 100 % within the failover window. Cold-starting a region from zero under full traffic (autoscaling from scratch, pulling container images, JVM warm-up) can easily double the effective RTO. The cost of idle capacity is usually small compared with the cost of an extended outage.

Failover Metrics to Track

In your DR observability stack, the following metrics must have dashboards and alarms ready before a drill:

HealthCheckPercentageHealthy — Route 53 CloudWatch metric; alarm at <100 % to get early warning.
ReplicaLag — RDS replica lag in seconds (for Aurora, use AuroraReplicaLag or AuroraGlobalDBReplicationLag, both reported in milliseconds); alarm at >30 s.
MirrorMaker2 replication latency — Kafka cross-region lag; alarm at >60 s.
DNS propagation time — measure from health-check flip to first request hitting secondary; track P99 across regions.
Time-to-first-byte in secondary — end-to-end synthetic monitor latency baseline, to distinguish "slow" from "failing."
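
Before a drill, it can help to encode the thresholds above in one place and check a metrics snapshot against them. The sketch below is illustrative only: the dictionary keys are hypothetical names, missing data is treated as a breach by assumption, and the percentile helper uses the nearest-rank method for the DNS propagation samples.

```python
from typing import Dict, List

# Thresholds mirror the list above; the keys are hypothetical metric names.
THRESHOLDS = {
    "health_check_percentage_healthy": ("below", 100.0),
    "replica_lag_seconds": ("above", 30.0),
    "mm2_replication_latency_seconds": ("above", 60.0),
}


def breached_alarms(metrics: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or past their threshold."""
    breached = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            breached.append(name)  # missing data is treated as a breach
        elif direction == "below" and value < limit:
            breached.append(name)
        elif direction == "above" and value > limit:
            breached.append(name)
    return breached


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile, e.g. for DNS propagation times."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


if __name__ == "__main__":
    snapshot = {
        "health_check_percentage_healthy": 100.0,
        "replica_lag_seconds": 45.0,
        "mm2_replication_latency_seconds": 12.0,
    }
    print(breached_alarms(snapshot))
```

A snapshot with a 45 s replica lag would flag only `replica_lag_seconds`, which is the kind of pre-flight signal that should block a failover until the secondary has caught up.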