Deployment Strategies & Progressive Delivery

Automated Canary Analysis

A canary deployment shifts a small slice of live traffic to a new version and waits to see whether that version behaves acceptably before moving further. The decision to promote or abort used to rely on an engineer staring at dashboards. Automated canary analysis replaces that human with a control loop: a controller queries your observability stack on a schedule, scores the canary against a policy, and promotes or rolls back without anyone pressing a button.

This is the approach that organizations such as Uber, Netflix, and Google apply to deployments at scale. The controller does not trust a green CI badge; it trusts live production signals compared against a baseline. Done right, it catches the regressions that integration tests cannot: latency spikes under real traffic patterns, error-rate increases in specific geographic regions, memory leaks that only show up under sustained load.

The Analysis Loop

Every automated canary system — whether it is Argo Rollouts, Flagger, or a custom controller — implements the same fundamental loop. Understanding it precisely is the prerequisite for tuning it correctly.

The automated canary analysis loop: the Rollout Controller splits traffic, spawns an AnalysisRun that queries metrics providers on each interval, and promotes or aborts based on whether metric thresholds are met.

The loop has four phases in every interval: wait (collect enough data), query (fetch metrics from your observability backend), evaluate (compare canary vs baseline against the policy), decide (increment weight, hold, or abort). The number of intervals and the weight increment per step are the primary tuning knobs.

Argo Rollouts: Metric-Based Promotion

Argo Rollouts extends Kubernetes with a Rollout CRD that replaces a standard Deployment. The canary strategy section defines the step weights and references an AnalysisTemplate that holds the metric queries.

# rollout.yaml — Argo Rollouts Rollout with canary analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api-service
          image: ghcr.io/myorg/api-service:v2.4.1
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
  strategy:
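    canary:
      # Services that receive stable and canary traffic (names assumed here;
      # Istio host-level splitting needs both, and both must already exist)
      stableService: api-service-stable
      canaryService: api-service-canary
      # Service mesh or ingress-level traffic splitting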
      trafficRouting:
        istio:
          virtualService:
            name: api-service-vsvc
            routes:
              - primary
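      steps:
        - setWeight: 5        # 5% canary traffic
        - pause: {duration: 5m}
        - analysis:           # Blocks at 5% until every metric finishes
            templates:
              - templateName: success-rate
              - templateName: p99-latency
            args:             # Not injected automatically: pass the canary hash
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 20
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: p99-latency
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 50
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: p99-latency
            args:
              - name: canary-hash
                valueFrom:
                  podTemplateHashValue: Latest
        # No explicit final step: Rollouts promotes to 100% on success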
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable

# analysis-template.yaml — AnalysisTemplate querying Prometheus
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  args:
    - name: service-name
      value: api-service  # default, so the arg resolves even if the Rollout omits it
    - name: canary-hash   # passed by the Rollout step via valueFrom.podTemplateHashValue: Latest
  metrics:
    - name: success-rate
      interval: 1m
      count: 5            # 5 measurements; failureLimit: 1 tolerates one failure
      successCondition: result[0] >= 0.99   # >= 99% success rate
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{
              job="{{args.service-name}}",
              pod=~".*{{args.canary-hash}}.*",
              status!~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              job="{{args.service-name}}",
              pod=~".*{{args.canary-hash}}.*"
            }[2m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency
  namespace: production
spec:
  args:
    - name: canary-hash   # must be declared: the query below references it
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.300   # p99 under 300 ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                job="api-service",
                pod=~".*{{args.canary-hash}}.*"
              }[2m])) by (le)
            )

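Always query the canary pods directly, not the service. A Prometheus query that covers the whole service averages canary and stable pods together and can mask a serious regression: at 5% canary weight, a canary failing 10% of its requests moves the service-wide error rate by only 0.5 percentage points. Filter on the pod-template-hash instead. Argo Rollouts does not inject it automatically; the analysis step passes it as an arg using valueFrom.podTemplateHashValue: Latest. Similarly, if you use Datadog, scope the query with a tag that identifies the canary, such as a version tag or one derived from the rollouts-pod-template-hash pod label.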
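
The templates above score the canary against fixed thresholds. To compare it against the stable baseline instead, a template can take both hashes and score the difference. The following is a sketch under assumptions, not a drop-in: the template name error-rate-delta and the one-percentage-point limit are illustrative, and the metric and label names are carried over from the templates above. It relies on the podTemplateHashValue values Latest and Stable, which Argo Rollouts resolves to the canary and stable ReplicaSet hashes.

```yaml
# Hypothetical template: canary 5xx ratio minus stable 5xx ratio
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-delta        # assumed name
  namespace: production
spec:
  args:
    - name: canary-hash
    - name: stable-hash
  metrics:
    - name: error-rate-delta
      interval: 1m
      count: 5
      failureLimit: 1
      # Pass while the canary is at most 1 percentage point worse than stable
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          # "or vector(0)" keeps the numerator defined when no 5xx series exist yet
          query: |
            (
              (sum(rate(http_requests_total{job="api-service", pod=~".*{{args.canary-hash}}.*", status=~"5.."}[2m])) or vector(0))
              /
              sum(rate(http_requests_total{job="api-service", pod=~".*{{args.canary-hash}}.*"}[2m]))
            )
            -
            (
              (sum(rate(http_requests_total{job="api-service", pod=~".*{{args.stable-hash}}.*", status=~"5.."}[2m])) or vector(0))
              /
              sum(rate(http_requests_total{job="api-service", pod=~".*{{args.stable-hash}}.*"}[2m]))
            )
---
# Matching entry for the Rollout's canary steps list
- analysis:
    templates:
      - templateName: error-rate-delta
    args:
      - name: canary-hash
        valueFrom:
          podTemplateHashValue: Latest
      - name: stable-hash
        valueFrom:
          podTemplateHashValue: Stable
```

A relative check like this tolerates incidents that affect both versions equally, such as a failing downstream dependency, which a fixed threshold would blame on the canary.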

Flagger: The Kubernetes-Native Alternative

Flagger is a CNCF project in the Flux family, originally developed at Weaveworks. Instead of a Rollout CRD it uses a Canary CRD that references an existing Deployment: Flagger creates an api-service-primary copy that serves stable traffic, treats the original Deployment as the canary, and generates the primary and canary Services for you. This makes it easier to retrofit onto an existing cluster without restructuring manifests.

# flagger-canary.yaml — Flagger Canary resource wrapping a Deployment
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service           # Flagger creates api-service-primary and uses this Deployment as the canary
  progressDeadlineSeconds: 600
  service:
    port: 8080
    targetPort: 8080
    gateways:
      - public-gateway.istio-system.svc.cluster.local
    hosts:
      - api.myorg.com
  analysis:
    interval: 1m                # evaluate every 60 seconds
    threshold: 5                # roll back after 5 failed checks
    maxWeight: 50               # never exceed 50% canary traffic
    stepWeight: 10              # increase by 10% each successful step
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99               # require >= 99% success rate
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500              # require p99 <= 500ms
        interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://api-service-canary.production:8080/healthz | grep OK"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 http://api-service-canary.production:8080/"

Always drive artificial load into the canary during analysis. Without load, the Prometheus queries return no data or NaN (0/0), so the analysis measures nothing about the new version: Flagger halts advancement, and Argo Rollouts records errored or failed measurements. Either way the rollout stalls or rolls back for reasons unrelated to the code being shipped. Use the Flagger load tester webhook or a separate k6 / hey job that continuously hammers the canary endpoint during the analysis window. Never rely on ambient production traffic alone, especially in the early steps when the canary weight is only 5-10%.

Metric Selection: What to Measure

The metrics you analyze are the most consequential design decision. Use this hierarchy, prioritized by signal-to-noise ratio:

HTTP 5xx error rate: the minimum bar. A canary that increases server errors from 0.1% to 2% should fail immediately. Measure it as a success ratio, (non-5xx requests) / (total requests), so the threshold reads directly as an availability target. The ratio is still 0/0 when the canary is cold, so make sure the canary receives traffic before the first measurement.
Request latency percentiles: p99 and p99.9 catch tail-latency regressions that average latency hides. A new version that doubles p99 while keeping the average the same degrades the slowest 1% of requests, which at 100k RPS is 1,000 requests every second.
Business metrics: for user-facing services, checkout initiation rate, login success rate, search result click-through. These catch logic regressions that return 200 OK but compute the wrong answer.
Resource saturation: CPU and memory growth trends over the analysis window. A version that looks fine at 10% traffic but will OOM at 100% is detectable early if you track the growth rate of container_memory_working_set_bytes.
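
The resource-saturation item can be expressed as one more AnalysisTemplate. This is a sketch under assumptions: the template name memory-growth, the container label value, the five-minute warm-up, and the ceiling of roughly 1 MiB per minute are illustrative and need calibrating against how your stable version behaves.

```yaml
# Hypothetical template: fail the canary if its memory keeps climbing
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: memory-growth           # assumed name
  namespace: production
spec:
  args:
    - name: canary-hash         # pass via valueFrom.podTemplateHashValue: Latest
  metrics:
    - name: memory-growth
      initialDelay: 5m          # skip start-up allocation and cache warm-up
      interval: 2m
      count: 5
      failureLimit: 1
      # deriv() returns bytes per second; 17476 B/s is about 1 MiB per minute
      successCondition: result[0] < 17476
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          # Worst-growing canary pod over a 10-minute window
          query: |
            max(
              deriv(container_memory_working_set_bytes{
                namespace="production",
                pod=~".*{{args.canary-hash}}.*",
                container="api-service"
              }[10m])
            )
```

Taking the maximum across pods rather than the average means a single leaking pod is enough to fail the measurement.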

Rollback Mechanics and Production Failure Modes

When an AnalysisRun fails, both Argo Rollouts and Flagger need to drain the canary cleanly. Argo Rollouts sets the canary weight to 0, leaves the canary pods running for a short delay so in-flight requests can finish (abortScaleDownDelaySeconds, 30 seconds by default when traffic routing is configured, plus your service mesh drain timeout), then scales the canary ReplicaSet to 0. The stable pods never stopped serving, so the impact is limited to the share of requests that reached the canary during the analysis window; there is no outage.
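
The delay before aborted canary pods are removed is configurable on the Rollout. A minimal fragment, assuming the api-service Rollout shown earlier; the 60-second and 45-second values are illustrative only:

```yaml
# Fragment of the Rollout spec; only fields relevant to abort behavior are shown
spec:
  strategy:
    canary:
      # Keep aborted canary pods for 60s after their weight drops to 0
      # (applies when trafficRouting is configured; the default is 30s)
      abortScaleDownDelaySeconds: 60
  template:
    spec:
      # Time each pod gets to finish in-flight requests after SIGTERM
      terminationGracePeriodSeconds: 45
```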

Errored and inconclusive analyses are a silent failure mode. If your Prometheus query returns no data (scrape target down, metric name typo, label mismatch), evaluating result[0] fails and Argo Rollouts records the measurement as Error, not Failed. By default, more than four consecutive errors (consecutiveErrorLimit) end the AnalysisRun and abort the rollout. Inconclusive is a separate outcome: it occurs when both successCondition and failureCondition are set and neither matches, and it pauses the rollout for a human decision. Many teams change these limits without understanding the implication. A misconfigured query that always returns no data will cause every canary to abort for the wrong reason, making the team distrust the system and disable it. Always smoke-test your PromQL queries manually against the Prometheus UI before setting them in an AnalysisTemplate.
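
How a metric behaves when the query returns nothing can be stated in the template rather than left to defaults. The fragment below is a sketch of one possible policy (fail closed on empty results, give up quickly on provider errors); the limit values are assumptions chosen for illustration.

```yaml
# Fragment of an AnalysisTemplate spec
metrics:
  - name: success-rate
    interval: 1m
    count: 5
    # An empty result counts as a failed measurement instead of an evaluation error
    successCondition: len(result) > 0 && result[0] >= 0.99
    failureLimit: 1
    # Provider errors (Prometheus unreachable, malformed query) are counted here
    consecutiveErrorLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(rate(http_requests_total{job="api-service", pod=~".*{{args.canary-hash}}.*", status!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{job="api-service", pod=~".*{{args.canary-hash}}.*"}[2m]))
```

With this condition a query that matches no series shows up as Failed in the AnalysisRun, which is easier to spot in a post-mortem than a run that ended in Error.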

The most important operational habit is to watch kubectl argo rollouts get rollout api-service -w during a deployment. This streams the step progression and the status of each AnalysisRun. The measured values behind each pass/fail decision are recorded on the AnalysisRun objects themselves, which makes post-mortems straightforward when a canary is aborted:

# Watch a live rollout (Argo Rollouts)
kubectl argo rollouts get rollout api-service -w -n production

# Force-promote past the current step (use carefully — bypasses analysis)
kubectl argo rollouts promote api-service -n production

# Immediately abort and rollback
kubectl argo rollouts abort api-service -n production

# List AnalysisRuns, oldest first (runs created by a Rollout are named <rollout>-<hash>-...)
kubectl get analysisrun -n production \
  --sort-by=.metadata.creationTimestamp

# Describe the most recent AnalysisRun for api-service for detailed metric results
kubectl describe -n production \
  $(kubectl get analysisrun -n production \
    --sort-by=.metadata.creationTimestamp -o name | grep '/api-service-' | tail -n 1)

Automated canary analysis is not a silver bullet — it is only as good as the metrics you feed it. The discipline is in designing thresholds that are tight enough to catch real regressions but loose enough to survive normal traffic variance. Start conservative (abort at 1% error-rate increase, 50ms p99 increase), collect data on false-positive rates over several releases, and tighten or loosen from evidence. After a few months of well-tuned analysis, your team stops fearing Friday afternoon deploys.