Deployment Strategies & Progressive Delivery

A/B Testing & Experimentation

Canary releases and A/B tests look similar from the outside — both send a fraction of traffic to a variant. But they answer fundamentally different questions. A canary asks "Is this build safe to deploy?" — it is an operational gate focused on latency, error rates, and system health. An A/B test asks "Does this change improve user behavior?" — it is a product experiment focused on conversion, engagement, session length, or revenue. Conflating the two is one of the most common mistakes in progressive delivery.

Experiments vs. Canaries — the Conceptual Split

Unit of comparison: A canary compares the new build against the old. An A/B test compares two or more product variants (which may share the same binary).
Success metric: Canary success = "nothing broke." A/B success = statistically significant lift on a business metric.
Traffic split duration: A canary runs until confidence is achieved and then rolls forward (hours). An A/B test runs until statistical significance is reached — often days or weeks.
User assignment: Canary splits by request or pod. A/B splits by user identity with sticky assignment (not to be confused with load-balancer sticky sessions), so the same user always sees the same variant throughout the experiment.
Who owns it: The SRE/platform team owns canary analysis. The product/growth team owns A/B experimentation, though the platform must provide the infrastructure.

At companies like Google, Netflix, and Meta, A/B experimentation is not an optional practice — it is the default mechanism for every product change that could affect user behavior. Netflix reportedly runs thousands of experiments simultaneously. That scale requires a dedicated experimentation platform, not ad-hoc feature flags.

The Experimentation Architecture

A mature experimentation platform has four layers: assignment, variant serving, event collection, and statistical analysis.

Assignment layer: deterministic hashing on user_id + experiment_id guarantees the same user always lands in the same bucket. This is critical — if assignment is random per request, you get a mixed-treatment problem that invalidates your statistics.
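The deterministic hashing described above can be sketched in a few lines of Node.js. This is a minimal illustration, not any particular platform's algorithm: the function names `bucketFor` and `assignVariant`, the choice of SHA-256, and the 10,000-bucket resolution are all assumptions made for the example.

```javascript
const crypto = require('node:crypto');

// Map (experimentId, userId) to a stable bucket in [0, 9999].
// Including the experiment ID in the hash input keeps bucket positions
// uncorrelated across experiments.
function bucketFor(userId, experimentId) {
  const digest = crypto
    .createHash('sha256')
    .update(`${experimentId}:${userId}`)
    .digest();
  // Interpret the first 4 bytes as an unsigned 32-bit integer.
  return digest.readUInt32BE(0) % 10000;
}

// variants: [{ name, weight }] where the weights sum to 10000.
function assignVariant(userId, experimentId, variants) {
  const bucket = bucketFor(userId, experimentId);
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.name;
  }
  throw new Error('variant weights must sum to 10000');
}

// Hypothetical usage: the same inputs always yield the same variant.
const exampleVariants = [
  { name: 'control', weight: 5000 },
  { name: 'treatment', weight: 5000 },
];
console.log(assignVariant('user-42', 'checkout-redesign', exampleVariants));
```

Because the assignment is a pure function of the two IDs, no assignment table has to be stored or looked up; any service that knows the experiment definition can recompute the variant.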

Event pipeline: every user action emits an event tagged with the experiment variant. Impressions (exposure) and conversions (goal completions) must be tracked separately and joined later by the stats engine.
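To make the impression/conversion join concrete, here is a small in-memory sketch of what the stats engine does conceptually. The event shape (`user_id`, `variant`, `timestamp`) and the function name `summarizeByVariant` are assumptions for illustration; a production pipeline would run the same logic as a warehouse query or a stream job.

```javascript
// Join impressions and conversions per user. A conversion only counts if the
// user was exposed, and only if it happened at or after the first exposure.
function summarizeByVariant(impressions, conversions) {
  // First exposure per user: userId -> { variant, timestamp }.
  const firstExposure = new Map();
  for (const e of impressions) {
    const seen = firstExposure.get(e.user_id);
    if (!seen || e.timestamp < seen.timestamp) {
      firstExposure.set(e.user_id, { variant: e.variant, timestamp: e.timestamp });
    }
  }

  // Users with at least one qualifying conversion.
  const converted = new Set();
  for (const c of conversions) {
    const exposure = firstExposure.get(c.user_id);
    if (exposure && c.timestamp >= exposure.timestamp) converted.add(c.user_id);
  }

  // Per-variant counts of exposed and converted users.
  const summary = {};
  for (const [userId, exposure] of firstExposure) {
    if (!summary[exposure.variant]) {
      summary[exposure.variant] = { exposed: 0, converted: 0 };
    }
    summary[exposure.variant].exposed += 1;
    if (converted.has(userId)) summary[exposure.variant].converted += 1;
  }
  return summary;
}
```

Counting users rather than raw events keeps the denominator aligned with the unit of assignment, which is what the statistical test assumes.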

Metrics and Guardrails

Every experiment needs two classes of metrics defined before launch:

Primary metric: the one business outcome you are optimizing. Examples: checkout completion rate, 7-day retention, click-through rate. You win or lose the experiment on this metric.
Guardrail metrics: metrics that must not regress. Examples: page load time (p99), error rate, revenue per user. If any guardrail degrades beyond a threshold — even if the primary metric improves — the experiment is a failure and must be stopped.

Never run an experiment without guardrail metrics. The classic failure: a redesigned checkout page increases conversion rate by 2% but adds 300 ms of latency because of heavier JavaScript. Without a latency guardrail, you ship a degraded experience at scale. Guardrails are the experiment's equivalent of canary error-rate checks, applied to user-impacting quality signals.

Feature Flags as the Delivery Vehicle

A/B tests are almost always implemented via feature flags. The flag system provides the assignment logic and the override mechanism. A minimal setup with Unleash (self-hosted, open-source) looks like this:

# unleash-feature.yaml: illustrative flag definition (gradual rollout with variants)
name: checkout-redesign
description: A/B test of the new checkout flow
enabled: true
strategies:
  - name: flexibleRollout      # replaces the deprecated gradualRolloutUserId
    parameters:
      rollout: "100"           # 100% exposed to the experiment
      stickiness: "userId"     # the same user always gets the same result
      groupId: "checkout-redesign"
variants:
  - name: control
    weight: 500              # 50% (total weight = 1000)
    payload:
      type: string
      value: "original"
  - name: treatment
    weight: 500
    payload:
      type: string
      value: "redesigned"

In application code, read the variant and emit the impression event at the moment of exposure, immediately before rendering, so you never count a user who was assigned but never exposed:

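// Node.js: Unleash SDK variant evaluation inside an Express-style handler.
// Assumes initialize() from unleash-client has already run at startup and that
// analytics and the two render functions are defined elsewhere in the app.
const { getVariant } = require('unleash-client');

function checkoutHandler(req, res) {
  const context = { userId: req.user.id };
  const variant = getVariant('checkout-redesign', context);

  if (variant.enabled) {
    // Emit the impression BEFORE rendering, and only for users actually exposed.
    analytics.track('experiment_impression', {
      experiment_id: 'checkout-redesign',
      variant: variant.name,
      user_id: req.user.id,
      timestamp: Date.now(),
    });

    if (variant.name === 'treatment') {
      return renderRedesignedCheckout(req, res);
    }
  }
  // Flag disabled or control variant: fall back to the original flow.
  return renderOriginalCheckout(req, res);
}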

Statistical Validity — What Big Tech Actually Enforces

Big-tech experimentation platforms enforce several statistical hygiene rules that many teams skip:

Pre-registration: the primary metric, the minimum detectable effect (MDE), and the sample size derived from it are locked before the experiment starts. Changing the metric after seeing results is HARKing (Hypothesizing After the Results are Known), a questionable research practice that inflates false-positive rates.
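No peeking / sequential testing: checking p-values before the experiment ends and stopping as soon as they look significant is one of the most common causes of false positives. Either enforce a fixed horizon or use a sequential testing method designed for continuous monitoring, such as mSPRT (the mixture sequential probability ratio test behind Optimizely's Stats Engine) or a group-sequential design. CUPED, by contrast, is a variance reduction technique and does not make peeking safe.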
Novelty effect: users react differently to new things. Long-running experiments (2+ weeks) filter out the novelty spike and surface the true steady-state effect.
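SRM checks (Sample Ratio Mismatch): if you assigned 50/50 but the observed split differs by more than chance allows (48/52 across 100,000 users is far outside that range), something is broken in your assignment or logging. Test the observed counts against the planned ratio with a chi-squared test, and always run the SRM check before trusting results.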
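An SRM check is a chi-squared goodness-of-fit test on the observed user counts. The sketch below handles the two-variant case only and compares the statistic against 10.828, the critical value for one degree of freedom at p = 0.001. The strict threshold and the function name `srmCheck` are assumptions for this example; teams choose their own alert level.

```javascript
// Chi-squared critical value for 1 degree of freedom at p = 0.001.
// Valid for exactly two variants; more variants need a different value.
const CHI2_CRITICAL_DF1_P001 = 10.828;

// observedCounts: users actually seen per variant, e.g. [48000, 52000].
// expectedRatios: planned split in any units, e.g. [1, 1] for 50/50.
function srmCheck(observedCounts, expectedRatios) {
  if (observedCounts.length !== 2 || expectedRatios.length !== 2) {
    throw new Error('this sketch supports exactly two variants');
  }
  const total = observedCounts[0] + observedCounts[1];
  const ratioSum = expectedRatios[0] + expectedRatios[1];

  let chi2 = 0;
  for (let i = 0; i < 2; i++) {
    const expected = (total * expectedRatios[i]) / ratioSum;
    chi2 += (observedCounts[i] - expected) ** 2 / expected;
  }
  return { chi2, mismatch: chi2 > CHI2_CRITICAL_DF1_P001 };
}

// The same 48/52 split is noise at 1,000 users and a clear SRM at 100,000.
console.log(srmCheck([480, 520], [1, 1]));
console.log(srmCheck([48000, 52000], [1, 1]));
```

The two calls show why a raw ratio is not enough: the statistic grows with sample size, so the test separates ordinary sampling noise from a genuine assignment or logging bug.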

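Use the CUPED (Controlled-experiment Using Pre-Experiment Data) technique to reduce variance. By adjusting the metric with a pre-experiment covariate (e.g., the user's conversion rate from the prior 2 weeks), you remove a fraction ρ² of its variance, where ρ is the correlation between the covariate and the metric. The required sample size shrinks by the same fraction, so a covariate with ρ ≈ 0.7 cuts it roughly in half; the original Microsoft paper reports reductions of about 50% on some metrics. Faster experiments are a real competitive advantage. CUPED was introduced at Microsoft, and Netflix has published on its own variance reduction work.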
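The CUPED adjustment itself is short: estimate θ = cov(X, Y) / var(X) from the data, then subtract θ·(X − mean(X)) from each user's metric. The sketch below shows the arithmetic on plain arrays; the function names are assumptions, and in a real experiment θ would be estimated on the pooled data from all variants before comparing the adjusted means.

```javascript
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample covariance; covariance(xs, xs) is the sample variance.
function covariance(xs, ys) {
  const mx = mean(xs);
  const my = mean(ys);
  let sum = 0;
  for (let i = 0; i < xs.length; i++) sum += (xs[i] - mx) * (ys[i] - my);
  return sum / (xs.length - 1);
}

// metric[i]: in-experiment value for user i (Y).
// covariate[i]: the same user's pre-experiment value (X).
function cupedAdjust(metric, covariate) {
  const theta = covariance(covariate, metric) / covariance(covariate, covariate);
  const covariateMean = mean(covariate);
  // Centering the covariate keeps the adjusted mean equal to the raw mean.
  return metric.map((y, i) => y - theta * (covariate[i] - covariateMean));
}

// Hypothetical data: the metric tracks the pre-period value closely.
const prePeriod = [1, 2, 3, 4, 5];
const inExperiment = [2.1, 3.9, 6.2, 7.8, 10.0];
const adjusted = cupedAdjust(inExperiment, prePeriod);
console.log(covariance(inExperiment, inExperiment), covariance(adjusted, adjusted));
```

The adjusted values keep the same mean but have far less variance, which is what lets the same effect reach significance with fewer users. The covariate must be measured before exposure so the treatment cannot influence it.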

Automated Guardrail Enforcement with Statsig

Statsig is a widely used experimentation platform (Notion, Brex, Figma) that automates guardrail checking and integrates with your metrics store:

# Statsig experiment config via the Console API (illustrative: the payload
# field names below are a sketch, not the exact API schema)
curl -X POST https://statsigapi.net/console/v1/experiments \
  -H "STATSIG-API-KEY: $STATSIG_CONSOLE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout_redesign",
    "id_type": "userID",
    "hypothesis": "Redesigned checkout increases completion rate",
    "primary_metrics": [
      { "name": "checkout_completion_rate", "type": "event_count_divided_by_user" }
    ],
    "secondary_metrics": [
      { "name": "revenue_per_user", "type": "sum" }
    ],
    "guardrail_metrics": [
      { "name": "p99_latency_ms",      "threshold": 1.05, "direction": "lower_is_better" },
      { "name": "error_rate",          "threshold": 1.10, "direction": "lower_is_better" },
      { "name": "support_ticket_rate", "threshold": 1.20, "direction": "lower_is_better" }
    ],
    "allocation": 50,
    "targeting": { "environments": ["production"] }
  }'

Guardrail thresholds above are expressed as multipliers: 1.05 means "stop the experiment if the treatment's p99 latency exceeds 1.05x the control's p99 latency." Statsig (and similar platforms like Optimizely, Eppo, or internal platforms at Amazon/Google) will automatically flag or halt the experiment when any guardrail is breached, sending an alert to the owning team.
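The multiplier rule is easy to state in code. The sketch below is not Statsig's implementation: it compares point estimates only, it handles only metrics where lower is better, and the function name `breachedGuardrails` is an assumption. Real platforms typically apply a statistical test before declaring a breach.

```javascript
// guardrails: [{ name, threshold }] where threshold is a multiplier on control.
// control / treatment: objects mapping metric name -> observed value.
// Assumes every guardrail metric is "lower is better".
function breachedGuardrails(control, treatment, guardrails) {
  return guardrails
    .filter((g) => treatment[g.name] > control[g.name] * g.threshold)
    .map((g) => g.name);
}

// Hypothetical readings for the checkout experiment.
const guardrailConfig = [
  { name: 'p99_latency_ms', threshold: 1.05 },
  { name: 'error_rate', threshold: 1.1 },
];
const controlMetrics = { p99_latency_ms: 800, error_rate: 0.01 };
const treatmentMetrics = { p99_latency_ms: 850, error_rate: 0.0105 };

const breaches = breachedGuardrails(controlMetrics, treatmentMetrics, guardrailConfig);
if (breaches.length > 0) {
  console.log(`halt experiment, guardrails breached: ${breaches.join(', ')}`);
}
```

Here the latency guardrail trips because 850 ms exceeds 800 ms × 1.05 = 840 ms, while the error rate stays inside its 10% allowance, so the experiment would be halted regardless of how the primary metric looks.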

Experiment Lifecycle in Production

Design: define hypothesis, primary metric, guardrails, MDE (minimum detectable effect), and required sample size using a power calculator.
Launch: ship the flag dark (0% traffic), verify logging, then ramp to target allocation.
Monitor: check SRM daily; alert on guardrail breaches; do not look at the primary metric until the planned end date.
Conclude: at the planned horizon, read results. Ship the winner, document learnings, archive the flag.
Clean up: remove the losing code path and delete the flag within one sprint. Flag debt accumulates fast and makes codebases unmaintainable.
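The Design and Conclude steps both come down to standard two-proportion formulas, sketched below. The assumptions are a two-sided α of 0.05 and 80% power (z values 1.96 and 0.8416), an absolute MDE, and equal allocation; the function names are hypothetical, and a real power calculator or stats engine would also account for things like variance reduction and multiple metrics.

```javascript
const Z_ALPHA = 1.96;  // two-sided alpha = 0.05
const Z_BETA = 0.8416; // power = 0.80

// Design: users needed per variant to detect an absolute lift of `absoluteMde`
// over `baselineRate` with the alpha and power above.
function sampleSizePerVariant(baselineRate, absoluteMde) {
  const p1 = baselineRate;
  const p2 = baselineRate + absoluteMde;
  const varianceSum = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * varianceSum) / absoluteMde ** 2);
}

// Conclude: pooled two-proportion z statistic, read once at the planned horizon.
function twoProportionZ(conversionsA, usersA, conversionsB, usersB) {
  const pA = conversionsA / usersA;
  const pB = conversionsB / usersB;
  const pooled = (conversionsA + conversionsB) / (usersA + usersB);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / usersA + 1 / usersB));
  return (pB - pA) / standardError;
}

// Hypothetical plan: 10% baseline checkout completion, 1 percentage point MDE.
console.log(sampleSizePerVariant(0.10, 0.01));

// Hypothetical result: 10.0% vs 11.0% on 10,000 users each.
const z = twoProportionZ(1000, 10000, 1100, 10000);
console.log(z, Math.abs(z) > Z_ALPHA ? 'significant' : 'not significant');
```

Note how quickly the requirement grows: halving the MDE roughly quadruples the sample size, which is why the MDE has to be fixed during design rather than after looking at the data.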

Experiment velocity, the number of experiments shipped per quarter, is often treated as a leading indicator of product quality. Amazon's culture of experimentation, in which every significant product decision is tested, is a frequently cited source of competitive advantage. Build your platform to maximize throughput, not just accuracy.