A/B Testing & Experimentation
Canary releases and A/B tests look similar from the outside — both send a fraction of traffic to a variant. But they answer fundamentally different questions. A canary asks "Is this build safe to deploy?" — it is an operational gate focused on latency, error rates, and system health. An A/B test asks "Does this change improve user behavior?" — it is a product experiment focused on conversion, engagement, session length, or revenue. Conflating the two is one of the most common mistakes in progressive delivery.
Experiments vs. Canaries — the Conceptual Split
- Unit of comparison: A canary compares the new build against the old. An A/B test compares two or more product variants (which may share the same binary).
- Success metric: Canary success = "nothing broke." A/B success = statistically significant lift on a business metric.
- Traffic split duration: A canary runs until confidence is achieved and then rolls forward, typically within hours. An A/B test runs until its planned sample size is reached, often days or weeks.
- User assignment: Canary splits by request or pod. A/B splits by user identity (sticky assignment), so the same user always sees the same variant throughout the experiment.
- Who owns it: The SRE/platform team owns canary analysis. The product/growth team owns A/B experimentation, though the platform must provide the infrastructure.
The Experimentation Architecture
A mature experimentation platform is built from several layers, including:
- Assignment layer: deterministic hashing on user_id + experiment_id guarantees the same user always lands in the same bucket. This is critical: if assignment is random per request, the same user can see both variants, and that mixed treatment invalidates your statistics.
- Event pipeline: every user action emits an event tagged with the experiment variant. Impressions (exposure) and conversions (goal completions) must be tracked separately and joined later by the stats engine.
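The assignment layer can be illustrated with a minimal sketch. The function name `assign_variant`, the choice of SHA-256, and the key format are assumptions made for illustration, not the scheme of any particular platform; the point is only that the bucket depends on nothing but the user and the experiment.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str,
                   variants: list[str], weights: list[float]) -> str:
    """Deterministically map a user to a variant (illustrative sketch)."""
    if not variants or len(variants) != len(weights):
        raise ValueError("variants and weights must be non-empty and equal length")

    # Hash experiment and user together so that buckets in one experiment
    # are uncorrelated with buckets in another.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()

    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64

    # Walk the cumulative weight distribution until the point is covered.
    total = sum(weights)
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight / total
        if point < cumulative:
            return variant
    return variants[-1]  # guard against floating-point rounding
```

Because no state is stored, any service that knows the user and experiment identifiers computes the same answer, which is what keeps a user in one variant across requests and devices.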
Metrics and Guardrails
Every experiment needs two classes of metrics defined before launch:
- Primary metric: the one business outcome you are optimizing. Examples: checkout completion rate, 7-day retention, click-through rate. You win or lose the experiment on this metric.
- Guardrail metrics: metrics that must not regress. Examples: page load time (p99), error rate, revenue per user. If any guardrail degrades beyond a threshold — even if the primary metric improves — the experiment is a failure and must be stopped.
Feature Flags as the Delivery Vehicle
A/B tests are almost always implemented via feature flags. The flag system provides the assignment logic and the override mechanism. Unleash (self-hosted, open-source) is one option for a minimal setup.
In application code, read the variant and emit the impression event together, at the point where the user is actually exposed, so you never count a user who was assigned but never exposed.
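A minimal sketch of that pattern follows. It is deliberately vendor-neutral: `FlagClient`, `variant_with_exposure`, `render_checkout`, and the flag name `new-checkout` are hypothetical names for illustration and are not part of the Unleash SDK. A real flag client would be adapted to the `get_variant` shape assumed here.

```python
import time
from typing import Callable, Protocol


class FlagClient(Protocol):
    """Hypothetical interface; a real flag SDK would be wrapped to fit it."""

    def get_variant(self, flag_name: str, user_id: str) -> str: ...


def variant_with_exposure(flags: FlagClient, emit: Callable[[dict], None],
                          flag_name: str, user_id: str) -> str:
    """Read the variant and record the impression in the same call."""
    variant = flags.get_variant(flag_name, user_id)
    emit({
        "type": "impression",
        "experiment": flag_name,
        "user_id": user_id,
        "variant": variant,
        "ts": time.time(),
    })
    return variant


def render_checkout(flags: FlagClient, emit: Callable[[dict], None],
                    user_id: str) -> str:
    # Called only when the checkout page is rendered, so users who never
    # reach checkout are never logged as exposed.
    variant = variant_with_exposure(flags, emit, "new-checkout", user_id)
    return "one-page checkout" if variant == "treatment" else "classic checkout"
```

Routing every read through one helper makes it hard to evaluate a flag without logging the exposure, which is the failure mode the paragraph above warns about.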
Statistical Validity — What Big Tech Actually Enforces
Big-tech experimentation platforms enforce several statistical hygiene rules that many teams skip:
- Pre-registration: the primary metric, the minimum detectable effect (MDE), and the sample size derived from it are locked before the experiment starts. Changing the metric after seeing results is HARKing (Hypothesizing After the Results are Known), a questionable research practice that inflates false-positive rates.
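- No peeking / sequential testing: checking p-values before the planned end and stopping as soon as they look significant is a common cause of false positives. Either enforce a fixed horizon or use a sequential testing method such as mSPRT (mixture sequential probability ratio test), which keeps error rates valid under continuous monitoring. CUPED, which originated at Microsoft, is sometimes mentioned in this context, but it is a variance reduction technique: it can shorten experiments, and it does not make peeking safe.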
- Novelty effect: users react differently to new things. Long-running experiments (2+ weeks) filter out the novelty spike and surface the true steady-state effect.
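- SRM checks (Sample Ratio Mismatch): if you assigned 50/50 but the observed split deviates by more than chance allows (48/52 across 100,000 users, for example), something is broken in your assignment or logging. Whether a given split is suspicious depends on sample size, so test it with a chi-squared goodness-of-fit test rather than by eye. Always run an SRM check before trusting results.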
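The SRM check in the last rule can be sketched with the standard library alone. The function name `srm_p_value` is an assumption for illustration; the statistic is a chi-squared goodness-of-fit test with one degree of freedom, whose tail probability can be written with `math.erfc`.

```python
import math


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """p-value of a chi-squared goodness-of-fit test for a two-arm split."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control

    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# The same 48/52 ratio is unremarkable at 1,000 users
# and a clear mismatch at 100,000 users.
small = srm_p_value(480, 520)
large = srm_p_value(48_000, 52_000)
```

Teams often use a strict threshold for SRM alerts (p < 0.001 is a common choice) because the check runs repeatedly and a false alarm blocks an otherwise healthy experiment.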
Automated Guardrail Enforcement with Statsig
Statsig is a widely used experimentation platform (Notion, Brex, Figma) that automates guardrail checking and integrates with your metrics store.
Guardrail thresholds in such a setup are expressed as multipliers: 1.05 means "stop the experiment if the treatment's p99 latency exceeds 1.05x the control's p99 latency." Statsig (and similar platforms like Optimizely, Eppo, or internal platforms at Amazon/Google) will automatically flag or halt the experiment when any guardrail is breached, sending an alert to the owning team.
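The multiplier rule itself is simple enough to sketch without any vendor SDK. Nothing below is Statsig's API: `Guardrail`, `breached_guardrails`, and the metric names are hypothetical, and the sketch compares point estimates, whereas a real platform would typically test the difference statistically before halting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_ratio: float  # treatment / control; 1.05 allows up to 5% worse


def breached_guardrails(control: dict[str, float],
                        treatment: dict[str, float],
                        guardrails: list[Guardrail]) -> list[str]:
    """Return the names of guardrail metrics whose ratio exceeds the limit.

    Assumes "higher is worse" metrics such as latency or error rate.
    A metric like revenue per user would need the inverse comparison.
    """
    breached = []
    for guardrail in guardrails:
        baseline = control[guardrail.metric]
        if baseline <= 0:
            raise ValueError(f"control value for {guardrail.metric} must be positive")
        ratio = treatment[guardrail.metric] / baseline
        if ratio > guardrail.max_ratio:
            breached.append(guardrail.metric)
    return breached


rails = [Guardrail("p99_latency_ms", 1.05), Guardrail("error_rate", 1.10)]
control_metrics = {"p99_latency_ms": 200.0, "error_rate": 0.0100}
treatment_metrics = {"p99_latency_ms": 215.0, "error_rate": 0.0101}
result = breached_guardrails(control_metrics, treatment_metrics, rails)
```

Here the treatment's p99 latency is 1.075x the control's, which is over the 1.05 limit, while the error rate stays within its 1.10 limit.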
Experiment Lifecycle in Production
- Design: define hypothesis, primary metric, guardrails, MDE (minimum detectable effect), and required sample size using a power calculator.
- Launch: ship the flag dark (0% traffic), verify logging, then ramp to target allocation.
- Monitor: check SRM daily; alert on guardrail breaches; do not look at the primary metric until the planned end date.
- Conclude: at the planned horizon, read results. Ship the winner, document learnings, archive the flag.
- Clean up: remove the losing code path and delete the flag within one sprint. Flag debt accumulates fast and makes codebases unmaintainable.
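The sample-size calculation from the Design step can be sketched as follows, assuming a conversion-rate metric and a two-sided test comparing two proportions. The function name `sample_size_per_arm` and the default values for `alpha` and `power` are assumptions for illustration; a dedicated power calculator may apply slightly different formulas.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if mde_abs == 0 or not 0 < p2 < 1:
        raise ValueError("mde_abs must be non-zero and keep the rate between 0 and 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power

    # Sum of the Bernoulli variances of the two arms.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Baseline 10% conversion, detect a 1 percentage point absolute lift.
n_one_point = sample_size_per_arm(0.10, 0.01)
# Halving the MDE roughly quadruples the required sample.
n_half_point = sample_size_per_arm(0.10, 0.005)
```

With these inputs the first call comes out at roughly 14,700 users per arm, which is why a small MDE on a low-traffic page translates directly into an experiment that runs for weeks.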