Deployment Strategies & Progressive Delivery

Feature Flags

A feature flag (also called a feature toggle or feature switch) is a conditional branch in code that routes execution based on a runtime configuration value rather than a code deployment. The flag can be flipped on or off without shipping new code. This decouples deployment — pushing new code to servers — from release — making that code active for users. It is the foundational primitive that makes every other progressive-delivery technique practical at scale.

Why the Best Engineering Orgs Live by Flags

At companies like Google, Netflix, Facebook, and Shopify, teams merge to main and deploy to production many times per day. The code for a half-built feature ships behind a flag that is false for everyone. When the feature is ready, the flag is enabled — first for internal employees (dogfood), then 1 % of traffic, then progressively wider. If metrics degrade, the flag is disabled in seconds, avoiding a full rollback. This workflow removes the coupling between code velocity and product risk.

Flag Types

Release toggles — hide incomplete work in production; removed once the feature is fully shipped.
Experiment toggles (A/B) — route cohorts to variants for statistical measurement; removed after winner is declared.
Ops toggles (kill switches) — disable a costly or broken subsystem under load without a deploy (e.g. disable a recommendation engine during peak traffic).
Permission toggles — enable features for specific user tiers or beta groups permanently (e.g. premium plan features).

Rule of thumb: Release and experiment toggles are short-lived (days to weeks). Ops and permission toggles can be long-lived. Treat them differently: short-lived flags must have a scheduled removal date written into the ticket the day they are created.
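The rule of thumb can be enforced mechanically. The following is a minimal sketch, assuming a hypothetical Flag metadata record (the type, field names, and error are illustrative and do not come from any vendor's schema): short-lived kinds are rejected when they carry no removal date, while long-lived kinds pass.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Kind mirrors the four flag types listed above.
type Kind string

const (
	Release    Kind = "release"
	Experiment Kind = "experiment"
	Ops        Kind = "ops"
	Permission Kind = "permission"
)

// Flag is a hypothetical metadata record, not any vendor's schema.
type Flag struct {
	Key      string
	Kind     Kind
	Owner    string
	RemoveBy time.Time // zero value means "no removal date set"
}

// ErrNoRemovalDate is returned for short-lived flags without a removal date.
var ErrNoRemovalDate = errors.New("short-lived flag has no removal date")

// ShortLived reports whether the kind is expected to live days to weeks.
func (f Flag) ShortLived() bool {
	return f.Kind == Release || f.Kind == Experiment
}

// Validate enforces the rule of thumb: short-lived flags need a removal date.
func (f Flag) Validate() error {
	if f.ShortLived() && f.RemoveBy.IsZero() {
		return fmt.Errorf("%s: %w", f.Key, ErrNoRemovalDate)
	}
	return nil
}

func main() {
	flags := []Flag{
		{Key: "checkout.new-payment-flow", Kind: Release, Owner: "payments-team"},
		{Key: "recs.kill-switch", Kind: Ops, Owner: "platform-team"},
	}
	for _, f := range flags {
		if err := f.Validate(); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", f.Key)
	}
}
```

A check like this could run in CI or in the tooling that creates flags, so the removal date exists from day one.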

Anatomy of a Flag Evaluation

Every flag evaluation follows the same pipeline: the SDK receives a flag key and an evaluation context (user ID, country, plan, device), evaluates targeting rules in order, and returns a variant. With server-side SDKs running in local-evaluation mode, the evaluation is synchronous and in-process, with no network call during request handling. The SDK caches a local copy of the flag ruleset and refreshes it in the background by polling or streaming, either directly from the flag service or through a nearby proxy (Relay Proxy in LaunchDarkly, Edge Proxy in Flagsmith, etc.).

Feature flag evaluation pipeline: SDK uses a locally cached ruleset — no network call on the hot path.
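To make the "no network call on the hot path" idea concrete, here is a minimal sketch of a locally cached ruleset. The Ruleset and Store types are hypothetical and much simpler than a real SDK, which stores full targeting rules. A background poller or stream listener would call Swap; request handlers only ever read from memory.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Ruleset is a hypothetical snapshot of flag state.
type Ruleset struct {
	Version int
	Enabled map[string]bool
}

// Store holds the latest snapshot. Readers never block and never touch the network.
type Store struct {
	current atomic.Pointer[Ruleset] // requires Go 1.19 or newer
}

// Swap installs a new snapshot atomically; in-flight readers keep the old one.
func (s *Store) Swap(rs *Ruleset) { s.current.Store(rs) }

// Bool evaluates a flag against the cached snapshot. It falls back to def
// when no snapshot has been loaded yet or the key is unknown.
func (s *Store) Bool(key string, def bool) bool {
	rs := s.current.Load()
	if rs == nil {
		return def
	}
	v, ok := rs.Enabled[key]
	if !ok {
		return def
	}
	return v
}

func main() {
	var s Store
	// No snapshot yet: the default is returned.
	fmt.Println(s.Bool("checkout.new-payment-flow", false))

	// Simulate the background refresh delivering version 1 of the ruleset.
	s.Swap(&Ruleset{Version: 1, Enabled: map[string]bool{"checkout.new-payment-flow": true}})
	fmt.Println(s.Bool("checkout.new-payment-flow", false))
}
```

Swapping a whole immutable snapshot, rather than mutating a shared map, is one way to keep reads lock-free; other designs use a read-write mutex instead.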

Targeting and User Segmentation

Targeting rules let you control who sees a flag and how much. Rules are evaluated top-to-bottom; the first match wins:

Individual targeting — specific user IDs (internal team, QA engineers).
Attribute rules — country == "US", plan == "enterprise", version >= "2.0".
Percentage rollout — deterministic bucketing on a hash of userId + flagKey so the same user always lands in the same bucket (sticky assignment).
Default rule — the fallback for everyone else.
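The first-match-wins ordering above can be sketched in a few lines. The Context and Rule types here are hypothetical and exist only for illustration; real platforms express rules as data rather than Go closures.

```go
package main

import "fmt"

// Context is the evaluation context: a user ID plus free-form attributes.
type Context struct {
	UserID string
	Attrs  map[string]string
}

// Rule pairs a predicate with the variant it serves (hypothetical type).
type Rule struct {
	Name    string
	Matches func(Context) bool
	Variant string
}

// Evaluate walks the rules top to bottom and returns the first match.
// When nothing matches it returns the default variant and the name "default".
func Evaluate(rules []Rule, def string, c Context) (variant, matched string) {
	for _, r := range rules {
		if r.Matches(c) {
			return r.Variant, r.Name
		}
	}
	return def, "default"
}

func main() {
	rules := []Rule{
		// Individual targeting comes first so it overrides broader rules.
		{"individual", func(c Context) bool { return c.UserID == "qa-1" }, "on"},
		{"enterprise-us", func(c Context) bool {
			return c.Attrs["country"] == "US" && c.Attrs["plan"] == "enterprise"
		}, "on"},
	}
	c := Context{UserID: "u-42", Attrs: map[string]string{"country": "DE", "plan": "free"}}
	v, why := Evaluate(rules, "off", c)
	fmt.Println(v, why)
}
```

Returning the name of the matching rule alongside the variant makes it easier to answer "why did this user see the feature?" during debugging.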

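Sticky bucketing matters. If you draw a random number on each request, a user will see the feature flicker on and off across page loads. Always hash userId + flagKey + salt for percentage rollouts to guarantee a consistent user experience. Including the flag key keeps buckets independent across flags, so the same users are not always first into every rollout, and changing the salt lets you deliberately reshuffle the buckets for a new experiment.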
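A minimal sketch of deterministic bucketing follows. FNV-1a and the 10,000-bucket resolution are assumptions chosen for brevity; each vendor uses its own hash function and bucket count, and the function names are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bucket maps (userID, flagKey, salt) to a stable value in [0, 10000).
// FNV-1a is an assumption for this sketch, not any vendor's algorithm.
func Bucket(userID, flagKey, salt string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(flagKey))
	h.Write([]byte{0})
	h.Write([]byte(salt))
	return h.Sum32() % 10000
}

// InRollout reports whether the user falls inside a rollout of pct percent (0 to 100).
func InRollout(userID, flagKey, salt string, pct float64) bool {
	return float64(Bucket(userID, flagKey, salt)) < pct*100
}

func main() {
	in := 0
	for i := 0; i < 10000; i++ {
		if InRollout(fmt.Sprintf("user-%d", i), "checkout.new-payment-flow", "v1", 10) {
			in++
		}
	}
	fmt.Printf("%d of 10000 users fall in the 10%% rollout\n", in)
}
```

Because a user's bucket is fixed, widening the rollout from 10 % to 50 % only adds users; nobody who already had the feature loses it.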

Implementing a Kill Switch with OpenFeature

OpenFeature is a CNCF project that defines a vendor-neutral specification and a set of SDKs for flag evaluation. A provider plugs in the actual backend (LaunchDarkly, Flagsmith, Unleash, custom), which lets you swap backends without rewriting application code.

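# Install the OpenFeature Go SDK + Flagsmith provider (shell)
go get github.com/open-feature/go-sdk
go get github.com/open-feature/go-sdk-contrib/providers/flagsmith

// main.go: wire up the provider once at startup.
// NewProvider wraps a Flagsmith client; confirm constructor and option names
// against the provider and client versions you install.
import (
    "log"
    "os"

    // Use the client major version that the provider's go.mod requires.
    flagsmithClient "github.com/Flagsmith/flagsmith-go-client/v2"
    flagsmithProvider "github.com/open-feature/go-sdk-contrib/providers/flagsmith/pkg"
    "github.com/open-feature/go-sdk/pkg/openfeature"
)

// Local evaluation and the ruleset refresh interval (for example every 30 s)
// are set through client options; see the Flagsmith Go client documentation.
fsClient := flagsmithClient.NewClient(os.Getenv("FLAGSMITH_SERVER_KEY"))
provider := flagsmithProvider.NewProvider(fsClient)
openfeature.SetProvider(provider)
client := openfeature.NewClient("checkout-service")

// In the request handler: evaluate the kill switch
enabled, err := client.BooleanValue(
    ctx,
    "checkout.new-payment-flow",   // flag key
    false,                          // safe default if SDK errors
    openfeature.NewEvaluationContext("user-"+userID, map[string]interface{}{
        "plan":    user.Plan,
        "country": user.Country,
    }),
)
if err != nil {
    // enabled already holds the default (false); log and stay on the legacy path
    log.Printf("flag evaluation failed: %v", err)
}
if enabled {
    return newPaymentFlow(ctx, order)
}
return legacyPaymentFlow(ctx, order)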

The false default is critical. If the flag service is unreachable, the SDK returns the default — which should always be the safe, known-good path. Never default to true for an unvalidated feature.
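The fallback contract can be shown without any flag SDK at all. The sketch below uses a hypothetical BoolEvaluator function type as a stand-in for a flag client; it is not the OpenFeature API, only an illustration of "return the safe default on error".

```go
package main

import (
	"errors"
	"fmt"
)

// BoolEvaluator is a hypothetical stand-in for a flag client call.
type BoolEvaluator func(key string) (bool, error)

// ErrUnavailable simulates an unreachable flag service.
var ErrUnavailable = errors.New("flag service unavailable")

// BoolOrDefault returns the evaluated value, or def when evaluation fails.
// Whatever the evaluator returned alongside the error is ignored.
func BoolOrDefault(eval BoolEvaluator, key string, def bool) bool {
	v, err := eval(key)
	if err != nil {
		return def
	}
	return v
}

func main() {
	// The evaluator fails, so the caller must end up on the known-good path.
	down := func(string) (bool, error) { return true, ErrUnavailable }
	if BoolOrDefault(down, "checkout.new-payment-flow", false) {
		fmt.Println("new payment flow")
	} else {
		fmt.Println("legacy payment flow")
	}
}
```

With a false default, an outage of the flag service degrades to the legacy behavior instead of exposing every user to an unvalidated code path.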

Flag Hygiene and Technical Debt

Flags accumulate. A codebase with 200 stale boolean flags has a combinatorial explosion in testing surface: up to 2^200 theoretical states. In practice this causes real incidents: in 2012 Knight Capital lost roughly $440 million in about 45 minutes after a deployment repurposed an old flag and reactivated dormant code on a server that had not received the new release. Flag debt is as dangerous as code debt.

Enforce these practices on every team:

Owner + expiry on creation — every flag must have an owner team and a remove_by date in the flag metadata. Block PR merges that add a flag without these fields.
Tag flags in code — add a comment // TODO: remove flag "checkout.new-payment-flow" by 2025-09-01 @payments-team next to every flag evaluation. This makes flags searchable in IDEs and static analysis.
Track flag evaluations — export evaluation events to your observability stack. If a flag has had zero evaluations in 30 days, it is a candidate for removal.
Junk drawer alert — your flag management console should show flags with no evaluations in the last 14 days highlighted in red. Most SaaS platforms (LaunchDarkly, Unleash) support this natively.
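The TODO comment convention above becomes enforceable once something reads it. The following sketch scans source text for that exact comment shape and reports flags whose date has passed. The regular expression, the Overdue type, and the function name are assumptions for illustration; a real check would walk the repository and run in CI.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// todoRe matches the comment convention described above: flag key, date, owner.
var todoRe = regexp.MustCompile(`TODO: remove flag "([^"]+)" by (\d{4}-\d{2}-\d{2}) @([\w-]+)`)

// Overdue is one flag whose removal date has passed.
type Overdue struct {
	Key, Owner, RemoveBy string
}

// FindOverdue scans source text and returns flags whose remove-by date is
// strictly before today. Both dates use the YYYY-MM-DD layout.
func FindOverdue(src, today string) ([]Overdue, error) {
	now, err := time.Parse("2006-01-02", today)
	if err != nil {
		return nil, fmt.Errorf("bad today %q: %w", today, err)
	}
	var out []Overdue
	for _, line := range strings.Split(src, "\n") {
		m := todoRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		due, err := time.Parse("2006-01-02", m[2])
		if err != nil {
			continue // malformed date such as month 13: skip rather than guess
		}
		if due.Before(now) {
			out = append(out, Overdue{Key: m[1], RemoveBy: m[2], Owner: m[3]})
		}
	}
	return out, nil
}

func main() {
	src := `if enabled { // TODO: remove flag "checkout.new-payment-flow" by 2025-09-01 @payments-team
	return newPaymentFlow(ctx, order)
}`
	overdue, err := FindOverdue(src, "2025-10-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, o := range overdue {
		fmt.Printf("%s was due for removal on %s, owner @%s\n", o.Key, o.RemoveBy, o.Owner)
	}
}
```

Failing the build on overdue flags is the strict option; posting a reminder to the owning team is a gentler starting point.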

Production pitfall — permanent flags in a monorepo. When a team marks a permission toggle as "permanent" to avoid cleanup work, future engineers see it and assume all flags are permanent. Within 18 months you have 400 flags with no owners and no removal dates. Mandate code deletion for every flag that reaches 100 % rollout with no targeting rules — the code path for the old behavior must be deleted, not just dead-branched.

Self-Hosting Flags vs SaaS

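For regulated industries or air-gapped environments, self-hosted solutions like Unleash (open source; Apache 2.0 in older releases, AGPLv3 from v6) or Flagsmith (open source, BSD 3-Clause) are production-ready. Check the license of the exact version you deploy. Run a proxy in each region (Unleash Edge or the older Unleash Proxy; Edge Proxy for Flagsmith) so that SDKs read flag state from a nearby cache and keep evaluating from their last known ruleset even if the central server is unreachable.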

# Deploy Unleash self-hosted via Docker Compose (production pattern)
# docker-compose.yml excerpt

services:
  unleash:
    image: unleashorg/unleash-server:6
    environment:
      DATABASE_URL: "postgres://unleash:${DB_PASSWORD}@db:5432/unleash"
      DATABASE_SSL: "false"             # the bundled Postgres container does not serve TLS
      UNLEASH_URL: "https://flags.internal.example.com"
      AUTH_TYPE: "custom"               # wire your SSO here
      INIT_ADMIN_API_TOKENS: "${ADMIN_TOKEN}"
    ports:
      - "4242:4242"
    depends_on:
      db:
        condition: service_healthy

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: "${DB_PASSWORD}"
      POSTGRES_DB: unleash
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U unleash"]
      interval: 5s
      retries: 5

# Add an Unleash Proxy (or its successor, Unleash Edge) per region so clients do not depend on one central server
  unleash-proxy:
    image: unleashorg/unleash-proxy:1
    environment:
      UNLEASH_URL: "http://unleash:4242/api/"
      UNLEASH_API_TOKEN: "${PROXY_TOKEN}"
      UNLEASH_PROXY_SECRETS: "${CLIENT_SECRET}"
    ports:
      - "3000:3000"

Flags and CI/CD Integration

Treat flag changes as deployments. When you flip a flag from 0 % to 10 % via the UI, an audit event should flow into your change management system (PagerDuty, Jira, ServiceNow). Wire the flag platform's webhook to your incident management tool so that if an SLO alert fires within five minutes of a flag change, the on-call responder sees the flag change highlighted in the incident timeline. This shortens diagnosis, and with it MTTR (mean time to recovery), because the most likely trigger is already in front of the responder.
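The five-minute correlation is simple to express in code. This is a minimal sketch, assuming a hypothetical FlagChange audit event as it might arrive from a webhook; the field names, the DefaultWindow constant, and the helper functions are illustrative and not part of any platform's payload.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// FlagChange is a hypothetical audit event received from a flag platform webhook.
type FlagChange struct {
	Key    string
	Detail string
	At     time.Time
}

// DefaultWindow is the five-minute lookback described above.
const DefaultWindow = 5 * time.Minute

// RecentChanges returns the flag changes made within window before the alert
// fired, newest first, so the most likely trigger is at the top.
func RecentChanges(changes []FlagChange, alertAt time.Time, window time.Duration) []FlagChange {
	var out []FlagChange
	for _, c := range changes {
		age := alertAt.Sub(c.At)
		if age >= 0 && age <= window {
			out = append(out, c)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].At.After(out[j].At) })
	return out
}

// mustTime parses an RFC 3339 timestamp and panics on bad input (sample data only).
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	changes := []FlagChange{
		{"search.new-ranker", "0% -> 50%", mustTime("2025-01-01T11:30:00Z")},
		{"checkout.new-payment-flow", "0% -> 10%", mustTime("2025-01-01T12:07:00Z")},
	}
	alertAt := mustTime("2025-01-01T12:10:00Z")
	for _, c := range RecentChanges(changes, alertAt, DefaultWindow) {
		fmt.Printf("suspect: %s (%s) changed %s before the alert\n", c.Key, c.Detail, alertAt.Sub(c.At))
	}
}
```

The window length is a tuning choice: too short misses slow-burning regressions, too long buries the responder in unrelated changes.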

GitOps for flags. Store flag definitions as YAML in your Git repo and apply them via CI. This gives you PR review for flag changes, automatic rollback via git revert, and a diff-based audit log — exactly what you get for Kubernetes manifests with ArgoCD. Flagsmith and Unleash both support configuration-as-code import/export.
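The core of a GitOps apply step is a diff between the state declared in the repository and the state live in the flag platform. The sketch below models both as a map from flag key to rollout percentage, which is a deliberate simplification: real definitions carry targeting rules, and parsing the YAML is left out because it needs a third-party library. The Plan function is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan returns the human-readable changes needed to make live match desired.
// Both maps go from flag key to rollout percentage (a simplified model).
func Plan(desired, live map[string]int) []string {
	var out []string
	for k, want := range desired {
		have, ok := live[k]
		switch {
		case !ok:
			out = append(out, fmt.Sprintf("create %s at %d%%", k, want))
		case have != want:
			out = append(out, fmt.Sprintf("update %s from %d%% to %d%%", k, have, want))
		}
	}
	for k := range live {
		if _, ok := desired[k]; !ok {
			out = append(out, "delete "+k)
		}
	}
	sort.Strings(out) // map iteration order is random; sorting keeps CI logs diffable
	return out
}

func main() {
	desired := map[string]int{"checkout.new-payment-flow": 10, "search.new-ranker": 100}
	live := map[string]int{"checkout.new-payment-flow": 0, "legacy.banner": 50}
	for _, step := range Plan(desired, live) {
		fmt.Println(step)
	}
}
```

Printing the plan in the pull request, before anything is applied, gives reviewers the same preview they would expect from an infrastructure change.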