MLOps & DevOps for AI Systems

CI/CD for Models

18 min Lesson 5 of 28

You already run CI/CD pipelines for application code: lint, test, build, push, deploy. Extending that discipline to ML models means adding a parallel track that tests the model artifact with the same rigor you apply to the application binary — and then automating promotion decisions that your data science team previously made by hand. This is where MLOps stops being philosophy and becomes operational engineering.

The core challenge is that a model is not just code. A passing test suite on the training codebase does not tell you whether the resulting model artifact is safe to serve. You need layered gates: data quality checks, training validation, offline evaluation on held-out data, shadow deployment, and canary traffic — all automated and all blocking further promotion when they fail. At companies like Google, Spotify, and DoorDash, these gates run entirely without human intervention for the vast majority of routine model updates; engineers are paged only when a gate cannot automatically decide.

The Four Gate Layers

Think of the model pipeline as four sequential gate layers, each with automated pass/fail criteria. A model artifact must pass all preceding layers before advancing to the next.

Gate 1 — Data & Feature Validation: Run before training starts. Checks schema conformance, null rates, cardinality, distribution statistics against baselines, and training-serving skew. Tools: Great Expectations, dbt tests, Feast validation hooks. A failure here means the training data is corrupt or the feature pipeline has drifted — do not waste GPU time training on bad data.
Gate 2 — Training Validation: Run during and immediately after training. Checks that loss converged (no NaN loss, loss not stuck), that gradients are healthy (gradient norm within bounds), and that training metrics (AUC, F1, RMSE, etc.) meet a configured floor. An absolute floor ("AUC must be > 0.80") catches catastrophic failures; a relative delta ("AUC must not drop more than 2% from the current champion") catches regressions.
Gate 3 — Offline Evaluation: Run against a held-out test set and a set of behavioral tests (also called slice tests or model assertions). Checks overall metric targets, per-slice performance (performance must not degrade on protected or high-value segments), and adversarial robustness where relevant. This is also where you run your bias and fairness tests.
Gate 4 — Online Evaluation: Shadow traffic and canary rollout. The new model serves predictions in parallel with the champion, or on 1–5% of real traffic, while you compare latency, error rate, and online business metrics (click-through rate, conversion, revenue per request). Only after a configurable soak time with no regressions does the pipeline promote the new model to full production.
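As a rough illustration of how Gate 2's two thresholds combine, the sketch below applies the absolute floor first and the champion comparison second. The function name, the `GateResult` type, and the reading of "2%" as an absolute AUC difference of 0.02 are assumptions made for this example, not part of any library.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: str


def check_training_gate(
    candidate_auc: float,
    champion_auc: Optional[float],
    min_auc: float = 0.80,   # absolute floor ("AUC must be > 0.80")
    max_drop: float = 0.02,  # assumed to be an absolute AUC difference
) -> GateResult:
    """Apply the absolute floor, then the regression check against the champion."""
    # A NaN metric usually means training diverged; fail fast.
    if math.isnan(candidate_auc):
        return GateResult(False, "candidate AUC is NaN")
    # Absolute floor: catches catastrophic failures.
    if candidate_auc <= min_auc:
        return GateResult(False, f"AUC {candidate_auc:.3f} is not above floor {min_auc:.2f}")
    # Relative delta: only meaningful once a champion exists.
    if champion_auc is not None and champion_auc - candidate_auc > max_drop:
        drop = champion_auc - candidate_auc
        return GateResult(False, f"AUC dropped {drop:.3f} versus champion")
    return GateResult(True, "ok")
```

A CI script would typically exit non-zero when `passed` is false, so that the pipeline step fails and later gates never run.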

Key idea: Gates 1–3 are offline — they run in your CI/CD infrastructure before any production traffic touches the model. Gate 4 is online — it requires a deployment to a shadow or canary slot. Conflating them leads to either shipping untested models or blocking deploys forever waiting for production data. Keep them architecturally separate.
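To make the online side concrete, here is a minimal sketch of a Gate 4 decision that compares canary statistics with the champion's once enough traffic has accumulated. The `ServingStats` type, the thresholds, and the three verdict strings are hypothetical; a production system would more likely use a statistical test than fixed cut-offs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingStats:
    requests: int
    errors: int
    conversions: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.requests if self.requests else 0.0


def canary_verdict(
    champion: ServingStats,
    canary: ServingStats,
    *,
    min_requests: int = 1000,              # stand-in for the soak requirement
    max_error_rate_increase: float = 0.005,
    max_latency_ratio: float = 1.10,
    max_conversion_drop: float = 0.02,     # relative drop versus champion
) -> str:
    """Return 'keep_soaking', 'rollback', or 'promote'."""
    # Too little canary traffic to judge: keep the canary running.
    if canary.requests < min_requests:
        return "keep_soaking"
    if canary.error_rate - champion.error_rate > max_error_rate_increase:
        return "rollback"
    if canary.p95_latency_ms > champion.p95_latency_ms * max_latency_ratio:
        return "rollback"
    if champion.conversion_rate > 0:
        drop = (champion.conversion_rate - canary.conversion_rate) / champion.conversion_rate
        if drop > max_conversion_drop:
            return "rollback"
    return "promote"
```

Note that this logic needs live serving data, which is why it cannot run in the same CI job as Gates 1 to 3.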

CI Pipeline: Training and Evaluation

The CI pipeline triggers on a merge to main (code change) or on a scheduled cron (periodic retraining from fresh data). Both paths run the same gate sequence. Here is a production-grade GitHub Actions workflow using MLflow for experiment tracking and a model registry for promotion:

# .github/workflows/model-ci.yml

name: Model CI

on:
  push:
    branches: [main]
    paths:
      - 'src/train/**'
      - 'src/features/**'
      - 'config/model.yaml'
  schedule:
    - cron: '0 2 * * *'    # nightly retraining from fresh data

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_DEFAULT_REGION: us-east-1

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -r requirements-ci.txt
      - name: Validate feature data
        run: |
          python scripts/validate_data.py \
            --expectation-suite config/ge_suite.json \
            --feature-snapshot s3://ml-artifacts/features/latest/ \
            --baseline s3://ml-artifacts/features/baseline/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

  train-and-evaluate:
    needs: data-validation
    runs-on: [self-hosted, gpu]    # GPU runner — standard EC2 g4dn.xlarge
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        id: train    # train.py must write "run_id=<MLflow run id>" to $GITHUB_OUTPUT
        run: |
          python src/train/train.py \
            --config config/model.yaml \
            --experiment-name "ci-${GITHUB_SHA::8}"

      - name: Gate 2 — training metrics
        run: |
          python scripts/check_training_metrics.py \
            --run-id ${{ steps.train.outputs.run_id }} \
            --min-auc 0.82 \
            --max-auc-delta 0.02   # max absolute AUC drop vs champion (0.02 = two points)

      - name: Gate 3 — offline evaluation
        run: |
          python scripts/evaluate_offline.py \
            --run-id ${{ steps.train.outputs.run_id }} \
            --test-set s3://ml-artifacts/test-sets/latest/ \
            --slice-config config/eval_slices.yaml \
            --fairness-config config/fairness_thresholds.yaml

      - name: Register candidate model
        if: success()
        run: |
          python scripts/register_model.py \
            --run-id ${{ steps.train.outputs.run_id }} \
            --stage Staging    # goes to Staging, NOT Production yet

  notify-on-failure:
    needs: [data-validation, train-and-evaluate]
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - name: Page on-call
        run: |
          curl -X POST ${{ secrets.PAGERDUTY_WEBHOOK }} \
            -H "Content-Type: application/json" \
            -d '{"event_action":"trigger","payload":{"summary":"Model CI failed on '$GITHUB_SHA'","severity":"warning","source":"github-actions"}}'

Pro practice: Use a dedicated self-hosted GPU runner for the training job. Spot/preemptible instances cut cost by 60–80% versus on-demand. Your CI framework (GitHub Actions, GitLab CI, Argo Workflows) should handle retry logic and send the job to a different AZ on spot interruption. Kubeflow Pipelines and Vertex AI Pipelines both have native spot-retry mechanisms.
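The workflow's first job delegates to scripts/validate_data.py, which the lesson does not show. As a stand-in, the sketch below implements two of the Gate 1 checks (null rate and distribution drift against a baseline) in plain Python rather than Great Expectations. The function names are hypothetical, and the 0.2 population stability index (PSI) cut-off is a common rule of thumb rather than a value from this lesson.

```python
import math
from typing import List, Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of missing values in a column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population stability index over equal-width bins of the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns

    def proportions(xs: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in xs:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small epsilon avoids log(0) for empty bins.
        return [max(c / len(xs), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def validate_column(
    baseline: Sequence[Optional[float]],
    current: Sequence[Optional[float]],
    max_null_rate: float = 0.01,
    max_psi: float = 0.2,
) -> List[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    rate = null_rate(current)
    if rate > max_null_rate:
        failures.append(f"null rate {rate:.3f} > {max_null_rate}")
    base_present = [v for v in baseline if v is not None]
    cur_present = [v for v in current if v is not None]
    if base_present and cur_present:
        drift = psi(base_present, cur_present)
        if drift > max_psi:
            failures.append(f"PSI {drift:.3f} > {max_psi}")
    return failures
```

A real validation script would loop over every feature column and exit non-zero if any list of failures is non-empty, so that the training job never starts.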

The Model Deployment Pipeline (Diagrammed)

After the CI pipeline registers a candidate in Staging, a separate CD pipeline handles promotion to production. The two are deliberately decoupled: CI runs on every commit; CD runs when a human or an automated policy decides to promote. This mirrors the GitOps pattern you already use for application deployments — the model registry is the "GitOps repo" for model artifacts.

Figure: ML model CI/CD pipeline, showing four gate layers from data validation to full production rollout, with automatic failure paths and rollback.

Gate 3 in Depth: Slice Testing and Model Assertions

Overall metric thresholds are necessary but insufficient. A model can improve on aggregate AUC while silently degrading on a specific demographic segment, geographic region, or product category — the kind of failure that causes real-world harm and regulatory exposure. Production ML systems at regulated companies (finance, healthcare, hiring) run slice-based evaluation as a hard gate.
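A small worked example shows how an aggregate metric can hide a weak slice. The sketch below computes AUC with a simple pairwise comparison and checks each slice against its own floor. The toy rows, the `failing_slices` helper, and the use of Python predicates in place of filter strings are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def failing_slices(
    rows: Sequence[Row],
    slices: Dict[str, Tuple[Callable[[Row], bool], float]],
) -> List[str]:
    """Evaluate each slice independently; any entry returned should block promotion."""
    failures = []
    for name, (predicate, min_auc) in slices.items():
        subset = [r for r in rows if predicate(r)]
        auc = auc_roc([r["label"] for r in subset], [r["score"] for r in subset])
        if auc < min_auc:
            failures.append(f"{name}: AUC {auc:.3f} < {min_auc:.2f}")
    return failures


# Toy data: the model ranks desktop users perfectly but mobile users poorly.
ROWS = [
    {"platform": "desktop", "label": 1, "score": 0.9},
    {"platform": "desktop", "label": 1, "score": 0.8},
    {"platform": "desktop", "label": 0, "score": 0.2},
    {"platform": "desktop", "label": 0, "score": 0.1},
    {"platform": "mobile", "label": 1, "score": 0.4},
    {"platform": "mobile", "label": 0, "score": 0.6},
    {"platform": "mobile", "label": 1, "score": 0.7},
    {"platform": "mobile", "label": 0, "score": 0.3},
]
SLICES = {
    "mobile_users": (lambda r: r["platform"] == "mobile", 0.80),
    "desktop_users": (lambda r: r["platform"] == "desktop", 0.80),
}
```

On these rows the aggregate AUC is 0.9375, comfortably above a global floor of 0.83, yet the mobile slice scores only 0.75 and `failing_slices(ROWS, SLICES)` reports it. The pairwise AUC is quadratic in the number of rows, so a real evaluation would use a library implementation instead.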

The evaluation config specifies which slices matter and what the per-slice floor is:

# config/eval_slices.yaml
# Each slice is evaluated independently. ANY slice failing its threshold
# blocks the entire promotion — the pipeline does not average across slices.

global_metrics:
  auc_roc:
    min: 0.83
    max_regression_vs_champion: 0.02

slices:
  - name: mobile_users
    filter: "platform == 'mobile'"
    metrics:
      auc_roc: { min: 0.80 }

  - name: new_accounts           # high-value business segment
    filter: "account_age_days < 30"
    metrics:
      auc_roc: { min: 0.78 }
      precision_at_k: { k: 10, min: 0.72 }

  - name: eu_region              # GDPR-regulated, fairness-sensitive
    filter: "region == 'EU'"
    metrics:
      auc_roc: { min: 0.81 }
      disparate_impact_ratio: { min: 0.80 }   # fairness metric: min 80% rule

behavioral_tests:               # model assertions — deterministic inputs
  - name: "invariance: typo robustness"
    input_pairs:
      - [canonical, perturbed]   # model output must not flip label
    max_flip_rate: 0.05

  - name: "directional: higher spend = higher score"
    direction: positive
    feature: transaction_amount
    expected_correlation: positive

Production pitfall: Teams often discover slice regressions only after a model has been serving production traffic for days — because evaluation was only run on the aggregate test set. By the time the slice degradation shows up in business dashboards (e.g., conversion drop for new users), it has already cost real revenue or user trust. Run slice tests in Gate 3 and treat any slice failure as a hard block, not a warning.
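The two behavioral tests in the config above (invariance and directional) can be sketched as small, model-agnostic checks. The helper names are hypothetical, and the model is passed in as a plain callable so the example stays self-contained; this is one possible reading of the config, not a fixed schema.

```python
from typing import Callable, Sequence, Tuple


def flip_rate(
    predict: Callable[[str], int],
    pairs: Sequence[Tuple[str, str]],
) -> float:
    """Share of (canonical, perturbed) input pairs whose predicted label differs."""
    if not pairs:
        raise ValueError("need at least one input pair")
    flips = sum(predict(canonical) != predict(perturbed) for canonical, perturbed in pairs)
    return flips / len(pairs)


def passes_invariance(
    predict: Callable[[str], int],
    pairs: Sequence[Tuple[str, str]],
    max_flip_rate: float = 0.05,
) -> bool:
    """Invariance test: small perturbations should rarely flip the label."""
    return flip_rate(predict, pairs) <= max_flip_rate


def passes_directional(
    score: Callable[[float], float],
    feature_values: Sequence[float],
) -> bool:
    """Directional test: the score must not decrease as the feature value increases."""
    scores = [score(v) for v in sorted(feature_values)]
    return all(later >= earlier for earlier, later in zip(scores, scores[1:]))
```

Because these checks use fixed inputs, they are deterministic and cheap, which makes them a good fit for a blocking CI gate.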

Automated Rollback and Champion/Challenger Logic

The CD pipeline should always know which model is the current champion (the model currently serving 100% of production traffic) and be able to roll back to it in under two minutes if the challenger fails online evaluation. In MLflow, model stages map cleanly to this: Staging for candidates, Production for the champion. Note that model stages are deprecated as of MLflow 2.9 in favor of model version aliases, so check the stage-based calls below against the MLflow version you run. Your deployment script compares the challenger against the champion before any traffic shift.

#!/usr/bin/env python3
# scripts/promote_model.py
# Called by the CD pipeline after canary soak passes.
# Uses MLflow's stage-based registry API (deprecated since MLflow 2.9).

import sys

from mlflow.tracking import MlflowClient

client = MlflowClient()  # reads MLFLOW_TRACKING_URI from the environment
MODEL_NAME = "fraud-detector"


def get_champion():
    """Return the model version currently in Production, or None."""
    versions = client.get_latest_versions(MODEL_NAME, stages=["Production"])
    return versions[0] if versions else None


def promote_challenger(challenger_version: str):
    champion = get_champion()

    # Promote the challenger and archive the outgoing champion in one call,
    # so there is no window in which no version is in Production.
    # Archived versions are kept (do NOT delete them) for rollback and audit.
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=challenger_version,
        stage="Production",
        archive_existing_versions=True,
    )
    if champion:
        print(f"Archived champion v{champion.version}")
    print(f"Promoted challenger v{challenger_version} to Production")


def rollback():
    """Called by monitoring alerting when Gate 4b detects online regression."""
    versions = client.search_model_versions(f"name='{MODEL_NAME}'")
    # The previous champion is the version archived most recently. Sorting by
    # version number instead would re-select a model that was itself rolled
    # back earlier. Assumes archived versions were not modified afterwards.
    archived = [v for v in versions if v.current_stage == "Archived"]
    archived.sort(key=lambda v: v.last_updated_timestamp, reverse=True)
    if archived:
        prev = archived[0]
        promote_challenger(prev.version)
        print(f"Rollback complete: restored v{prev.version}")
    else:
        raise RuntimeError("No archived champion found; cannot roll back automatically")


if __name__ == "__main__":
    if len(sys.argv) == 3 and sys.argv[1] == "promote":
        promote_challenger(sys.argv[2])
    elif len(sys.argv) == 2 and sys.argv[1] == "rollback":
        rollback()
    else:
        sys.exit("usage: promote_model.py promote <version> | rollback")

Pro practice: Wire your automated rollback trigger to your observability stack, not just the pipeline. Use Prometheus alerts or Datadog monitors watching the online business metric (e.g., conversion rate, fraud catch rate). If the metric drops more than a configured threshold within 30 minutes of a model promotion, the alerting system should call promote_model.py rollback automatically — no engineer required. At DoorDash and Airbnb this closes the feedback loop to under five minutes from degradation to rollback.
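The decision behind such an alert can be sketched in a few lines. In practice the rule would live in a Prometheus alert or a Datadog monitor; the Python below only illustrates the logic, and the function name, the 5% relative-drop threshold, and the parameter names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def should_roll_back(
    promoted_at: datetime,
    now: datetime,
    baseline_metric: float,
    current_metric: float,
    max_relative_drop: float = 0.05,                  # hypothetical threshold
    watch_window: timedelta = timedelta(minutes=30),  # window from the text
) -> bool:
    """True if the business metric fell too far within the post-promotion window."""
    elapsed = now - promoted_at
    # Outside the window, a drop is less clearly caused by the promotion,
    # so this sketch leaves the decision to a human.
    if elapsed < timedelta(0) or elapsed > watch_window:
        return False
    if baseline_metric <= 0:
        return False  # no meaningful baseline to compare against
    drop = (baseline_metric - current_metric) / baseline_metric
    return drop > max_relative_drop
```

When this returns true, the alert handler would invoke `promote_model.py rollback`, for example through a webhook that starts the CD pipeline's rollback job.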

Putting It Together: The Full Trigger Map

Production ML pipelines typically have three distinct trigger modes that must all feed into the same gate sequence:

Code change trigger: A PR merges to main touching model code, feature engineering, or hyperparameters. The pipeline trains from the current production dataset. This ensures every code change is validated against real data before promotion.
Scheduled trigger (continuous training): A nightly or weekly cron trains the model on a rolling window of recent data. The pipeline checks whether the new model beats the champion before promoting. If the champion is still winning, no promotion happens — the pipeline exits cleanly with a "champion retained" status.
Data-triggered retraining: A monitoring alert fires when production drift exceeds a threshold (covered in Lesson 7). The alert calls the pipeline API directly, bypassing the cron schedule, so the model can be retrained and promoted within hours of a distribution shift rather than days.

All three paths must emit the same structured metadata to your experiment tracker so you can audit "why did this model go to production on Tuesday?" months later. Treat that audit trail with the same seriousness as your deployment audit log — it is your defence in a model incident post-mortem and in regulatory review.
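One way to keep that metadata uniform is a single record type that every trigger path fills in. The sketch below is a hypothetical schema, not an MLflow feature; the field names, the `Trigger` enum, and the three decision strings are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Dict, Optional


class Trigger(Enum):
    CODE_CHANGE = "code_change"
    SCHEDULED = "scheduled"
    DATA_DRIFT = "data_drift"


@dataclass(frozen=True)
class PromotionRecord:
    trigger: Trigger
    git_sha: str
    data_snapshot: str
    candidate_version: str
    champion_version: Optional[str]
    gate_results: Dict[str, bool]
    decision: str  # "promoted", "champion_retained", or "blocked"

    def to_json(self) -> str:
        record = asdict(self)
        record["trigger"] = self.trigger.value  # enums are not JSON-serializable
        return json.dumps(record, sort_keys=True)


def build_record(
    trigger: Trigger,
    git_sha: str,
    data_snapshot: str,
    candidate_version: str,
    champion_version: Optional[str],
    gate_results: Dict[str, bool],
    beats_champion: bool,
) -> PromotionRecord:
    """Derive the decision from the gate outcomes, identically for every trigger."""
    if not all(gate_results.values()):
        decision = "blocked"
    elif beats_champion:
        decision = "promoted"
    else:
        decision = "champion_retained"
    return PromotionRecord(
        trigger, git_sha, data_snapshot, candidate_version,
        champion_version, gate_results, decision,
    )
```

The JSON string could be stored as a tag or artifact on the experiment run, so that the question "why did this model go to production?" has a single queryable answer.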