MLOps & DevOps for AI Systems

Experiment Tracking & Model Registries


At Google, Meta, and Uber, a single production model may be the result of thousands of training runs — different feature sets, learning rates, regularisation strategies, and hardware configurations. Without disciplined tracking, you cannot answer the most basic production question: why does the model in prod behave differently from the one the data-science team demoed last week? Experiment tracking and model registries are the version control and CI artefact store of ML, and they are the foundation every other MLOps practice builds on.

What Gets Tracked and Why

Every training run produces several classes of artefact that must be captured together to make a run reproducible:

Parameters — hyperparameters passed to the training job (learning rate, batch size, regularisation coefficients, model architecture flags).
Metrics — time-series of loss, accuracy, F1, AUC, or whatever domain-specific signals matter, logged at configurable step intervals.
Artefacts — the serialised model weights (.pkl, SavedModel, ONNX), feature-engineering pipelines, tokenisers, and evaluation reports that must travel with the model.
Environment — Python version, library versions (requirements.txt or a conda environment YAML), and the Git commit SHA of the training code.
Data lineage — dataset name, version, and hash so you can prove the model was trained on a specific snapshot of the feature store.

Capturing all five categories on every run is non-negotiable at production scale. Skipping any one of them will eventually produce a run you cannot reproduce, a model you cannot audit, or a regression you cannot explain to a compliance team.
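As a minimal sketch of how the environment and data-lineage categories might be gathered in one place, the helper below builds a dictionary of tags using only the standard library. The function names (file_sha256, git_sha, collect_run_context) and the tag keys are illustrative assumptions, not part of any tracking library.

```python
import hashlib
import platform
import subprocess
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a dataset file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_sha(repo_dir: Path = Path(".")) -> str:
    """Return the current commit SHA, or "unknown" outside a Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def collect_run_context(dataset_path: Path, dataset_version: str) -> dict[str, str]:
    """Gather environment and data-lineage tags for one training run.

    The tag keys are hypothetical; pick names that suit your own conventions.
    """
    return {
        "python_version": platform.python_version(),
        "git_sha": git_sha(),
        "dataset_name": dataset_path.name,
        "dataset_version": dataset_version,
        "dataset_sha256": file_sha256(dataset_path),
    }
```

With MLflow, the resulting dictionary could be attached to the active run via mlflow.set_tags(...), so the lineage travels with the parameters and metrics rather than living in a separate log.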

MLflow: The De Facto Standard

MLflow is the most widely deployed open-source tracking solution, and its API design has influenced many of the proprietary alternatives (Weights & Biases, Neptune, Comet, Vertex AI Experiments). Understanding MLflow well makes it easier to navigate any of them. The core primitives are:

Tracking Server — stores run metadata and metrics in a backend store (SQLite locally, PostgreSQL or MySQL in production) and artefacts in an artefact store (S3, GCS, Azure Blob, or a mounted NFS volume).
Experiments — logical groupings of runs (one per model family or business objective, e.g., fraud-detection-v2).
Runs — a single training execution with its own UUID, start/end timestamps, and status (RUNNING, FINISHED, FAILED, KILLED).
Model Registry — a lifecycle manager that promotes run artefacts through named stages: Staging, Production, Archived.

# Install MLflow and start a production-grade tracking server
pip install mlflow==2.13.0 psycopg2-binary boto3

# Start the tracking server backed by PostgreSQL + S3
mlflow server \
  --backend-store-uri postgresql://mlflow:secret@postgres:5432/mlflow \
  --default-artifact-root s3://my-mlops-bucket/mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000 \
  --workers 4

# Environment variable so training code finds the server
export MLFLOW_TRACKING_URI=http://mlflow.internal:5000

Run the tracking server behind an internal load balancer — never expose it publicly. In Kubernetes, deploy it as a Deployment with 2+ replicas fronted by a ClusterIP Service and a restricted NetworkPolicy. The Postgres backend must be on the same VPC with a dedicated credentials secret.

Instrumenting a Training Run

The client API is intentionally thin. The pattern used in production codebases separates the MLflow calls from the actual training logic so the tracking code can be swapped or disabled without touching model code:

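import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# X_train, y_train, X_val, and y_val are assumed to be prepared upstream
# (for example, loaded from a versioned feature-store snapshot).

mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("fraud-detection-v2")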

params = {
    "n_estimators": 400,
    "max_depth": 6,
    "learning_rate": 0.05,
    "subsample": 0.8,
}

with mlflow.start_run(run_name="gbm-baseline-2025-06-11") as run:
    # Log parameters
    mlflow.log_params(params)

    # Log the data version so the run is reproducible
    mlflow.set_tag("dataset_version", "fraud-features-v3.2.1")
    mlflow.set_tag("git_sha", "a1b2c3d")

    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # Log scalar metrics
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)

    # Log model with an explicit signature (input/output schema) and an input example
    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=signature,
        input_example=X_train[:5],                 # small sample stored alongside the model
        registered_model_name="fraud-detection",   # auto-registers in the Model Registry
    )

    print(f"Run ID: {run.info.run_id}  AUC: {auc:.4f}")

The registered_model_name argument does two things simultaneously: it logs the model artefact under the run, and it creates or updates a registered model entry in the Model Registry with a new version pointing to the same artefact. The model version starts in the None stage.

Always pass an explicit signature and an input_example to log_model. The signature becomes the contract the serving layer validates requests against. The example is stored with the model and gives downstream CI jobs a concrete payload for schema drift checks.
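The separation between tracking calls and training logic described earlier in this section can be sketched with a small interface. Everything here is an illustrative assumption: the Tracker protocol, the InMemoryTracker and MlflowTracker classes, and the placeholder train function are hypothetical names, and the "training" is a dummy calculation standing in for real model fitting.

```python
from typing import Protocol


class Tracker(Protocol):
    """The only surface the training code is allowed to depend on."""

    def log_params(self, params: dict) -> None: ...
    def log_metric(self, key: str, value: float) -> None: ...


class InMemoryTracker:
    """Stand-in tracker for unit tests or for running with tracking disabled."""

    def __init__(self) -> None:
        self.params: dict = {}
        self.metrics: dict[str, float] = {}

    def log_params(self, params: dict) -> None:
        self.params.update(params)

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


class MlflowTracker:
    """Thin adapter; mlflow is imported lazily so this module loads without it."""

    def __init__(self) -> None:
        import mlflow  # deferred import: only needed when this tracker is used

        self._mlflow = mlflow

    def log_params(self, params: dict) -> None:
        self._mlflow.log_params(params)

    def log_metric(self, key: str, value: float) -> None:
        self._mlflow.log_metric(key, value)


def train(params: dict, tracker: Tracker) -> float:
    """Hypothetical training routine that talks only to the Tracker interface."""
    tracker.log_params(params)
    # Placeholder for real model fitting; the score is a dummy value.
    score = 1.0 - params["learning_rate"]
    tracker.log_metric("val_score", score)
    return score
```

Swapping MlflowTracker for InMemoryTracker (or an adapter for another vendor) then requires no change to train, which is the property the pattern is after.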

The Model Registry: Promotion Workflow

The Model Registry is where a raw run artefact becomes a governed, deployable entity. The standard promotion ladder in enterprise MLOps teams is:

Raw training run (None) → CI gates → Staging → human approval → Production → Archived on retirement.
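The ladder itself is a team convention, so a release script may want to check the order before asking the registry to move a version. The sketch below is one hedged way to do that; the ALLOWED_TRANSITIONS table and check_transition function are hypothetical, and the specific set of permitted moves is an assumption that each team would adapt.

```python
# Hypothetical policy table: which target stages are reachable from each stage.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


def check_transition(current: str, target: str) -> None:
    """Raise ValueError if moving from current to target would skip the ladder."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown stage: {current!r}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"transition {current} -> {target} is not allowed")
```

A CI job could call check_transition("None", "Staging") before the registry call and fail fast on something like a direct None to Production jump.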

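Promotion is executed via the MLflow client or REST API, which makes it automatable from a CI pipeline. The stage-based calls below still run on the pinned MLflow 2.13.0, but model stages have been deprecated since MLflow 2.9 in favour of model version aliases and tags, so expect a deprecation warning: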

from mlflow import MlflowClient

client = MlflowClient("http://mlflow.internal:5000")

# Transition the freshly registered version to Staging
client.transition_model_version_stage(
    name="fraud-detection",
    version="42",
    stage="Staging",
    archive_existing_versions=False,   # keep previous Staging for comparison
)

# After CI gate passes, promote to Production and archive the old one
client.transition_model_version_stage(
    name="fraud-detection",
    version="42",
    stage="Production",
    archive_existing_versions=True,    # automatically retire the previous Production version
)

# Add an approval annotation (auditable)
client.update_model_version(
    name="fraud-detection",
    version="42",
    description="Approved by @alice 2025-06-11. AUC 0.9712 vs 0.9680 baseline. Shadow run clean.",
)

Production footgun: archive_existing_versions=True on promotion is convenient but risky in fast-moving teams — it archives every version in that stage, including canary or shadow versions. In orgs with multi-variant serving, set it to False and manage archival explicitly via a release management script that knows which versions are actively receiving traffic.
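A release management script of the kind described above mainly needs one decision: which Production versions are safe to retire. The following is a minimal sketch of that decision only. VersionInfo and versions_to_archive are hypothetical names, and the set of versions currently receiving traffic is assumed to come from the serving tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionInfo:
    """Hypothetical record of a registered model version and its stage."""

    version: str
    stage: str


def versions_to_archive(
    versions: list[VersionInfo],
    promoted: str,
    serving: set[str],
) -> list[str]:
    """Return Production versions that are neither newly promoted nor serving traffic."""
    return [
        v.version
        for v in versions
        if v.stage == "Production"
        and v.version != promoted
        and v.version not in serving
    ]
```

With MLflow, the version list could be built from MlflowClient.search_model_versions, and each returned version archived with an explicit transition to Archived while archive_existing_versions stays False.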

Comparing Runs and Enforcing Quality Gates

The value of tracking only materialises when it drives automated decisions. A typical CI job in a GitHub Actions or Argo Workflows pipeline will:

Query the Registry for the current Production model version and fetch its validation metrics.
Compare the candidate run's metrics against those baseline numbers using a configurable threshold (e.g., AUC must not regress more than 0.5%, P95 inference latency must not increase more than 10 ms).
Run a bias/fairness evaluation across protected attribute slices and fail the gate if any slice diverges beyond a defined bound.
Validate the model signature matches the feature store schema for the target serving environment.
Only on a clean pass: call transition_model_version_stage to promote to Staging, then trigger a shadow-traffic evaluation in the serving tier.

This CI gate replaces the informal "looks good to me" review that causes regressions in prod. At Meta and Airbnb, the gate is enforced automatically: humans approve only the final Production promotion, not the Staging transition.
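The metric and schema checks in the steps above can be sketched as a pure function that a CI job calls with metrics fetched from the registry. This is an illustration under stated assumptions: evaluate_gate, GateResult, and the metric keys val_auc and p95_latency_ms are hypothetical, the 0.5% AUC tolerance is interpreted as relative to the baseline, and the fairness evaluation is left out.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def evaluate_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    expected_columns: list[str],
    candidate_columns: list[str],
    max_auc_regression: float = 0.005,      # assumed relative tolerance (0.5%)
    max_latency_increase_ms: float = 10.0,  # absolute tolerance in milliseconds
) -> GateResult:
    """Compare a candidate against the Production baseline and collect failures."""
    failures: list[str] = []

    # AUC may drop by at most max_auc_regression relative to the baseline.
    auc_floor = baseline["val_auc"] * (1.0 - max_auc_regression)
    if candidate["val_auc"] < auc_floor:
        failures.append(f"val_auc {candidate['val_auc']:.4f} below floor {auc_floor:.4f}")

    # P95 latency may rise by at most max_latency_increase_ms.
    latency_ceiling = baseline["p95_latency_ms"] + max_latency_increase_ms
    if candidate["p95_latency_ms"] > latency_ceiling:
        failures.append(
            f"p95_latency_ms {candidate['p95_latency_ms']:.1f} above ceiling {latency_ceiling:.1f}"
        )

    # The model's input columns must match the serving schema in name and order.
    if list(candidate_columns) != list(expected_columns):
        failures.append("model signature columns do not match the serving schema")

    return GateResult(passed=not failures, failures=failures)
```

Returning every failure rather than stopping at the first keeps the CI log useful: a candidate that regresses on both AUC and latency reports both in a single run.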

Scaling the Tracking Server

At organisations running tens of thousands of training jobs per day (Google Brain, OpenAI), a single MLflow server becomes a bottleneck. Three production patterns are common: horizontal scaling behind an L7 load balancer with sticky sessions disabled (all state lives in Postgres and S3); read replicas of the Postgres backend to serve UI queries; and artefact upload paths that bypass the tracking server and write directly to S3, for example with pre-signed URLs. Note that --default-artifact-root is a server-wide default, not a per-experiment setting. To route one experiment's large model weights to a different bucket or storage tier, pass artifact_location when creating that experiment with mlflow.create_experiment, which overrides the server default for that experiment.

Enforce a data-retention policy on runs. A Postgres backend that accumulates unlimited metric steps will grow to hundreds of gigabytes within months on an active team. Run mlflow gc on a cron schedule to permanently remove runs that have already been marked as deleted, and archive completed runs older than 90 days to cold storage. Tag runs with a convention such as ttl=30d (a team-defined tag, not a built-in MLflow feature) so automated retention scripts can tell ephemeral experiment runs apart from promoted model candidates that must be retained for compliance.
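A retention script built on the ttl tag needs little more than a parser and an expiry check. The sketch below assumes the tag format is a whole number of days followed by "d" and that untagged runs are always retained; parse_ttl and is_expired are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed tag format: an integer number of days followed by "d", e.g. "30d".
_TTL_PATTERN = re.compile(r"^(\d+)d$")


def parse_ttl(tag_value: str) -> timedelta:
    """Convert a tag such as "30d" into a timedelta; reject anything else."""
    match = _TTL_PATTERN.match(tag_value.strip())
    if match is None:
        raise ValueError(f"unsupported ttl format: {tag_value!r}")
    return timedelta(days=int(match.group(1)))


def is_expired(end_time: datetime, tags: dict[str, str], now: datetime) -> bool:
    """Return True when a run carries a ttl tag and has outlived it."""
    ttl = tags.get("ttl")
    if ttl is None:
        return False  # no ttl tag: treat the run as a candidate to retain
    return now - end_time > parse_ttl(ttl)


if __name__ == "__main__":
    now = datetime(2025, 6, 11, tzinfo=timezone.utc)
    finished = datetime(2025, 5, 1, tzinfo=timezone.utc)
    print(is_expired(finished, {"ttl": "30d"}, now))
```

Runs flagged by is_expired could then be soft-deleted with MlflowClient.delete_run, which leaves the permanent cleanup to the scheduled mlflow gc job.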

Alternatives: When to Choose Something Else

MLflow is not always the right tool. Use Weights & Biases when the team works heavily in deep learning and needs rich real-time loss curve visualisation with GPU utilisation dashboards; its UI is significantly more polished. Use Vertex AI Experiments or SageMaker Experiments when the entire training infrastructure is cloud-vendor-managed and the operational overhead of a self-hosted tracking server is unacceptable. Use DVC when the primary concern is data and pipeline versioning with Git integration rather than metric tracking. DVC does not use Git LFS: it commits small pointer files to Git and stores the artefacts themselves in remote storage such as S3, which is similar in spirit. In practice, many large platforms combine MLflow for metric and model governance with DVC for data lineage.