Experiment Tracking & Model Registries
At Google, Meta, and Uber, a single production model may be the result of thousands of training runs — different feature sets, learning rates, regularisation strategies, and hardware configurations. Without disciplined tracking, you cannot answer the most basic production question: why does the model in prod behave differently from the one the data-science team demoed last week? Experiment tracking and model registries are the version control and CI artefact store of ML, and they are the foundation every other MLOps practice builds on.
What Gets Tracked and Why
Every training run produces several classes of artefact that must be captured together to make a run reproducible:
- Parameters: hyperparameters passed to the training job (learning rate, batch size, regularisation coefficients, model architecture flags).
- Metrics: time-series of loss, accuracy, F1, AUC, or whatever domain-specific signals matter, logged at configurable step intervals.
- Artefacts: the serialised model weights (.pkl, SavedModel, ONNX), feature-engineering pipelines, tokenisers, and evaluation reports that must travel with the model.
- Environment: Python version, library versions (requirements.txt or a conda environment YAML), and the Git commit SHA of the training code.
- Data lineage: dataset name, version, and hash so you can prove the model was trained on a specific snapshot of the feature store.
Capturing all five categories on every run is non-negotiable at production scale. Skipping any one of them will eventually produce a run you cannot reproduce, a model you cannot audit, or a regression you cannot explain to a compliance team.
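As a rough illustration of capturing the five categories together, the sketch below uses only the standard library. The names RunRecord, file_sha256, git_commit_sha, and installed_versions are hypothetical helpers invented for this example, not part of any tracking library; a real setup would hand the same values to the tracking client.

```python
import hashlib
import platform
import subprocess
from dataclasses import dataclass, field
from importlib import metadata
from pathlib import Path
from typing import Optional


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a dataset file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_commit_sha() -> Optional[str]:
    """Return the current commit SHA, or None when not in a Git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip() or None


def installed_versions(packages: list) -> dict:
    """Look up installed versions for the libraries the training code uses."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


@dataclass
class RunRecord:
    """Hypothetical container holding one field per tracked category."""
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)
    artefacts: list = field(default_factory=list)
    environment: dict = field(default_factory=dict)
    data_lineage: dict = field(default_factory=dict)

    def missing_categories(self) -> list:
        """Names of categories left empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


def capture_environment(packages: list) -> dict:
    """Collect interpreter version, library versions, and the code commit."""
    return {
        "python": platform.python_version(),
        "libraries": installed_versions(packages),
        "git_sha": git_commit_sha(),
    }
```

A training wrapper could refuse to finish a run while missing_categories() is non-empty, which turns the "capture all five" rule into a check rather than a convention.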
MLflow: The De Facto Standard
MLflow is the most widely deployed open-source tracking solution, and its API design influenced every proprietary alternative (Weights & Biases, Neptune, Comet, Vertex AI Experiments). Understanding MLflow deeply means you can navigate any of them. The core primitives are:
- Tracking Server: stores run metadata and metrics in a backend store (SQLite locally, PostgreSQL or MySQL in production) and artefacts in an artefact store (S3, GCS, Azure Blob, or a mounted NFS volume).
- Experiments: logical groupings of runs (one per model family or business objective, e.g., fraud-detection-v2).
- Runs: a single training execution with its own UUID, start/end timestamps, and status (RUNNING, FINISHED, FAILED, KILLED).
- Model Registry: a lifecycle manager that promotes run artefacts through the named stages Staging, Production, and Archived.
On Kubernetes, run the tracking server as a Deployment with two or more replicas, fronted by a ClusterIP Service and a restrictive NetworkPolicy. Keep the Postgres backend in the same VPC and give it a dedicated credentials secret.
Instrumenting a Training Run
The client API is intentionally thin. The pattern used in production codebases separates the MLflow calls from the actual training logic so the tracking code can be swapped or disabled without touching model code:
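A minimal sketch of that separation, under stated assumptions: the Tracker protocol, NullTracker, MlflowTracker, and the toy train function are names invented for this example. The MLflow adapter assumes a scikit-learn model flavour and imports mlflow lazily, so the training logic itself runs without MLflow installed.

```python
from typing import Protocol


class Tracker(Protocol):
    """What the training code needs from a tracker; nothing MLflow-specific."""

    def log_params(self, params: dict) -> None: ...
    def log_metric(self, name: str, value: float, step: int) -> None: ...
    def log_model(self, model: object) -> None: ...


class NullTracker:
    """Disables tracking entirely, e.g. for unit tests or local debugging."""

    def log_params(self, params: dict) -> None:
        pass

    def log_metric(self, name: str, value: float, step: int) -> None:
        pass

    def log_model(self, model: object) -> None:
        pass


class MlflowTracker:
    """Adapter over the MLflow fluent API (hypothetical wrapper class)."""

    def __init__(self, experiment: str, registered_model_name: str) -> None:
        import mlflow          # lazy import: only needed when tracking is on
        import mlflow.sklearn  # assumption: the model is a scikit-learn estimator

        self._mlflow = mlflow
        self._registered_model_name = registered_model_name
        mlflow.set_experiment(experiment)

    def __enter__(self) -> "MlflowTracker":
        self._mlflow.start_run()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Mark the run FAILED if the training code raised.
        self._mlflow.end_run(status="FAILED" if exc_type else "FINISHED")

    def log_params(self, params: dict) -> None:
        self._mlflow.log_params(params)

    def log_metric(self, name: str, value: float, step: int) -> None:
        self._mlflow.log_metric(name, value, step=step)

    def log_model(self, model: object) -> None:
        # Logs the artefact under the run and registers a new model version.
        self._mlflow.sklearn.log_model(
            model, "model", registered_model_name=self._registered_model_name
        )


def train(params: dict, tracker: Tracker) -> dict:
    """Toy loop (gradient descent on (w - 3)^2) standing in for real training."""
    tracker.log_params(params)
    w = 0.0
    for step in range(params["steps"]):
        w -= params["learning_rate"] * 2.0 * (w - 3.0)
        tracker.log_metric("loss", (w - 3.0) ** 2, step)
    model = {"w": w}  # a real job would produce a fitted estimator here
    tracker.log_model(model)
    return model


# Tracking disabled: the training code is unchanged.
model = train({"learning_rate": 0.1, "steps": 50}, NullTracker())
```

With a tracking server available and a real scikit-learn estimator in place of the toy dictionary, the same train function could be called inside `with MlflowTracker("fraud-detection-v2", "fraud-detector") as tracker:`, where "fraud-detector" is a placeholder registered-model name.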
The registered_model_name argument does two things simultaneously: it logs the model artefact under the run, and it creates or updates a registered model entry in the Model Registry with a new version pointing to the same artefact. The model version starts in the None stage.
Pass a signature and an input_example to log_model. The signature becomes the contract the serving layer validates requests against. The example is used by the MLflow UI's built-in inference playground and by downstream CI jobs that do schema drift checks.
The Model Registry: Promotion Workflow
The Model Registry is where a raw run artefact becomes a governed, deployable entity. The standard promotion ladder in enterprise MLOps teams is None (a freshly registered version), then Staging, then Production, and finally Archived.
Promotion is executed via the MLflow client or REST API, which makes it automatable from a CI pipeline:
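The sketch below shows one way a CI step might wrap that call. The promote function and the ALLOWED_TRANSITIONS table are hypothetical: MLflow itself does not enforce a ladder, so the table encodes a team policy. The only MLflow call used is transition_model_version_stage, invoked on whatever client object is passed in. Recent MLflow releases mark stage transitions as deprecated in favour of aliases, so check the version in use.

```python
# Hypothetical team policy: which stage moves a CI job is allowed to make.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


def promote(client, name, version, current_stage, target_stage,
            archive_existing=False):
    """Move one model version up the ladder, refusing skipped steps.

    `client` is expected to behave like mlflow.tracking.MlflowClient,
    i.e. to expose transition_model_version_stage with these keywords.
    """
    allowed = ALLOWED_TRANSITIONS.get(current_stage, set())
    if target_stage not in allowed:
        raise ValueError(
            f"{name} v{version}: {current_stage} -> {target_stage} "
            "is not permitted by the promotion policy"
        )
    client.transition_model_version_stage(
        name=name,
        version=version,
        stage=target_stage,
        # Left False by default so other versions in the stage are untouched.
        archive_existing_versions=archive_existing,
    )
    return target_stage
```

In a pipeline, client would typically be an MlflowClient() pointed at the tracking server; because the function only depends on that one method, it can be unit-tested with a stub client.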
Setting archive_existing_versions=True on promotion is convenient but risky in fast-moving teams: it archives every other version currently in the target stage, including canary or shadow versions. In orgs with multi-variant serving, set it to False and manage archival explicitly via a release management script that knows which versions are actively receiving traffic.
Comparing Runs and Enforcing Quality Gates
The value of tracking only materialises when it drives automated decisions. A typical CI job in a GitHub Actions or Argo Workflows pipeline will:
- Query the Registry for the current Production model version and fetch its validation metrics.
- Compare the candidate run's metrics against those baseline numbers using a configurable threshold (e.g., AUC must not regress more than 0.5%, P95 inference latency must not increase more than 10 ms).
- Run a bias/fairness evaluation across protected attribute slices and fail the gate if any slice diverges beyond a defined bound.
- Validate the model signature matches the feature store schema for the target serving environment.
- Only on a clean pass: call transition_model_version_stage to promote to Staging, then trigger a shadow-traffic evaluation in the serving tier.
This CI gate replaces the informal "looks good to me" review that causes regressions in prod. At Meta and Airbnb, the gate is enforced automatically: humans approve only the final Production promotion, not the Staging transition.
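The comparison step of such a gate can be sketched as a pure function. Everything named here is an assumption made for illustration: the metric keys ("auc", "p95_latency_ms"), the GateThresholds class, the reading of "0.5%" as a relative AUC drop, and the 0.05 slice bound. Fetching the metrics from the Registry is left out so the logic stays testable in isolation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    max_relative_auc_drop: float = 0.005       # 0.5%, read as a relative drop
    max_p95_latency_increase_ms: float = 10.0
    max_slice_auc_gap: float = 0.05            # hypothetical fairness bound


def evaluate_gate(baseline, candidate, slice_auc, model_schema, serving_schema,
                  thresholds=GateThresholds()):
    """Return a list of failure messages; an empty list means the gate passes.

    baseline / candidate: dicts with "auc" and "p95_latency_ms".
    slice_auc: candidate AUC per protected-attribute slice.
    model_schema / serving_schema: dicts of feature name -> dtype string.
    """
    failures = []

    # Metric regression against the current Production baseline.
    auc_drop = (baseline["auc"] - candidate["auc"]) / baseline["auc"]
    if auc_drop > thresholds.max_relative_auc_drop:
        failures.append(f"AUC regressed by {auc_drop:.2%}")

    latency_increase = candidate["p95_latency_ms"] - baseline["p95_latency_ms"]
    if latency_increase > thresholds.max_p95_latency_increase_ms:
        failures.append(f"P95 latency increased by {latency_increase:.1f} ms")

    # Fairness: no slice may diverge too far from the overall candidate AUC.
    for slice_name, auc in slice_auc.items():
        if abs(candidate["auc"] - auc) > thresholds.max_slice_auc_gap:
            failures.append(f"slice '{slice_name}' diverges from overall AUC")

    # Signature must match what the serving environment will send.
    if model_schema != serving_schema:
        failures.append("model signature does not match serving schema")

    return failures
```

A CI job would call the promotion step only when the returned list is empty, and print the messages as the job's failure reason otherwise.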
Scaling the Tracking Server
At organisations running tens of thousands of training jobs per day (Google Brain, OpenAI), a single MLflow server is a bottleneck. The common production patterns are: horizontal scaling behind an L7 load balancer with sticky sessions disabled (all state is in Postgres + S3), read replicas for the Postgres backend serving UI queries, and separate artefact upload paths that bypass the tracking server entirely and write directly to S3 using pre-signed URLs. The server's --default-artifact-root flag sets the default artefact location for new experiments; an individual experiment can be pointed at its own S3 URI through the artifact_location argument of mlflow.create_experiment, which lets you route large model weights to a different bucket or storage tier.
Run mlflow gc (which permanently removes runs already marked as deleted) on a cron schedule, and archive completed runs older than 90 days to cold storage. Tag runs with ttl=30d so automated retention scripts can distinguish ephemeral experiment runs from promoted model candidates that must be retained for compliance.
Alternatives: When to Choose Something Else
MLflow is not always the right tool. Use Weights & Biases when the team relies heavily on deep learning and needs rich real-time loss curve visualisation with GPU utilisation dashboards; its UI is significantly more polished. Use Vertex AI Experiments or SageMaker Experiments when the entire training infrastructure is cloud-vendor-managed and the operational overhead of a self-hosted tracking server is unacceptable. Use DVC when the primary concern is data and pipeline versioning with Git integration rather than metric tracking: it commits small pointer files to Git and keeps the large artefacts in remote storage, similar in spirit to Git LFS. In practice, most large platforms combine MLflow for metric/model governance with DVC for data lineage.