MLOps & DevOps for AI Systems

Monitoring Models in Production


You already run a mature observability stack: Prometheus scrapes infrastructure metrics, Loki aggregates logs, Tempo traces requests end-to-end, and your SLO dashboards page on-call when error rates spike. That stack is necessary but not sufficient for ML systems. A deployed model can silently degrade for weeks — serving confidently wrong predictions — while every infrastructure metric stays green. CPU, memory, request latency, and error rate all look nominal because the model is answering; it is just answering incorrectly. This is the central operational challenge of production ML: the model is the bug, and your existing monitors cannot see it.

Production model monitoring is the discipline of detecting three categories of silent failure: distribution shift in the incoming data (data drift), shift over time in the distribution of the model's own outputs (prediction drift), and degradation in measured accuracy once ground-truth labels eventually arrive (model performance monitoring). Each requires different detection machinery and a different operational response.

Data Drift: When the World Changes

A model learns a function from a training distribution. Once real-world inputs start to look different from that distribution, the model is being asked to extrapolate, and its predictions are no longer backed by anything it saw in training. This is data drift, also called covariate shift: the marginal distribution of input features P(X) changes without a corresponding change in the label-generating process P(Y|X).

At big-tech scale, data drift is detected by comparing statistical properties of a reference window (typically the training or validation dataset) to a live inference window (the last N requests, usually 1–24 hours of traffic). The key statistics to track per feature are:

Continuous features: Population Stability Index (PSI), Kolmogorov-Smirnov statistic, Wasserstein distance, mean and standard deviation shift.
Categorical features: Chi-squared test, Jensen-Shannon divergence on the category frequency distribution, new category detection (unseen values at inference time).
Multivariate: Maximum Mean Discrepancy (MMD) or a learned drift detector — a classifier trained to distinguish reference from live samples. If it succeeds with high accuracy, drift is present.
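As an illustration of the per-feature statistics listed above, the following is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic, Jensen-Shannon divergence over category frequencies, and new-category detection. The function names are hypothetical, and a production job would more likely rely on a library such as SciPy or Evidently; this version only shows what the numbers mean.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def js_divergence(reference, current):
    """Jensen-Shannon divergence (log base 2, so bounded in [0, 1]) of category frequencies."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = set(ref_counts) | set(cur_counts)
    p = {c: ref_counts[c] / len(reference) for c in categories}
    q = {c: cur_counts[c] / len(current) for c in categories}
    m = {c: 0.5 * (p[c] + q[c]) for c in categories}

    def kl(a, b):
        # Terms with a[c] == 0 contribute nothing to the KL sum.
        return sum(a[c] * math.log2(a[c] / b[c]) for c in categories if a[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def unseen_categories(reference, current):
    """Category values that appear at inference time but never in the reference window."""
    return set(current) - set(reference)
```

A KS statistic near 0 means the two windows have nearly identical distributions for that feature, while a value near 1 means they barely overlap. Any non-empty result from unseen_categories is usually worth surfacing on its own, since a model has no learned behavior for values it never saw.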

Key idea — PSI threshold conventions: PSI below 0.1 is negligible drift, 0.1–0.2 warrants investigation, above 0.2 is significant and should trigger a retraining workflow. These thresholds were established in credit scoring (where PSI originated) but are widely reused across ML verticals. Use them as starting points, then calibrate to your model's observed sensitivity.
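To make these thresholds concrete, here is a minimal pure-Python sketch of PSI for one continuous feature. It bins both windows on the reference window's quantiles and sums (current - reference) * ln(current / reference) over the bins. The names psi and classify_psi, the choice of ten bins, and the eps floor for empty bins are assumptions made for illustration.

```python
import math
from bisect import bisect_right


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`."""
    ref_sorted = sorted(reference)
    # Interior bin edges sit at the reference quantiles.
    edges = [ref_sorted[int(len(ref_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


def classify_psi(value):
    """Map a PSI value onto the conventional bands described above."""
    if value < 0.1:
        return "negligible"
    if value <= 0.2:
        return "investigate"
    return "significant"


if __name__ == "__main__":
    reference = list(range(1000))
    shifted = [v + 500 for v in reference]
    print(classify_psi(psi(reference, reference)))  # identical windows
    print(classify_psi(psi(reference, shifted)))    # heavily shifted window
```

PSI is sensitive to the binning scheme and to the eps floor, so values computed with different settings are not directly comparable. Fix both when you calibrate thresholds for a given model.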

The operational toolchain for drift detection is built around open-source libraries such as Evidently AI (generates HTML/JSON drift reports from pandas DataFrames), NannyML (includes label-free performance estimation), and whylogs (lightweight statistical profiles of logged data), and around managed platforms such as Arize AI, WhyLabs, and Amazon SageMaker Model Monitor. In a Kubernetes deployment, you typically run a sidecar or a separate batch job that reads inference logs from your inference store or a Kafka topic, computes drift statistics, and pushes them to Prometheus as custom metrics that your existing Grafana dashboards can consume.

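# evidently_drift_job.py: run as a Kubernetes CronJob every hour.
# Reads last-hour inference logs, computes drift vs. training reference data.
# Written against Evidently's Report/ColumnMapping API (0.4.x line); later
# releases reorganized these imports, so pin the version in your image.

import os
from datetime import datetime, timedelta, timezone

import pandas as pd
import requests
from evidently import ColumnMapping
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# Configuration comes from the CronJob environment; defaults match this lesson.
MODEL_NAME = os.environ.get("MODEL_NAME", "fraud_v3")
REFERENCE_PATH = os.environ.get(
    "REFERENCE_PATH", "s3://ml-features/reference/train_features_v3.parquet")
INFERENCE_BUCKET = os.environ.get("INFERENCE_BUCKET", "s3://ml-features/inference")
PUSHGATEWAY_URL = os.environ.get("PUSHGATEWAY_URL", "http://pushgateway:9091")
ARGO_SERVER = os.environ.get("ARGO_SERVER", "http://argo-server:2746")

# Load reference (training snapshot) and current (previous hour of inference logs).
# Assumes hourly partitions named in UTC, e.g. 2024-01-15T14.parquet.
last_hour = datetime.now(timezone.utc) - timedelta(hours=1)
reference = pd.read_parquet(REFERENCE_PATH)
current = pd.read_parquet(f"{INFERENCE_BUCKET}/{last_hour:%Y-%m-%dT%H}.parquet")

column_mapping = ColumnMapping(
    target=None,                        # ground truth not yet available
    prediction="predicted_score",
    numerical_features=["age", "credit_balance", "txn_count_7d"],
    categorical_features=["country_code", "device_type"],
)

# dataset_drift is True once at least 30% of the columns have drifted
report = Report(metrics=[DataDriftPreset(drift_share=0.3)])
report.run(reference_data=reference, current_data=current,
           column_mapping=column_mapping)

drift_result = report.as_dict()["metrics"][0]["result"]
drift_detected = drift_result["dataset_drift"]
share_drifted = drift_result["share_of_drifted_columns"]

# Push custom metric to Prometheus Pushgateway
payload = (
    "# HELP ml_data_drift_share Fraction of features with detected drift\n"
    "# TYPE ml_data_drift_share gauge\n"
    f'ml_data_drift_share{{model="{MODEL_NAME}",env="prod"}} {share_drifted}\n'
)
requests.post(f"{PUSHGATEWAY_URL}/metrics/job/ml_drift_job",
              data=payload, timeout=10).raise_for_status()

if drift_detected:
    # Trigger Argo Workflows retraining DAG via the REST submit endpoint.
    # ARGO_TOKEN must be injected into the pod (for example, from a Secret).
    requests.post(
        f"{ARGO_SERVER}/api/v1/workflows/ml-platform/submit",
        json={"resourceKind": "WorkflowTemplate",
              "resourceName": "fraud-retrain-v3",
              "submitOptions": {"labels": "trigger=drift"}},
        headers={"Authorization": f"Bearer {os.environ['ARGO_TOKEN']}"},
        timeout=10,
    ).raise_for_status()
    print(f"Drift detected ({share_drifted:.1%} of features). Retraining triggered.")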

Prediction Drift: Monitoring Model Outputs

Prediction drift is a shift over time in the distribution of the model's own output scores or class probabilities. It is related to, but not the same as, concept drift: concept drift is a change in P(Y|X) and can only be confirmed once labels arrive, whereas a shifted output distribution may reflect data drift, concept drift, or an upstream pipeline change. Watching outputs is valuable precisely because it needs no ground-truth labels. The model's outputs are available the instant an inference is served.

Common signals to track in your Grafana dashboards:

Score distribution shift: The histogram of predicted probabilities should be stable. A sudden shift toward higher or lower mean scores is a red flag — e.g., a fraud model that starts scoring 90% of transactions as high-risk has almost certainly drifted.
Class imbalance in predictions: Track the fraction of requests landing in each predicted class. A binary model that was predicting 2% positive in January but is predicting 8% positive in March has changed behavior regardless of whether the world changed or the model degraded.
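Confidence distribution: Track the fraction of predictions in each confidence bucket (e.g., 90–95%, 95–99%, 99–100%). For a healthy model this distribution should be stable over time. This is a label-free proxy only: checking calibration itself (whether predictions made with 90% confidence are correct about 90% of the time) requires ground-truth labels.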
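The signals above can be computed from nothing more than a window of logged scores. The sketch below assumes a binary classifier that emits a probability per request; summarize_scores, compare_windows, the 0.5 decision threshold, and the rate-ratio limit of 2.0 are hypothetical choices for illustration, while the 0.15 mean-shift limit mirrors the alert rule used later in this lesson.

```python
from statistics import mean

# Confidence bucket edges; the upper three buckets mirror the ones named above.
BUCKET_EDGES = [0.0, 0.90, 0.95, 0.99, 1.0]


def summarize_scores(scores, threshold=0.5):
    """Label-free summary of one window of binary-classifier scores in [0, 1]."""
    n = len(scores)
    # Confidence in whichever class the model predicted.
    confidences = [max(s, 1.0 - s) for s in scores]
    buckets = {}
    for lo, hi in zip(BUCKET_EDGES, BUCKET_EDGES[1:]):
        top = hi == BUCKET_EDGES[-1]  # the last bucket includes its upper edge
        count = sum(1 for c in confidences if lo <= c < hi or (top and c == hi))
        buckets[f"{lo:.2f}-{hi:.2f}"] = count / n
    return {
        "mean_score": mean(scores),
        "positive_rate": sum(1 for s in scores if s >= threshold) / n,
        "confidence_buckets": buckets,
    }


def compare_windows(baseline, current, max_mean_shift=0.15, max_rate_ratio=2.0):
    """Return the names of the signals that moved more than the allowed amount."""
    flags = []
    if abs(current["mean_score"] - baseline["mean_score"]) > max_mean_shift:
        flags.append("mean_score")
    base_rate, cur_rate = baseline["positive_rate"], current["positive_rate"]
    if base_rate > 0:
        ratio = cur_rate / base_rate
        if ratio > max_rate_ratio or ratio < 1.0 / max_rate_ratio:
            flags.append("positive_rate")
    elif cur_rate > 0:
        flags.append("positive_rate")
    return flags


if __name__ == "__main__":
    january = summarize_scores([0.01] * 98 + [0.9] * 2)  # 2% predicted positive
    march = summarize_scores([0.01] * 92 + [0.9] * 8)    # 8% predicted positive
    print(compare_windows(january, march))
```

In this example the mean score barely moves, yet the predicted positive rate quadruples, which is why it helps to track both signals rather than the mean alone.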

Production practice: Log every inference, including input features, model version, prediction, confidence score, and timestamp, to a dedicated inference store (a partitioned S3 table or a BigQuery/Redshift dataset). This is the foundation of all downstream monitoring and is also essential for debugging production incidents. Engineers at Meta's ML platform log 100% of inference calls for high-stakes models; some teams sample at 10% for very high-throughput models to control storage costs. Whatever rate you choose, size it so that every monitoring window still contains enough samples for the drift tests to have statistical power; as a rule of thumb, do not sample below 1%.
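A minimal sketch of such a log record, together with a deterministic sampler for high-throughput models, might look like the following. The InferenceRecord fields follow the list above; the request_id field, the should_log helper, and the hash-based sampling scheme are assumptions for illustration, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class InferenceRecord:
    request_id: str      # hypothetical join key for deferred labels
    model_version: str
    features: dict
    prediction: float
    confidence: float
    timestamp: str       # ISO 8601, UTC


def should_log(request_id, sample_rate):
    """Hash-based sampling: the same request id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the edges (rate 0.0 or 1.0).
    return bucket < int(sample_rate * 2**64)


def to_log_line(record):
    """Serialize one record as a single JSON line for the inference store."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    record = InferenceRecord(
        request_id="req-000123",
        model_version="fraud_v3",
        features={"age": 41, "country_code": "DE"},
        prediction=0.07,
        confidence=0.93,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    if should_log(record.request_id, sample_rate=1.0):
        print(to_log_line(record))
```

Sampling on a hash of the request id, rather than on a random draw, keeps the decision reproducible across services, so a sampled request is logged everywhere it passes and can still be joined to its label later.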

Model Performance Monitoring: The Ground-Truth Problem

The hardest problem in production model monitoring is measuring actual accuracy, because ground-truth labels are delayed or never arrive. When you serve a fraud prediction, you will not know the true outcome (fraud or not) until the transaction is resolved, which can take days or weeks. When you serve a churn prediction, the true outcome is not known until the customer churns or the prediction horizon passes, which may be months away. This is the label latency problem.

The operational strategies to cope with it are:

Proxy metrics: Identify a faster signal that correlates with model quality. For a recommendation model, click-through rate and watch time are proxies for relevance. For a search ranking model, click-based ranking metrics (for example, NDCG computed with clicks as the relevance signal) are a proxy for ranking quality. Build alerts on proxy metrics first.
Label pipelines: Build automated pipelines that join deferred outcomes back to inference logs and compute accuracy on a rolling window. At Google, these pipelines are a first-class part of the ML platform; the latency-weighted evaluation score is computed continuously and compared to a performance SLO.
Shadow evaluation: When retraining produces a challenger model, run it in shadow mode (receives live traffic, predictions logged but not served to users) and compare its deferred-label accuracy to the champion model over a statistically meaningful window before promoting it.
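Holdout sets: Reserve a fully labeled slice of data that you never use for training, and periodically re-evaluate the production model against it. An unchanged model scores identically on a frozen holdout, so a drop there points to a change in the serving path (feature computation, preprocessing, or the deployed artifact) rather than to drift. To detect degradation against current data, refresh the holdout regularly with recent labeled examples and track the score over time.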
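The label pipeline described above reduces to a join followed by a metric. The sketch below assumes each inference log entry carries a request_id and a score, and that deferred labels arrive as a mapping from request_id to 0 or 1; these field names and the helper functions are hypothetical. AUROC is computed by direct pairwise comparison, which is quadratic and only suitable for illustration.

```python
def join_labels(inference_logs, labels):
    """Inner join: keep only predictions whose ground truth has already arrived."""
    return [
        (rec["score"], labels[rec["request_id"]])
        for rec in inference_logs
        if rec["request_id"] in labels
    ]


def label_coverage(inference_logs, labels):
    """Fraction of logged predictions that currently have a label."""
    matched = sum(1 for rec in inference_logs if rec["request_id"] in labels)
    return matched / len(inference_logs)


def auroc(scored_labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in scored_labels if y == 1]
    neg = [s for s, y in scored_labels if y == 0]
    if not pos or not neg:
        return None  # undefined until both classes are present in the window
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    logs = [
        {"request_id": "a", "score": 0.9},
        {"request_id": "b", "score": 0.2},
        {"request_id": "c", "score": 0.7},
        {"request_id": "d", "score": 0.4},
    ]
    labels = {"a": 1, "b": 0, "c": 0}  # the label for "d" has not arrived yet
    print(label_coverage(logs, labels), auroc(join_labels(logs, labels)))
```

Reporting label coverage next to the metric matters: an AUROC computed on the small fraction of requests that resolved quickly can differ systematically from the value you will see once the slow-resolving cases are labeled.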

Figure: Production model monitoring architecture. Inference logs feed both the drift detector (no labels needed) and the performance evaluator (requires deferred ground-truth labels), with alerts routing to PagerDuty and automated retraining.

Feedback Loops: When Monitoring Changes the Model

A uniquely dangerous failure mode in production ML is the closed-loop feedback problem: the model's own predictions influence the data that will be used to evaluate or retrain it. This is not a hypothetical; it has caused costly ML incidents at companies operating at scale.

Consider a loan approval model. It is trained on historical approval decisions and repayment outcomes. When deployed, it begins rejecting applicants who might actually repay, but because they were rejected, no repayment outcome is ever observed for them. The next training dataset therefore contains no labeled outcomes at all for that population, the model has nothing to correct its rejection bias and may learn an even stronger one, and the feedback loop amplifies the original error. This is selection bias, a form of survivorship bias, in the training data.

For recommendation and ranking models, the equivalent is the popularity bias feedback loop: the model surfaces popular items, users click popular items, the click data trains the next model to surface even more popular items, and niche items get progressively buried regardless of their intrinsic quality.

Operational mitigations — all requiring deliberate engineering, not just better tooling — include:

Counterfactual logging: Log what the model would have predicted for options it did not select (for ranking models) or applicants it would have rejected (for decision models). This requires dedicated infrastructure to score counterfactual inputs at serving time.
Exploration policies: Inject a small fraction of random or diverse decisions (epsilon-greedy or Thompson sampling) to ensure the model continues to see labels for the full input distribution. Netflix and Spotify both use exploration budgets in their recommendation systems.
Retraining dataset audits: Before each retraining run, run a coverage check: does the training dataset contain sufficient examples of low-confidence regions? Enforce a minimum coverage floor as a training pipeline gate.
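As one concrete instance of an exploration policy, here is a minimal epsilon-greedy sketch that also records the propensity of each decision, in the spirit of counterfactual logging. The function name, the 5% default exploration rate, and the (item, propensity, explored) return shape are assumptions for illustration; it is not a description of any particular company's system.

```python
import random


def choose_with_exploration(scores, epsilon=0.05, rng=random):
    """Pick an item epsilon-greedily and return the probability of that pick.

    scores: dict mapping each candidate item to its model score.
    Returns (chosen_item, propensity, explored).
    """
    items = list(scores)
    best = max(items, key=scores.get)
    explored = rng.random() < epsilon
    chosen = rng.choice(items) if explored else best
    # Any item can be drawn by the uniform exploration branch; the best item
    # is additionally selected whenever the policy exploits.
    propensity = epsilon / len(items)
    if chosen == best:
        propensity += 1.0 - epsilon
    return chosen, propensity, explored


if __name__ == "__main__":
    rng = random.Random(7)
    candidates = {"item_a": 0.91, "item_b": 0.40, "item_c": 0.15}
    for _ in range(5):
        print(choose_with_exploration(candidates, epsilon=0.2, rng=rng))
```

Logging the propensity next to each decision lets later training or evaluation reweight outcomes by one over the propensity, so the small explored slice can stand in for the parts of the input space the greedy policy never visits.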

Production pitfall — alert fatigue from overly sensitive drift thresholds: Setting PSI alert thresholds too low will produce continuous false positives, especially for models with seasonal or weekly traffic patterns (e.g., a model trained on Monday business traffic will naturally show feature drift on Saturday). Segment your reference windows by day-of-week or time-of-day, and use adaptive thresholds that account for expected cyclical variation, before paging an on-call engineer. A drift alert that fires every Saturday morning and is always suppressed without investigation is worse than no alert — it trains engineers to ignore the signal.
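Segmenting the reference window by day of week can be as simple as bucketing reference samples by weekday and comparing each live window only against the matching bucket. The sketch below assumes reference samples arrive as (ISO timestamp, feature value) pairs; both helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def segment_by_weekday(records):
    """Group (iso_timestamp, value) pairs by weekday (0 = Monday, 6 = Sunday)."""
    segments = defaultdict(list)
    for ts, value in records:
        segments[datetime.fromisoformat(ts).weekday()].append(value)
    return segments


def reference_for(window_start_iso, segments):
    """Reference values observed on the same weekday as the live window."""
    return segments.get(datetime.fromisoformat(window_start_iso).weekday(), [])


if __name__ == "__main__":
    history = [
        ("2024-01-13T10:00:00", 12.0),   # Saturday
        ("2024-01-15T10:00:00", 140.0),  # Monday
        ("2024-01-15T11:00:00", 155.0),  # Monday
    ]
    segments = segment_by_weekday(history)
    # A live window starting on Saturday 2024-01-20 is compared to Saturday history only.
    print(reference_for("2024-01-20T09:00:00", segments))
```

The drift statistic itself is unchanged; only the reference sample passed to it differs, which removes the weekly cycle from the comparison.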

Kubernetes-Native Monitoring Setup

In a Kubernetes-based ML platform, model monitoring is typically implemented as a set of batch CronJob resources that write metrics to Prometheus via the Pushgateway, plus a real-time component that reads from a Kafka topic of inference events. The following Kubernetes manifest shows the CronJob pattern for hourly drift computation:

# k8s/ml-monitoring/drift-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fraud-model-drift-check
  namespace: ml-monitoring
spec:
  schedule: "0 * * * *"          # run every hour
  concurrencyPolicy: Forbid       # skip if previous run is still active
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: ml-monitoring-sa
          containers:
          - name: drift-checker
            image: ml-platform/drift-checker:v2.1.4
            env:
            - name: MODEL_NAME
              value: "fraud_v3"
            - name: REFERENCE_PATH
              value: "s3://ml-features/reference/train_features_v3.parquet"
            - name: INFERENCE_BUCKET
              value: "s3://ml-features/inference"
            - name: PUSHGATEWAY_URL
              value: "http://prometheus-pushgateway.monitoring:9091"
            - name: ARGO_SERVER
              value: "http://argo-server.argo:2746"
            - name: PSI_ALERT_THRESHOLD
              value: "0.2"
            resources:
              requests:
                memory: "2Gi"
                cpu: "500m"
              limits:
                memory: "4Gi"
                cpu: "2"
            volumeMounts:
            - name: aws-creds
              mountPath: /root/.aws
              readOnly: true
          volumes:
          - name: aws-creds
            secret:
              secretName: aws-ml-monitoring-creds
---
# PrometheusRule for drift alerting — integrates with your existing Alertmanager
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-drift-alerts
  namespace: ml-monitoring
  labels:
    prometheus: kube-prometheus
spec:
  groups:
  - name: ml.drift
    interval: 5m
    rules:
    - alert: MLDataDriftHigh
      expr: ml_data_drift_share{env="prod"} > 0.2
      for: 10m
      labels:
        severity: warning
        team: ml-platform
      annotations:
        summary: "Data drift detected for model {{ $labels.model }}"
        description: "{{ $value | humanizePercentage }} of features are drifting. The drift job triggers retraining automatically once its own dataset-drift threshold is reached."
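    # ml_prediction_mean and ml_prediction_mean_7d are not emitted by the drift job;
    # a separate exporter or recording rule must publish them with matching labels.
    - alert: MLPredictionScoreShift
      expr: abs(ml_prediction_mean{env="prod"} - ml_prediction_mean_7d{env="prod"}) > 0.15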
      for: 30m
      labels:
        severity: critical
        team: ml-platform
      annotations:
        summary: "Prediction score distribution shifted for {{ $labels.model }}"

Pro practice: tie model monitoring to your SLO framework. Define explicit Model Performance SLOs alongside your reliability SLOs. For example: "The fraud detection model AUROC must remain above 0.91 on the rolling 7-day labeled evaluation window for 99.5% of evaluation windows each month." When the error budget for that SLO is exhausted, the response is a retrain-and-promote incident, managed the same way you manage reliability incidents, with an incident commander, a timeline, and a postmortem. This organizational alignment between SRE and ML engineering is one of the markers of a mature ML operations practice.
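The arithmetic behind such an SLO can be sketched as follows. This reads the 99.5% figure as "99.5% of evaluation windows in the month must have AUROC above the floor", which is one plausible interpretation rather than the only one; the function name and the returned fields are hypothetical.

```python
def performance_slo_status(window_aurocs, auroc_floor=0.91, slo_target=0.995):
    """Compliance and error-budget consumption for one month of evaluation windows."""
    total = len(window_aurocs)
    # A window is bad when AUROC is not strictly above the floor.
    bad = sum(1 for a in window_aurocs if a <= auroc_floor)
    compliance = (total - bad) / total
    allowed_bad = (1.0 - slo_target) * total  # the error budget, in windows
    if allowed_bad > 0:
        budget_consumed = bad / allowed_bad
    else:
        budget_consumed = float("inf") if bad else 0.0
    return {
        "compliance": compliance,
        "slo_met": compliance >= slo_target,
        "budget_consumed": budget_consumed,
    }


if __name__ == "__main__":
    # 720 hourly evaluations in a 30-day month, two of them below the floor.
    month = [0.93] * 718 + [0.90] * 2
    print(performance_slo_status(month))
```

With hourly evaluation, a 99.5% target leaves a budget of 3.6 bad windows per 30-day month, so two bad hours consume a little over half of it and a fourth bad hour breaches the SLO.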

Putting It Together: A Monitoring Checklist

Before any model goes to production, validate that the following monitoring contracts are in place:

Inference logging: every request logged (or a documented sample for very high-throughput models) with features, prediction, confidence, model version, and timestamp.
Data drift job: hourly PSI/KS computation vs. training reference, custom metric pushed to Prometheus, alert when a feature's PSI exceeds 0.2 or the share of drifted features crosses its threshold.
Prediction drift dashboard: rolling score distribution histogram and class balance tracked in Grafana.
Label pipeline: automated join of deferred ground-truth labels to inference logs, rolling AUROC/F1/RMSE computed and tracked against a performance SLO.
Feedback loop audit: documented analysis of whether the model's decisions influence its own training data, with explicit mitigation (exploration policy or counterfactual logging) if so.
Retraining trigger: automated retraining workflow (Argo, SageMaker Pipelines) wired to drift alerts, not just run on a fixed schedule.
Champion/challenger shadow evaluation: every new model version runs in shadow mode against live traffic for a minimum evaluation window before promotion.