MLOps & DevOps for AI Systems

Model Serving Patterns

18 min Lesson 6 of 28

Getting a trained model out of your experiment environment and into a path where it actually serves predictions to users is deceptively hard. The training job finished successfully, the evaluation metrics passed your thresholds, the model was promoted in the registry — and now the real work begins. Serving a model at production scale requires you to reason simultaneously about latency budgets, throughput ceilings, hardware cost, deployment topology, and failure semantics. This lesson covers the three canonical serving patterns you will encounter and the infrastructure decisions that determine whether each one fits your use case.

Real-Time Serving (Online Inference)

Real-time serving answers a prediction request synchronously within a latency budget, typically ranging from single-digit milliseconds for simple models to a few hundred milliseconds for large deep learning models. The client blocks on the response. This is the right pattern when the prediction is needed at request time: fraud detection on a payment, product recommendation on a page load, content moderation on a post submission.

The serving stack for real-time inference has four layers. The first is familiar from conventional microservices; the other three are ML-specific:

Load balancer / API gateway: Routes prediction requests, enforces auth, applies rate limits, and provides the external HTTPS endpoint. No change from your existing services.
Model server: The process that holds the model weights in memory (GPU VRAM or CPU RAM) and performs the forward pass. Production-grade model servers — Triton Inference Server, TorchServe, TF Serving, vLLM — expose gRPC and REST endpoints, support batching, handle multiple model versions concurrently, and emit Prometheus metrics. Do not wrap a raw Python script in a Flask app and call it a model server; that pattern collapses under any real load.
Feature store (online path): For models that need pre-computed features not present in the request payload, the model server fetches them from a low-latency feature store (Redis, Feast online store, Tecton). This is the single biggest latency contributor outside the inference itself; keep it sub-5 ms.
Model registry integration: The serving layer watches for new model versions in the registry and hot-swaps them without downtime. Triton supports this natively; TF Serving polls the model directory.
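The request path through these layers can be sketched in a few lines of Python. This is a minimal illustration, not a model server: `serve_request`, `fake_store`, and the lambda standing in for the forward pass are hypothetical names, and the 5 ms feature budget is taken from the guideline above. The point is that the feature lookup and the inference are timed separately, so you can see which one is consuming the latency budget.

```python
import time
from typing import Callable, List


def serve_request(
    entity_id: str,
    request_features: List[float],
    fetch_online_features: Callable[[str], List[float]],
    predict: Callable[[List[float]], float],
    feature_budget_ms: float = 5.0,
) -> dict:
    """Fetch stored features, run the model, and report where the time went."""
    t0 = time.perf_counter()
    stored = fetch_online_features(entity_id)    # online feature store lookup
    feature_ms = (time.perf_counter() - t0) * 1000.0

    t1 = time.perf_counter()
    score = predict(request_features + stored)   # forward pass on the model server
    inference_ms = (time.perf_counter() - t1) * 1000.0

    return {
        "score": score,
        "feature_ms": feature_ms,
        "inference_ms": inference_ms,
        "feature_budget_exceeded": feature_ms > feature_budget_ms,
    }


# Hypothetical stand-ins for a Redis/Feast lookup and a model server call.
fake_store = {"user-42": [0.2, 0.7]}
result = serve_request(
    "user-42",
    [1.0, 0.0],
    fetch_online_features=lambda key: fake_store.get(key, [0.0, 0.0]),
    predict=lambda feats: sum(feats) / len(feats),
)
print(result)
```

In a real deployment the two callables would wrap network clients, and the timing fields would be emitted as metrics rather than returned to the caller.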

Real-time serving answers a synchronous request; batch serving pre-computes predictions and writes them to a store for later lookup.

Batch Serving (Offline Inference)

Batch serving runs the model as a scheduled job over a large corpus of inputs and writes the results to a data store. Consumers then look up pre-computed predictions at query time instead of invoking the model. This pattern is correct when you can afford to pre-compute — personalized email content generated nightly, weekly churn scores on all customer accounts, overnight document classification across a document management system.

The trade-off is freshness. A batch job that runs every 24 hours means predictions are at most 24 hours stale. For many business problems that is fine. For fraud detection, it is not. Choose batch when:

The input set is finite and enumerable (all users, all products, all documents).
The prediction is valid for long enough that staleness does not cause harm.
The model is too large or too slow to serve in real time within the latency budget.
Cost is a constraint — batch inference on spot/preemptible GPU instances is typically 3-10x cheaper than always-on real-time endpoints.

At scale, batch inference runs on Spark (PySpark with pandas_udf to vectorize model calls), Ray Batch, or a simple Kubernetes Job that parallelizes over input shards. Outputs land in BigQuery, DynamoDB, Redis, or a feature store's offline path, depending on the lookup pattern.
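Stripped of the cluster machinery, a batch scoring job is a loop over input batches that writes predictions keyed by entity ID. The sketch below assumes a hypothetical `predict_batch` callable and uses a plain dict as the output store; in practice the store would be one of the systems listed above and the loop would run once per input shard.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, List[float]]  # (entity id, feature vector)


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield fixed-size batches so the model sees vectorized inputs."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def run_batch_job(
    rows: Iterable[Row],
    predict_batch: Callable[[List[List[float]]], List[float]],
    store: Dict[str, float],
    batch_size: int = 2,
) -> Dict[str, float]:
    """Score every row and write predictions keyed by entity id."""
    for batch in batched(rows, batch_size):
        ids = [entity_id for entity_id, _ in batch]
        scores = predict_batch([features for _, features in batch])
        store.update(zip(ids, scores))
    return store


# Hypothetical model: the score is just the sum of the features.
rows = [("a", [1.0, 2.0]), ("b", [0.0, 0.0]), ("c", [3.0, 1.0])]
predictions = run_batch_job(rows, lambda m: [sum(f) for f in m], store={})
print(predictions)
```

Consumers then read `predictions["a"]` at query time instead of invoking the model.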

Key idea: Real-time and batch are not mutually exclusive. Many production systems use both: batch to pre-compute baseline scores cheaply, and real-time inference to update that score with the latest in-session context at request time. This is sometimes called the lambda serving architecture, by analogy with the lambda data architecture, and it is a common pattern in large-scale recommendation systems.
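A minimal sketch of the combined pattern, assuming a dict of nightly batch scores and a simple weighted blend. The function name, the 0.3 session weight, and the 0.5 default are all hypothetical; a real system would typically run a lightweight model over the session context rather than take a mean.

```python
from typing import Dict, List


def blended_score(
    user_id: str,
    session_clicks: List[int],
    batch_scores: Dict[str, float],
    default: float = 0.5,
    session_weight: float = 0.3,
) -> float:
    """Start from the pre-computed baseline and adjust it with session context."""
    baseline = batch_scores.get(user_id, default)  # batch path, may be hours old
    if not session_clicks:
        return baseline                            # no fresh signal: serve the baseline
    session_signal = sum(session_clicks) / len(session_clicks)
    return (1 - session_weight) * baseline + session_weight * session_signal


batch_scores = {"u1": 0.8}  # written by the nightly batch job
print(blended_score("u1", [], batch_scores))            # baseline only
print(blended_score("u1", [1, 0, 1, 1], batch_scores))  # baseline plus session signal
print(blended_score("new-user", [], batch_scores))      # falls back to the default
```

The fallback matters: users who appeared after the last batch run have no baseline, so the real-time path needs a sensible default.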

Serving Frameworks in Production

NVIDIA Triton Inference Server is the industry default for GPU workloads. It supports TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, and custom Python backends. Triton's ensemble scheduling allows you to chain pre-processing, inference, and post-processing as a single logical request. Dynamic batching aggregates requests arriving within a configurable window into a single GPU kernel invocation, dramatically improving GPU utilization without the client knowing.

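# Deploy Triton on Kubernetes (Helm)
# Chart name and value keys vary by chart version; confirm them with
# `helm show values nvidia/triton-inference-server` before installing.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# The dot in "nvidia.com/gpu" must be escaped, otherwise Helm parses it as nested keys.
helm install triton nvidia/triton-inference-server \
  --namespace mlserving --create-namespace \
  --set image.imageName=nvcr.io/nvidia/tritonserver:24.08-py3 \
  --set 'resources.limits.nvidia\.com/gpu=2' \
  --set modelRepository.storageType=gcs \
  --set modelRepository.path=gs://my-model-registry/triton-models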

# Model repository layout (GCS):
# gs://my-model-registry/triton-models/
#   fraud-detector/
#     config.pbtxt
#     1/
#       model.onnx

# config.pbtxt: dynamic batching config
mkdir -p fraud-detector/1   # model directory plus version subdirectory
cat <<'EOF' > fraud-detector/config.pbtxt
name: "fraud-detector"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [16, 32, 64]
  max_queue_delay_microseconds: 2000
}
input [{ name: "input_features" data_type: TYPE_FP32 dims: [128] }]
output [{ name: "fraud_score"   data_type: TYPE_FP32 dims: [1]   }]
EOF
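To build intuition for how max_batch_size and max_queue_delay_microseconds interact, the following sketch groups request arrival times into batches. It is a deliberately simplified model, not Triton's scheduler: it ignores preferred_batch_size and execution time, and the function name `form_batches` is hypothetical. A batch closes when it is full or when the next request arrives after the oldest queued request has already waited longer than the delay.

```python
from typing import List


def form_batches(
    arrival_us: List[int], max_batch_size: int, max_queue_delay_us: int
) -> List[List[int]]:
    """Group sorted arrival timestamps (microseconds) into batches."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in arrival_us:
        full = len(current) == max_batch_size
        expired = bool(current) and t - current[0] > max_queue_delay_us
        if current and (full or expired):
            batches.append(current)  # dispatch what is queued
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 500, 1500, 2500, 2600, 9000]
batches = form_batches(arrivals, max_batch_size=4, max_queue_delay_us=2000)
print([len(b) for b in batches])
```

With sparse traffic the delay window dominates and batches stay small; with dense traffic the batch size cap dominates. That is why the queue delay is a latency cost you only pay when it buys larger batches.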

vLLM is a widely used engine for serving large language models. It implements PagedAttention, a KV cache management scheme borrowed from OS virtual memory paging, which stores the cache in fixed-size blocks to sharply reduce the memory wasted on variable-length sequences. On a single A100 80GB, vLLM can deliver several times the throughput of a naive HuggingFace generate() loop; the exact gain depends on the model, sequence lengths, and concurrency.
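The memory argument can be made concrete with a little arithmetic. The sketch below compares two allocation strategies for the KV cache: reserving the maximum sequence length for every request, versus reserving only the whole blocks each sequence needs. The sequence lengths, the 2048-token maximum, and the 16-token block size are illustrative assumptions, and the functions count token slots rather than bytes.

```python
import math
from typing import List


def contiguous_slots(seq_lens: List[int], max_seq_len: int) -> int:
    """Token slots reserved when every sequence preallocates max_seq_len."""
    return len(seq_lens) * max_seq_len


def paged_slots(seq_lens: List[int], block_size: int) -> int:
    """Token slots reserved when each sequence holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [37, 512, 120, 9]  # tokens currently held by four requests
used = sum(seq_lens)
contig = contiguous_slots(seq_lens, max_seq_len=2048)
paged = paged_slots(seq_lens, block_size=16)

print(f"used={used} contiguous={contig} paged={paged}")
print(f"wasted: contiguous={contig - used} paged={paged - used}")
```

With paging, waste is bounded by less than one block per sequence, so the freed memory can hold more concurrent sequences, which is where the throughput gain comes from.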

TF Serving is the canonical choice for TensorFlow SavedModel deployments. It is simpler than Triton and production-proven across Google's internal infrastructure. If your org is TF-heavy and you do not need multi-framework support, TF Serving is the lower-ops choice.

Ray Serve fills the gap when you need Python-first flexibility: custom pre/post-processing logic, ensemble models, or deployment graphs with branches. It integrates with the broader Ray ecosystem (Ray Train, Ray Tune) for end-to-end MLOps on a single cluster.

Autoscaling Inference

Inference workloads are bursty. A recommendation service sees 10x traffic spikes during peak shopping hours. A fraud model sees a surge whenever a large merchant runs a promotion. Static provisioning to handle peaks means spending GPU budget on idle capacity during off-peak hours. At on-demand prices on the order of $1–4/hr per A10G-class GPU, depending on instance size and region, that idle cost accumulates fast.
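A back-of-the-envelope comparison shows the size of the gap. The traffic profile (2 replicas for 18 hours, 10 replicas for 6 peak hours) and the $2.00 per GPU-hour price are hypothetical numbers chosen for illustration, and the sketch assumes one GPU per replica and perfect autoscaling with no warm-up overhead.

```python
from typing import List


def daily_gpu_cost(replicas_by_hour: List[int], price_per_gpu_hour: float) -> float:
    """Cost of one day given the replica count in each hour (one GPU per replica)."""
    return sum(replicas_by_hour) * price_per_gpu_hour


PRICE = 2.00                    # hypothetical on-demand price, USD per GPU-hour
demand = [2] * 18 + [10] * 6    # replicas actually needed in each hour of the day
peak = max(demand)

static_cost = daily_gpu_cost([peak] * 24, PRICE)   # provisioned for peak all day
autoscaled_cost = daily_gpu_cost(demand, PRICE)    # follows demand hour by hour

print(f"static=${static_cost:.2f} autoscaled=${autoscaled_cost:.2f}")
print(f"idle spend=${static_cost - autoscaled_cost:.2f} per day")
```

Real autoscaling never tracks demand this tightly, so treat the autoscaled figure as a lower bound on cost.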

Kubernetes Horizontal Pod Autoscaler (HPA) works for CPU-bound models, but out of the box it scales on CPU and memory and is blind to the signals that matter most for GPU inference: GPU utilization, batch queue depth, and pending request count. A common production answer is KEDA (Kubernetes Event-Driven Autoscaling), which drives the HPA from custom metrics exported by your model server:

# KEDA ScaledObject: scale Triton on average request queue latency
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-fraud-scaler
  namespace: mlserving
spec:
  scaleTargetRef:
    name: triton-fraud-deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: triton_avg_queue_latency_ms
        # Triton exports cumulative counters in microseconds, so take rates
        # over a window and convert to milliseconds per request.
        query: |
          sum(rate(nv_inference_queue_duration_us{model="fraud-detector"}[1m])) /
          sum(rate(nv_inference_request_success{model="fraud-detector"}[1m])) / 1000
        threshold: "50"   # scale up when avg queue latency > 50 ms
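KEDA hands the metric to the HPA, which then decides the replica count. The core of that decision is the ratio formula from the Kubernetes HPA documentation, sketched below. This is a simplification: it omits the tolerance band, stabilization windows, and the way the trigger's metric type affects how the metric value is averaged across pods. The function name is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    metric_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """HPA ratio rule: scale in proportion to how far the metric is from target."""
    raw = math.ceil(current_replicas * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))  # clamp to configured bounds


# Queue latency of 120 ms against a 50 ms target, with 4 replicas running.
print(desired_replicas(4, 120.0, 50.0, min_replicas=1, max_replicas=20))
# Metric well under target: scale down, but never below the minimum.
print(desired_replicas(4, 10.0, 50.0, min_replicas=1, max_replicas=20))
# Extreme spike: clamped at the maximum.
print(desired_replicas(10, 500.0, 50.0, min_replicas=1, max_replicas=20))
```

The clamp is why maxReplicaCount deserves as much thought as the threshold: during a large spike it, not the metric, determines how much capacity you get.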

For cost-sensitive or bursty workloads, consider scale-to-zero. Knative Serving and AWS SageMaker Serverless Inference both support true scale-to-zero: the deployment shrinks to zero replicas when idle and scales up on the first incoming request (cold-start latency: 1–30 seconds depending on model size). This is viable for models with infrequent traffic; it is catastrophically wrong for latency-sensitive paths where the cold start would breach your SLO.

Production practice: For GPU inference pods, set resources.limits.nvidia.com/gpu explicitly, and if you also set resources.requests.nvidia.com/gpu, make it the same value. GPUs are an extended resource in Kubernetes and cannot be overcommitted: the scheduler either fits the pod on a node with a free GPU or it does not. A GPU request without a limit, or a request that differs from the limit, is rejected by the API server; if you specify only the limit, Kubernetes uses it as the request.
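A check like this is easy to enforce in CI before manifests reach the cluster. The sketch below validates the resources section of a container spec, represented as a plain dict; the function name and error strings are hypothetical, and a real pipeline would load the dict from the rendered YAML.

```python
from typing import List

GPU = "nvidia.com/gpu"


def gpu_resource_errors(resources: dict) -> List[str]:
    """Return problems with the GPU request/limit pairing, or an empty list."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    errors: List[str] = []
    if GPU in requests and GPU not in limits:
        errors.append("GPU request set without a GPU limit")
    if GPU in requests and GPU in limits and str(requests[GPU]) != str(limits[GPU]):
        errors.append("GPU request and limit differ")
    return errors


good = {"requests": {GPU: 1, "cpu": "2"}, "limits": {GPU: 1, "cpu": "4"}}
bad = {"requests": {GPU: 1}, "limits": {GPU: 2}}
print(gpu_resource_errors(good))
print(gpu_resource_errors(bad))
```

CPU and memory are left alone on purpose: unlike GPUs, they can legitimately have a request lower than the limit.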

Load Testing Your Serving Stack

Never promote a model to real-time serving without a load test against a staging replica that matches production GPU type and memory. Triton's perf_analyzer is purpose-built for this; alternatively, k6 with a custom model request script works against any REST or gRPC endpoint.

# Triton perf_analyzer: baseline latency and throughput
perf_analyzer \
  -m fraud-detector \
  -u triton-staging.internal:8001 \
  -i grpc \
  --concurrency-range 1:64:8 \
  --measurement-interval 10000 \
  -b 32 \
  --percentile=99

# For each concurrency level, perf_analyzer reports throughput (infer/sec)
# and latency at the requested percentile, in microseconds.
# Tune dynamic_batching preferred_batch_size until p99 < your SLO

// load-test.js: k6 load test against the REST endpoint (run with: k6 run load-test.js)
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 50,
  duration: '120s',
  thresholds: { http_req_duration: ['p(99)<200'] },  // fail the run if p99 exceeds 200 ms
};

export default function () {
  const payload = JSON.stringify({
    inputs: [{ name: 'input_features', datatype: 'FP32', shape: [1, 128], data: Array(128).fill(0.1) }],
  });
  const res = http.post('http://triton-staging.internal:8000/v2/models/fraud-detector/infer', payload, {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, { 'status 200': (r) => r.status === 200 });
}
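Whichever tool produces the numbers, the promotion decision comes down to comparing a tail percentile against the SLO. The sketch below computes nearest-rank percentiles from raw per-request latencies; the sample values and the 200 ms SLO are illustrative, and the function name is hypothetical.

```python
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


latencies_ms = [12, 15, 11, 14, 250, 13, 12, 16, 15, 14]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
slo_ms = 200

print(f"p50={p50} ms, p99={p99} ms, meets SLO: {p99 <= slo_ms}")
```

Note how a single slow request leaves the median untouched but fails the p99 check, which is why load tests should gate on the tail rather than the average.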

Production pitfall (model size vs replica count): Teams routinely underestimate model VRAM requirements and pack too many model instances onto each GPU, triggering out-of-memory kills. Always measure peak VRAM consumption under your target batch size with nvidia-smi during load testing before deciding how many instances share a GPU and before setting maxReplicaCount. A rule of thumb: leave 10-15% VRAM headroom for KV cache growth, CUDA kernel overhead, and concurrent model versions.
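The headroom rule translates directly into a capacity calculation. In this sketch the 24 GB and 80 GB card sizes and the per-instance peak figures are hypothetical inputs; the peak should come from your own nvidia-smi measurements under load, not from the model file size.

```python
def max_instances_per_gpu(
    total_vram_gb: float,
    peak_vram_per_instance_gb: float,
    headroom_fraction: float = 0.15,
) -> int:
    """How many model instances fit on one GPU after reserving headroom."""
    usable = total_vram_gb * (1 - headroom_fraction)
    return int(usable // peak_vram_per_instance_gb)


# 24 GB card, 6.5 GB measured peak per instance, 15% headroom.
print(max_instances_per_gpu(24, 6.5))
# 80 GB card, 30 GB measured peak per instance, 10% headroom.
print(max_instances_per_gpu(80, 30, headroom_fraction=0.10))
```

A result of zero means the model does not fit on that card with the requested headroom, and you need a larger GPU, a smaller batch size, or a quantized model.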

Canary Rollouts for Models

Deploying a new model version should follow the same canary discipline you apply to application code — but with the added complexity that model quality can degrade silently on subpopulations invisible in aggregate metrics. Use your service mesh (Istio, Linkerd) or your model server's traffic-splitting feature to route 5% of production traffic to the new version. Monitor both infrastructure metrics (latency, error rate) and model quality metrics (prediction score distribution, business KPIs) before promoting. If the new model's score distribution shifts significantly relative to the incumbent — measured with a KL divergence check in your monitoring pipeline — roll back immediately.
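A KL divergence check of this kind can be sketched with histograms of prediction scores. The version below assumes scores lie in [0, 1], uses additive smoothing so empty bins do not produce infinities, and measures the divergence of the canary distribution from the incumbent. The bin count, smoothing constant, and sample scores are illustrative assumptions; the alert threshold is something you would calibrate on historical traffic.

```python
import math
from typing import List


def histogram(scores: List[float], bins: int = 10) -> List[int]:
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1  # a score of 1.0 goes in the last bin
    return counts


def kl_divergence(p_counts: List[int], q_counts: List[int], eps: float = 1e-6) -> float:
    """KL(P || Q) over smoothed histograms, in nats."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


incumbent = histogram([0.11, 0.15, 0.23, 0.81], bins=5)
canary = histogram([0.71, 0.75, 0.83, 0.91], bins=5)

print(kl_divergence(incumbent, incumbent))  # identical distributions
print(kl_divergence(canary, incumbent))     # canary has shifted toward high scores
```

Run the same check per segment (region, device, customer tier) as well as in aggregate, since a shift confined to one subpopulation can vanish in the overall histogram.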

The combination of a model registry (lesson 3), a CI/CD pipeline that gates on evaluation metrics (lesson 5), a production serving layer with proper autoscaling, and drift monitoring (lesson 7) is what distinguishes a mature ML platform from a research project that happens to be running in production.