MLOps & DevOps for AI Systems

Training Infrastructure & GPUs

18 min Lesson 4 of 28

Training Infrastructure & GPUs

Running a single training job on a laptop is a science experiment. Running training at scale — hundreds of concurrent experiments, multi-node distributed jobs consuming petabytes of data, with GPU costs tracked to the dollar — is an engineering discipline. This lesson covers the three axes that a senior MLOps engineer must master: scheduling GPU workloads on Kubernetes, exploiting spot/preemptible capacity to cut costs by 60–80%, and the distributed training primitives that let a single job span dozens of nodes. You already know Kubernetes deeply; here we extend that knowledge into the ML infrastructure plane.

GPU Scheduling on Kubernetes

Kubernetes treats GPUs as extended resources under the device plugin framework (first stabilized in k8s 1.10). The NVIDIA device plugin — deployed as a DaemonSet — advertises nvidia.com/gpu resources on each GPU node. Requesting a GPU is syntactically identical to requesting CPU or memory, which means all your existing RBAC, namespace quotas, and scheduling primitives work unchanged.

The critical production detail: GPU resources are not fractional by default. A pod requesting nvidia.com/gpu: 1 gets exclusive access to a full physical GPU. If a node has 8xA100s and your pod requests 1, seven GPUs sit idle unless seven other pods are co-scheduled. This leads to low GPU utilization — one of the costliest waste patterns in ML infrastructure. The solutions are time-slicing (NVIDIA MIG on A100/H100) or virtual GPU sharing, but both come with caveats covered below.

# 1. Deploy the NVIDIA device plugin DaemonSet via Helm helm repo add nvdp https://nvidia.github.io/k8s-device-plugin helm repo update helm upgrade --install nvdp nvdp/nvidia-device-plugin \ --namespace kube-system \ --set migStrategy=single \ --set failOnInitError=false # 2. Label GPU node pools so you can target them precisely kubectl label node gpu-node-1 accelerator=nvidia-tesla-a100 node-type=gpu # 3. A minimal single-GPU training Job (YAML) # apiVersion: batch/v1 # kind: Job # metadata: # name: train-resnet50-v12 # namespace: ml-training # spec: # backoffLimit: 2 # template: # spec: # restartPolicy: OnFailure # tolerations: # - key: nvidia.com/gpu # operator: Exists # effect: NoSchedule # nodeSelector: # accelerator: nvidia-tesla-a100 # containers: # - name: trainer # image: registry.example.com/ml/trainer:2.3.1 # command: ["python", "train.py", "--epochs=50", "--batch-size=256"] # resources: # limits: # nvidia.com/gpu: "1" # memory: "60Gi" # requests: # nvidia.com/gpu: "1" # memory: "60Gi" # Check GPU resource allocation on your cluster: kubectl describe node gpu-node-1 | grep -A10 "Allocatable" kubectl get pods -n ml-training -o wide
MIG (Multi-Instance GPU): NVIDIA A100 and H100 support Multi-Instance GPU, which partitions a single physical GPU into up to 7 isolated instances, each with dedicated memory and compute. A 40GB A100 can expose seven 5GB MIG instances, letting you schedule seven small inference or fine-tuning jobs simultaneously. Enable it with nvidia-smi -i 0 -mig 1 and configure profiles with nvidia-smi mig -cgi 9,9,9,9,9,9,9 -C for seven equal 1g.5gb slices. Coordinate MIG profiles with the device plugin by setting migStrategy=mixed in the Helm values and using resource names like nvidia.com/mig-1g.5gb in pod specs. At Google and Meta, MIG is the default for inference serving; full-GPU allocation is reserved for large training runs.

The GPU Scheduling Pipeline on a Real Cluster

At big-tech scale, GPU clusters run a batch scheduling layer on top of Kubernetes. The two dominant choices are Volcano (CNCF) and Yunikorn (Apache), both of which add gang scheduling — the guarantee that all pods in a job start simultaneously or none start at all. This is critical for distributed training: if 15 of 16 pods start but one cannot be scheduled because a node is being drained, the 15 running pods spin in a barrier wait, burning GPU-hours and accomplishing nothing.

The diagram below shows how a Kubeflow training operator request flows through the scheduling stack from submission to GPU allocation.

GPU Scheduling Stack on Kubernetes ML Engineer kubectl / SDK Training Operator PyTorchJob / TFJob Volcano Scheduler Gang / Preemption k8s Scheduler (non-GPU pods) GPU Node Pool A100 / H100 nodes Device Plugin nvidia.com/gpu Spot GPU Nodes Preemptible / Cheap Checkpoint Store (S3 / GCS) Resume on preemption submit PodGroup spot bid on-demand
GPU scheduling stack: the Training Operator creates a PodGroup; Volcano gang-schedules it across on-demand or spot GPU nodes; the device plugin exposes GPU resources; checkpoints to object storage enable spot recovery.

Spot Training: 60–80% Cost Reduction

GPU instances are expensive. An 8xA100 on-demand node on AWS (p4d.24xlarge) costs roughly $32/hour. The spot equivalent — EC2 Spot or GCP preemptible — costs $10–13/hour, a 60–70% reduction. At companies running thousands of GPU-hours per day, this difference is millions of dollars per year. The catch: a spot node can be reclaimed by the cloud provider with 2 minutes notice (EC2) or 30 seconds (GCP). A training job running for 10 hours with no fault tolerance loses everything when its node evicts.

The production pattern to make spot training viable is periodic checkpointing to durable object storage. A training job that checkpoints every 10 minutes loses at most 10 minutes of work on an eviction. Combined with the Kubernetes Job backoffLimit and proper restart logic, the job resumes from the last checkpoint automatically. The overhead of checkpointing a large model — saving weights, optimizer state, and scheduler state — is typically 30–120 seconds for a 7B-parameter model, acceptable at 10-minute intervals.

# PyTorch training loop with periodic checkpointing to S3 # This is the pattern Google Brain, Meta FAIR, and Databricks all use internally. import torch, os, boto3, time CHECKPOINT_BUCKET = "ml-checkpoints-prod" CHECKPOINT_KEY = "runs/resnet50-v12/ckpt-latest.pt" CHECKPOINT_EVERY = 600 # seconds def save_checkpoint(model, optimizer, scheduler, epoch, step, loss): state = { "epoch": epoch, "step": step, "model_state": model.state_dict(), "optimizer_state": optimizer.state_dict(), "scheduler_state": scheduler.state_dict(), "loss": loss, } tmp = "/tmp/ckpt-latest.pt" torch.save(state, tmp) s3 = boto3.client("s3") s3.upload_file(tmp, CHECKPOINT_BUCKET, CHECKPOINT_KEY) print(f"[ckpt] saved epoch={epoch} step={step} loss={loss:.4f}") def load_checkpoint(model, optimizer, scheduler): s3 = boto3.client("s3") tmp = "/tmp/ckpt-resume.pt" try: s3.download_file(CHECKPOINT_BUCKET, CHECKPOINT_KEY, tmp) state = torch.load(tmp, map_location="cuda") model.load_state_dict(state["model_state"]) optimizer.load_state_dict(state["optimizer_state"]) scheduler.load_state_dict(state["scheduler_state"]) print(f"[ckpt] resumed from epoch={state['epoch']} step={state['step']}") return state["epoch"], state["step"] except Exception as e: print(f"[ckpt] no checkpoint found, starting fresh: {e}") return 0, 0 # In your training loop: last_ckpt = time.time() for epoch in range(start_epoch, num_epochs): for step, batch in enumerate(dataloader, start=start_step): loss = train_step(model, optimizer, batch) if time.time() - last_ckpt >= CHECKPOINT_EVERY: save_checkpoint(model, optimizer, scheduler, epoch, step, loss) last_ckpt = time.time()
Spot eviction cascade: In a distributed training job (e.g. 16-node PyTorch DDP), if one spot node is evicted, all other nodes block at the next dist.barrier() call and effectively stall. You must configure SIGTERM handlers that trigger a checkpoint and graceful shutdown of the entire job before the 2-minute eviction window expires. On AWS, use the EC2 instance metadata service to poll for spot interruption notices (curl http://169.254.169.254/latest/meta-data/spot/instance-action every 5 seconds from a sidecar) and send a SIGTERM to the training process when a notice appears.

Node Pool Architecture: On-Demand vs. Spot

The standard production architecture uses two GPU node pools in the same cluster. A small on-demand pool (1–4 nodes) handles short, high-priority jobs — final evaluation runs, production model retraining on a deadline, anything under 30 minutes. A large spot pool (auto-scaling, 0–50 nodes) handles exploratory experiments, hyperparameter sweeps, and long preemptible training runs. Volcano's PriorityClass and queue system enforces the boundary: jobs tagged tier: research go to spot; jobs tagged tier: production go to on-demand.

On EKS, configure spot nodes using a dedicated managed node group with capacityType: SPOT and a list of instance types (not just one) so that the autoscaler can fall back across instance families when spot capacity for a single type is exhausted:

# eksctl cluster config snippet for a mixed on-demand + spot GPU setup # (excerpt — assumes VPC, subnets, and IAM roles are already defined) managedNodeGroups: - name: gpu-ondemand instanceType: p4d.24xlarge capacityType: ON_DEMAND minSize: 0 maxSize: 4 labels: node-type: gpu-ondemand taints: - key: nvidia.com/gpu value: "true" effect: NoSchedule - name: gpu-spot instanceTypes: - p4d.24xlarge - p3.16xlarge - p3dn.24xlarge capacityType: SPOT minSize: 0 maxSize: 50 labels: node-type: gpu-spot taints: - key: nvidia.com/gpu value: "true" effect: NoSchedule - key: spot value: "true" effect: NoSchedule # Karpenter (preferred over Cluster Autoscaler for ML workloads) # NodePool for spot GPU with fallback: # apiVersion: karpenter.sh/v1 # kind: NodePool # spec: # template: # spec: # requirements: # - key: karpenter.sh/capacity-type # operator: In # values: ["spot", "on-demand"] # spot preferred; on-demand fallback # - key: node.kubernetes.io/instance-type # operator: In # values: ["p4d.24xlarge", "p3.16xlarge", "p3dn.24xlarge"] # disruption: # consolidationPolicy: WhenEmpty # budgets: # - nodes: "10%" # never evict more than 10% simultaneously

Distributed Training Basics

When a single GPU does not provide enough memory or throughput to train a model in a reasonable time, training is distributed across multiple GPUs and nodes. There are three axes of parallelism, and large models use all three simultaneously (3D parallelism):

  • Data parallelism (DDP): Each GPU holds a full model replica and processes a different mini-batch. Gradients are averaged across replicas via an AllReduce collective at the end of each backward pass. PyTorch's DistributedDataParallel is the standard. This scales well up to ~32 GPUs, after which gradient communication overhead dominates.
  • Tensor parallelism: A single layer's weight matrix is split across GPUs. Forward pass requires communication mid-layer. Used for very wide transformer models. Megatron-LM (NVIDIA) is the reference implementation.
  • Pipeline parallelism: Different layers are assigned to different GPUs. The model is split vertically. Requires careful micro-batching to avoid GPU stalls. DeepSpeed uses this for models that exceed single-GPU memory.

For most production ML teams below LLM scale, DDP is the starting point and the only form of parallelism needed. A PyTorchJob on Kubeflow orchestrates DDP across nodes by injecting MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables into each pod — the same environment that torchrun expects.

Rule of thumb for scaling: DDP scales linearly up to the point where gradient synchronization latency exceeds the backward pass compute time. On NVLink-connected GPUs within a single node this threshold is rarely hit. Across nodes via InfiniBand (100–400 Gbps), a 16-node DDP job with a ResNet-50 model stays nearly linear. With a 70B-parameter model, gradient tensors alone are several GB per step — at that point, tensor or pipeline parallelism is unavoidable.

The Kubernetes-native way to launch a distributed PyTorch job is the Training Operator (formerly Kubeflow Training Operator). A PyTorchJob resource declares the master and worker replicas; the operator handles pod creation, environment injection, failure recovery, and gang scheduling coordination with Volcano:

# PyTorchJob: 1 master + 7 workers = 8 pods, each using 1 A100 # The operator injects MASTER_ADDR, RANK, WORLD_SIZE automatically. apiVersion: kubeflow.org/v1 kind: PyTorchJob metadata: name: ddp-train-llama-7b-finetune namespace: ml-training spec: pytorchReplicaSpecs: Master: replicas: 1 restartPolicy: OnFailure template: spec: tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule containers: - name: pytorch image: registry.example.com/ml/trainer-ddp:3.1.0 command: - torchrun - --nproc_per_node=1 - train_ddp.py - --model=llama-7b - --dataset=s3://datasets-prod/finetune-v4 resources: limits: nvidia.com/gpu: "1" memory: "180Gi" Worker: replicas: 7 restartPolicy: OnFailure template: spec: tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule containers: - name: pytorch image: registry.example.com/ml/trainer-ddp:3.1.0 command: - torchrun - --nproc_per_node=1 - train_ddp.py - --model=llama-7b - --dataset=s3://datasets-prod/finetune-v4 resources: limits: nvidia.com/gpu: "1" memory: "180Gi"

Observability for Training Jobs

GPU training jobs have a different observability surface than typical service pods. The key metrics to export to Prometheus are GPU utilization (DCGM_FI_DEV_GPU_UTIL), GPU memory used (DCGM_FI_DEV_FB_USED), and NVLink bandwidth (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL). The NVIDIA DCGM Exporter (deployed as a DaemonSet alongside the device plugin) exposes all of these in standard Prometheus format. A training job where GPU utilization drops below 60% over a sustained period is a signal of a data loading bottleneck — typically the DataLoader is the CPU bottleneck, not the model itself. Increasing num_workers or prefetching to GPU memory resolves this in most cases.

The most common training job failure you will debug: GPU utilization shows 100% for a few seconds then drops to 5% repeatedly — this is the DataLoader starvation pattern. Profile with PyTorch Profiler (torch.profiler.profile), and look for long DataLoaderWorker spans. The fix is almost always one of: increase DataLoader num_workers, use a faster storage backend (FSx for Lustre instead of S3 FUSE mounts), or pre-tokenize and cache your dataset to avoid online preprocessing.

Production Judgment: On-Demand vs. Spot Decision Matrix

Not all training jobs should go to spot. The decision depends on three factors: job duration, checkpoint overhead, and deadline sensitivity. A hyperparameter sweep of 200 short 20-minute experiments is ideal for spot — each experiment is small enough to complete between interruptions and cheap enough to retry. A final pre-production training run that must complete in 8 hours for a Monday release should go to on-demand with guaranteed capacity reservation. The engineering judgment is: if expected cost of interruption and retry (wasted GPU-hours plus operational overhead) exceeds the spot discount, use on-demand.