Kubernetes Workloads & Configuration

Jobs & CronJobs

Not every workload is a long-running server. Data pipelines, database migrations, ML model training runs, report generation, cache warming, and nightly backups all share a fundamental shape: they start, do finite work, and stop. Kubernetes Jobs and CronJobs are the purpose-built primitives for this class of workload. They provide run-to-completion semantics (retry on failure, stop on success, record the outcome) that a Deployment, which restarts its Pods indefinitely, cannot offer.

A Job creates one or more Pods, ensures a specified number of them successfully terminate, and tracks that completion state in etcd — giving you a durable, queryable record that the work finished. A CronJob is a controller that creates a new Job object on a cron schedule. Understanding both, and the failure semantics between them, is essential for production reliability.

Anatomy of a Job

The critical fields on a Job spec are completions, parallelism, backoffLimit, and activeDeadlineSeconds. Every decision you make about these four fields directly determines how your batch workload behaves under failure.

# job-basic.yaml — a single-completion Job
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v3-12
  namespace: production
  labels:
    app: payments
    version: v3.12
spec:
  # How many Pods must complete successfully (default 1)
  completions: 1

  # How many Pods may run in parallel (default 1)
  parallelism: 1

  # Max Pod failures before the Job itself fails (default 6)
  backoffLimit: 3

  # Hard wall-clock deadline for the entire Job (seconds)
  # The Job is marked Failed if it has not finished in this time
  activeDeadlineSeconds: 300

  # How long to keep completed Job objects (and their Pods) for log access
  # Without this, completed Jobs accumulate and fill up etcd
  ttlSecondsAfterFinished: 3600

  template:
    metadata:
      labels:
        job-name: db-migration-v3-12
    spec:
      # Jobs accept only Never or OnFailure; the API server rejects Always
      # (the Pod default, and the only value a Deployment allows)
      # OnFailure: restart the container in place
      # Never: create a new Pod on failure (keeps the failed Pod for log inspection)
      restartPolicy: Never

      containers:
      - name: migrator
        image: payments-migrator:v3.12
        command: ["python", "manage.py", "migrate", "--no-input"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "2Gi"

restartPolicy must be Never or OnFailure, never Always: Always is the Pod default and the only value a Deployment accepts, because it restarts the container indefinitely. A run-to-completion workload could never finish under that policy, so the API server rejects a Job whose Pod template sets restartPolicy: Always. Never is the usual production preference: each failure leaves the failed Pod behind and starts a new one, so the crash log is preserved (readable with kubectl logs <failed-pod>) and every failure is visible as a separate Pod. OnFailure restarts the container inside the same Pod, which saves Pod-creation overhead, but only the immediately preceding container's logs stay reachable (kubectl logs --previous), and the Pod itself is terminated once backoffLimit is reached, which makes debugging harder.

Parallel Jobs: completions and parallelism

The real power of Jobs becomes apparent when you need to process a large workload in parallel. Kubernetes provides three Job patterns, selected by the combination of completions and parallelism:

Single-completion Job (completions: 1, parallelism: 1): One Pod runs; when it exits 0, the Job is done. Use for migrations, one-shot scripts.
Fixed-completion-count Job (completions: N, parallelism: K): Kubernetes runs up to K Pods in parallel and keeps spawning new ones until N successful completions accumulate. Use for processing N work items when each Pod handles exactly one item (the item identity is passed via an environment variable or work-queue index). As of Kubernetes 1.24, the completionMode: Indexed field gives each Pod a unique zero-based index in JOB_COMPLETION_INDEX, eliminating the need for a separate work-queue.
Work-queue Job (completions unset, parallelism: K): Pods pull items from an external queue (SQS, RabbitMQ, Redis list). The Job completes when any Pod exits successfully and all others finish or are terminated. Use when item count is not known in advance.
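
To make the work-queue pattern concrete, here is a minimal sketch of the loop each Pod might run. It is an illustration under stated assumptions, not a reference implementation: the in-memory queue.Queue stands in for SQS, RabbitMQ, or a Redis list, and drain and the file names are hypothetical.

```python
import queue


def drain(work, handle):
    """Process items until the queue is empty; return how many were handled."""
    handled = 0
    while True:
        try:
            item = work.get_nowait()
        except queue.Empty:
            # Nothing left: the worker returns and the container can exit 0,
            # which tells the Job controller not to start more Pods.
            return handled
        handle(item)
        handled += 1


# In-memory stand-in for the external queue (assumption for this sketch)
work = queue.Queue()
for name in ("a.csv", "b.csv", "c.csv"):
    work.put(name)

count = drain(work, lambda item: print(f"processed {item}"))
print(f"handled {count} items")
```

With a real broker, "queue is empty" usually needs a short timeout or a visibility check before the worker decides to exit, because other Pods may still return items to the queue.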

# job-indexed.yaml — Indexed parallel Job (Kubernetes 1.24+)
# Process 500 customer report segments, 20 at a time
apiVersion: batch/v1
kind: Job
metadata:
  name: reports-q4-2024
  namespace: data-platform
spec:
  completions: 500
  parallelism: 20
  completionMode: Indexed        # each Pod gets JOB_COMPLETION_INDEX env var (0..499)
  backoffLimit: 5
  activeDeadlineSeconds: 7200   # 2-hour wall-clock limit for the entire batch
  ttlSecondsAfterFinished: 86400

  template:
    spec:
      restartPolicy: Never
      containers:
      - name: reporter
        image: data-platform/report-generator:1.8.3
        command:
        - /bin/sh
        - -c
        - |
          SEGMENT=${JOB_COMPLETION_INDEX}
          echo "Processing segment ${SEGMENT}"
          python generate_report.py --segment "${SEGMENT}" --output s3://reports/q4-2024/
        env:
        - name: AWS_REGION
          value: us-east-1
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "2Gi"
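
Inside the container, the script only needs to read JOB_COMPLETION_INDEX to know which slice of the work it owns. The following is a hedged sketch of what that part of a report generator could look like; segment_from_env, output_key, and the .parquet naming are hypothetical and not taken from generate_report.py.

```python
import os


def segment_from_env(env=os.environ, total=500):
    """Return this Pod's segment number from JOB_COMPLETION_INDEX."""
    raw = env.get("JOB_COMPLETION_INDEX")
    if raw is None:
        raise RuntimeError("JOB_COMPLETION_INDEX is not set; is completionMode Indexed?")
    index = int(raw)
    if not 0 <= index < total:
        raise ValueError(f"index {index} is outside 0..{total - 1}")
    return index


def output_key(prefix, segment):
    """Deterministic name, so a retried index overwrites instead of duplicating."""
    return f"{prefix}segment-{segment:04d}.parquet"


# Simulate the environment of the Pod that was assigned index 7
segment = segment_from_env({"JOB_COMPLETION_INDEX": "7"})
print(output_key("s3://reports/q4-2024/", segment))
# prints s3://reports/q4-2024/segment-0007.parquet
```

Deriving the output name purely from the index is a deliberate choice: if index 7 fails and is retried, the second attempt writes to the same location.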

Figure: A fixed-completion Job (completions=4, parallelism=2) running two waves of Pods, with one retry consumed against backoffLimit.
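
The fixed-completion scenario above (completions=4, parallelism=2, one failed Pod) can be traced with a toy model. This is a simplification for intuition only: it runs Pods in lock-step rounds and ignores back-off delays, whereas the real controller replaces a failed Pod without waiting for its siblings. The simulate function is hypothetical.

```python
def simulate(completions, parallelism, backoff_limit, outcomes):
    """Toy round-by-round model of a fixed-completion Job.

    outcomes: one boolean per Pod in start order (True means exit code 0).
    Returns (state, rounds), where rounds lists the results of each round.
    """
    results = iter(outcomes)
    succeeded = failed = 0
    rounds = []
    while succeeded < completions:
        if failed > backoff_limit:
            return "Failed", rounds
        # Never start more Pods than there are completions still missing
        size = min(parallelism, completions - succeeded)
        batch = [next(results) for _ in range(size)]
        succeeded += sum(batch)
        failed += len(batch) - sum(batch)
        rounds.append(batch)
    return "Complete", rounds


state, rounds = simulate(4, 2, 3, [True, False, True, True, True])
print(state, rounds)
# prints Complete [[True, False], [True, True], [True]]
```

Five Pods are needed for four completions: the failed Pod consumes one unit of backoffLimit and its work is picked up by a later Pod.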

Retry Semantics and Failure Modes

Understanding exactly when and how a Job fails is the difference between a reliable batch system and one that silently never finishes. There are two distinct failure conditions:

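Pod failure: A Pod exits with a non-zero code, is killed by OOM, or is evicted. This increments the Job's failure counter, and the controller recreates the Pod after an exponential back-off delay (10s, 20s, 40s, ...) capped at six minutes. Once the counter exceeds backoffLimit, the Job transitions to Failed, its remaining Pods are terminated, and no more Pods are started.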
Deadline exceeded: activeDeadlineSeconds is reached regardless of the failure counter. This takes absolute precedence — a Job with backoffLimit: 1000 but activeDeadlineSeconds: 60 will be killed after 60 seconds even if only one failure has occurred.
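
The interaction between the two limits is easier to see with numbers. The sketch below uses the documented back-off sequence (10s, 20s, 40s, ... capped at six minutes); everything else is a simplifying assumption: one Pod at a time, every Pod fails after a fixed runtime, and scheduling latency is ignored. Both function names are hypothetical.

```python
def retry_delays(backoff_limit, base=10, cap=360):
    """Delay in seconds before each retry: 10, 20, 40, ... capped at 6 minutes."""
    return [min(base * 2 ** i, cap) for i in range(backoff_limit)]


def first_terminal_condition(backoff_limit, active_deadline_seconds, pod_runtime):
    """Which limit ends a Job whose Pods always fail after pod_runtime seconds?"""
    elapsed = 0
    delays = retry_delays(backoff_limit)
    # backoff_limit retries means backoff_limit + 1 attempts in total
    for attempt in range(backoff_limit + 1):
        elapsed += pod_runtime
        if elapsed >= active_deadline_seconds:
            return "DeadlineExceeded"
        if attempt < backoff_limit:
            elapsed += delays[attempt]
            if elapsed >= active_deadline_seconds:
                return "DeadlineExceeded"
    return "BackoffLimitExceeded"


print(retry_delays(6))
# prints [10, 20, 40, 80, 160, 320]
print(first_terminal_condition(3, 300, 30))
# prints BackoffLimitExceeded
print(first_terminal_condition(1000, 60, 5))
# prints DeadlineExceeded
```

The last call mirrors the backoffLimit: 1000 example: the deadline ends the Job during the third back-off wait, long before the retry budget matters.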

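Pod failure policy (podFailurePolicy), enabled by default as a beta feature since Kubernetes 1.26 and stable since 1.31, gives fine-grained control: you can tell the Job not to count certain failures against backoffLimit (e.g., exit code 137, a SIGKILL that usually indicates an OOM kill, or exit code 143, a SIGTERM) or to fail the Job immediately on specific exit codes (e.g., exit code 42 = "data corruption detected, abort immediately"). It requires restartPolicy: Never in the Pod template. Use Ignore with care: an ignored failure is retried without consuming backoffLimit, so a container that deterministically exceeds its own memory limit will be retried until activeDeadlineSeconds stops it.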

# job-with-failure-policy.yaml (Kubernetes 1.26+)
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
  namespace: data-platform
spec:
  completions: 100
  parallelism: 10
  backoffLimit: 4
  activeDeadlineSeconds: 3600

  # Fine-grained failure control (1.26+)
  podFailurePolicy:
    rules:
    # Exit code 42 = business logic "unrecoverable error": fail the whole Job immediately
    - action: FailJob
      onExitCodes:
        containerName: processor
        operator: In
        values: [42]

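    # 137 (SIGKILL, often an OOM kill) or 143 (SIGTERM): do NOT count against
    # backoffLimit; a replacement Pod is created instead. This assumes the kills
    # come from node pressure or preemption. A container that always hits its
    # own memory limit would be retried until activeDeadlineSeconds expires.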
    - action: Ignore
      onExitCodes:
        containerName: processor
        operator: In
        values: [137, 143]

    # Everything else: count against backoffLimit (default behaviour)

  template:
    spec:
      restartPolicy: Never
      containers:
      - name: processor
        image: data-platform/processor:2.1.0
        command: ["python", "process.py"]
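
Rules are evaluated in order and the first match wins; a failure that matches no rule falls back to the default and counts against backoffLimit. The sketch below models only that ordering for onExitCodes rules. It is a simplified illustration, not the controller's code: decide and RULES are hypothetical, and onPodConditions rules are left out.

```python
# Mirrors the two rules from the manifest above
RULES = [
    {"action": "FailJob",
     "onExitCodes": {"containerName": "processor", "operator": "In", "values": [42]}},
    {"action": "Ignore",
     "onExitCodes": {"containerName": "processor", "operator": "In", "values": [137, 143]}},
]


def decide(exit_code, container, rules):
    """Return the action of the first matching rule, or "Count" if none match."""
    for rule in rules:
        cond = rule["onExitCodes"]
        name = cond.get("containerName")
        if name is not None and name != container:
            continue  # rule is restricted to a different container
        matched = exit_code in cond["values"]
        if cond["operator"] == "NotIn":
            matched = not matched
        if matched:
            return rule["action"]
    return "Count"  # default: counts against backoffLimit


for code in (42, 137, 1):
    print(code, decide(code, "processor", RULES))
# prints:
# 42 FailJob
# 137 Ignore
# 1 Count
```

Because order matters, the most specific and most severe rules (such as FailJob on a known fatal code) belong at the top of the list.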

Always set activeDeadlineSeconds on production Jobs: Without it, a stuck Job (network partition, deadlock, infinite retry) runs forever and blocks cluster resources. Setting a deadline makes worst-case behaviour explicit and bounded. A good rule of thumb is 3x the expected wall-clock runtime — enough headroom for retries and slow infra, but not unbounded.

CronJobs

A CronJob is a Job factory driven by a cron schedule. Every tick of the schedule creates a new Job object, which in turn creates Pods. The controller records the last schedule times, computes missed runs on restart, and enforces limits on how many concurrent and historical Job objects to keep.

# cronjob-nightly-report.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-revenue-report
  namespace: analytics
spec:
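  # Standard 5-field cron: minute hour day-of-month month day-of-week
  # Interpreted in spec.timeZone when it is set; otherwise in the local time
  # zone of the kube-controller-manager (often UTC, but not guaranteed)
  schedule: "0 2 * * *"           # 02:00 every day in America/New_York (see timeZone)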

  # What to do if a new run is due but the previous one is still running:
  # Allow: run anyway (can cause parallel runs — dangerous for non-idempotent jobs)
  # Forbid: skip this run entirely
  # Replace: kill the running Job and start fresh
  concurrencyPolicy: Forbid

  # Kubernetes 1.27+: IANA timezone — no more manual UTC offset math
  timeZone: "America/New_York"

  # If the CronJob misses its start time by more than this many seconds, skip
  # Prevents stale runs from catching up after a long cluster outage
  startingDeadlineSeconds: 3600

  # How many completed/failed Job records to keep (for log access via kubectl)
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5

  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800   # 30-minute hard limit per run
      ttlSecondsAfterFinished: 86400

      template:
        spec:
          restartPolicy: Never
          serviceAccountName: analytics-job-sa
          containers:
          - name: reporter
            image: analytics/revenue-reporter:3.4.1
            command: ["python", "generate_revenue_report.py", "--date", "yesterday"]
            envFrom:
            - secretRef:
                name: analytics-db-credentials
            resources:
              requests:
                cpu: "500m"
                memory: "1Gi"
              limits:
                cpu: "2"
                memory: "4Gi"

concurrencyPolicy: Allow is almost always wrong for stateful jobs: If your nightly report writes to a database table and the previous run is still executing when the new one starts, you get two writers fighting over the same output. The result is duplicate rows, data corruption, or a deadlock. For jobs that mutate shared state, Forbid is the safe choice; Replace also prevents overlap, but it kills the running Job mid-write, so it only fits when a half-finished run can be discarded cleanly. Reserve Allow for fully stateless, idempotent operations like sending a heartbeat ping.
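
The decision made at each schedule tick can be summarised in a few lines. This is a simplified model of the behaviour described above, not the CronJob controller's actual logic; on_tick and its return strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def on_tick(policy, previous_running, scheduled, now, starting_deadline_seconds=None):
    """Return what a simplified CronJob controller does for one scheduled run."""
    if starting_deadline_seconds is not None:
        if now - scheduled > timedelta(seconds=starting_deadline_seconds):
            return "skip: missed starting deadline"
    if not previous_running or policy == "Allow":
        return "start new Job"
    if policy == "Forbid":
        return "skip: previous run still active"
    if policy == "Replace":
        return "delete running Job, start new Job"
    raise ValueError(f"unknown concurrencyPolicy {policy!r}")


scheduled = datetime(2024, 1, 15, 7, 0, tzinfo=timezone.utc)
on_time = scheduled + timedelta(seconds=5)
late = scheduled + timedelta(hours=2)

print(on_tick("Forbid", True, scheduled, on_time, 3600))
# prints skip: previous run still active
print(on_tick("Forbid", False, scheduled, late, 3600))
# prints skip: missed starting deadline
```

The second call shows startingDeadlineSeconds at work: after a two-hour outage the 3600-second window has passed, so the stale run is dropped rather than started late.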

Operational Commands

Day-to-day operations on Jobs and CronJobs require a small set of essential commands:

# --- Inspecting Jobs ---
kubectl get jobs -n data-platform
kubectl get jobs -n data-platform -w                   # watch live
kubectl describe job reports-q4-2024 -n data-platform

# Check how many completions have accumulated
kubectl get job reports-q4-2024 -n data-platform \
  -o jsonpath='{.status.succeeded}/{.spec.completions}'

# View logs from a Job's Pods (the Job controller adds the job-name label automatically)
kubectl logs -l job-name=reports-q4-2024 -n data-platform --tail=50

# --- Triggering a CronJob manually (for debugging) ---
kubectl create job --from=cronjob/nightly-revenue-report manual-run-$(date +%s) \
  -n analytics

# --- Suspending a CronJob (pause without deleting) ---
kubectl patch cronjob nightly-revenue-report -n analytics \
  -p '{"spec":{"suspend":true}}'

# Resume:
kubectl patch cronjob nightly-revenue-report -n analytics \
  -p '{"spec":{"suspend":false}}'

# --- Cleaning up a failed Job and its Pods ---
kubectl delete job db-migration-v3-12 -n production   # cascade-deletes Pods
# OR rely on ttlSecondsAfterFinished for automatic cleanup

# --- Checking missed CronJob runs ---
kubectl describe cronjob nightly-revenue-report -n analytics | grep -A10 "Events:"
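
For dashboards or scripts it is often easier to parse the JSON form of a Job than to scrape kubectl describe. The sketch below summarises a Job object such as the one returned by kubectl get job NAME -o json. The field names (spec.completions, status.succeeded, status.failed, status.active, status.conditions) are part of the Job API; summarize and the hand-written SAMPLE are hypothetical.

```python
import json


def summarize(job):
    """One-line status summary for a Job object parsed from JSON."""
    spec = job.get("spec", {})
    status = job.get("status", {})
    state = "In progress"
    for cond in status.get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            reason = cond.get("reason")
            state = f"{cond['type']} ({reason})" if reason else cond["type"]
    wanted = spec.get("completions", "?")  # unset for work-queue Jobs
    return (
        f"{state}: {status.get('succeeded', 0)}/{wanted} succeeded, "
        f"{status.get('failed', 0)} failed, {status.get('active', 0)} active"
    )


# Hand-written sample, not real cluster output
SAMPLE = json.loads("""
{"spec": {"completions": 500},
 "status": {"succeeded": 120, "failed": 2, "active": 20}}
""")
print(summarize(SAMPLE))
# prints In progress: 120/500 succeeded, 2 failed, 20 active
```

Counters that are zero are omitted from the API response, which is why every lookup falls back to a default.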

ttlSecondsAfterFinished is mandatory at scale: Without it, finished Jobs that were created directly (by CI pipelines, operators, or kubectl create job) stay in etcd together with their Pods until someone deletes them. Jobs created by a CronJob are pruned by successfulJobsHistoryLimit and failedJobsHistoryLimit, but with thousands of CronJobs even the retained history adds up. Left unchecked, this can exhaust etcd storage, slow down the API server, and degrade kubectl get pods across the cluster. Always set this field. A sensible default is 86400 (24 hours) so that logs are available for a day of post-mortem investigation before automatic cleanup.

Production Patterns

Teams that run batch workloads at scale tend to converge on a few practices for reliable Jobs in Kubernetes:

Idempotency first: Every Job must be safe to re-run. If an item is processed twice, the result should be identical to processing it once. Use database upserts instead of inserts, write to temporary locations and atomic-rename to the final path, and store processed item IDs in a completion table.
Emit a metric on completion: Have each Job container push a Prometheus metric (via Pushgateway or a sidecar) recording completion time and success/failure. Set an alert on "no successful completion in the last N scheduled windows." Kubernetes Job status is in etcd, but it does not alert you — your observability stack must.
Use namespaces and RBAC to isolate batch workloads: A data-pipeline Job that misbehaves should not be able to touch production API servers. Dedicated namespaces with ServiceAccounts scoped to only the resources the Job needs is the standard pattern.
Version your Job names: Name Jobs with the code version and run date (e.g., migration-v3-12-20240115). This makes kubectl get jobs output self-documenting and ensures each deploy of a migration creates a distinct Job object.
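
The "write to a temporary location, then rename" advice from the idempotency item can be sketched as follows. This is a minimal illustration that assumes a POSIX filesystem, where os.replace is atomic within one filesystem; object stores such as S3 have no rename and need a different approach. write_atomically, process_once, and the .out suffix are hypothetical.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write bytes to path so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # publish in one step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave half-written temp files behind
        raise


def process_once(item_id, out_dir, render):
    """Render and publish an item unless its output already exists."""
    target = os.path.join(out_dir, f"{item_id}.out")
    if os.path.exists(target):
        return False  # an earlier attempt already finished this item
    write_atomically(target, render(item_id))
    return True


with tempfile.TemporaryDirectory() as out_dir:
    render = lambda item_id: f"report for {item_id}\n".encode()
    print(process_once("segment-7", out_dir, render))  # first attempt does the work
    print(process_once("segment-7", out_dir, render))  # retry is a no-op
# prints:
# True
# False
```

Because the final file only appears after a complete write, a Pod killed mid-run leaves nothing that a retry could mistake for finished output.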