Jenkins & Enterprise CI/CD

Operating Jenkins at Scale

A Jenkins instance that serves five engineers with a handful of jobs is a toy. A Jenkins instance that serves 500 engineers running 10,000 builds per day is infrastructure — and it demands the same operational rigor you apply to any production database or Kubernetes cluster. This lesson covers the four disciplines that separate a well-run Jenkins deployment from one that collapses under growth: backup strategy, plugin management, Configuration as Code (JCasC), and high-availability architecture.

Backup Strategy: What Actually Needs Saving

Jenkins stores almost everything on disk in $JENKINS_HOME. Before writing backup scripts, understand what lives there and what the recovery cost of losing each piece is:

config.xml: the main controller configuration file (security realm, authorization strategy, global tool settings). Losing this means reconfiguring Jenkins from scratch.
jobs/: every job definition. Losing this means losing all pipeline configs, build triggers, and job history.
credentials.xml and the secrets/ directory: the encrypted credentials and the keys that decrypt them (secrets/master.key among others). Back them up together, because credentials.xml cannot be decrypted without its keys. Losing either breaks every pipeline that authenticates to anything.
plugins/: installed plugin .jpi files. You can reinstall these, but the process is time-consuming and version-sensitive.
users/: local user accounts (if using Jenkins' own user database).
jobs/*/builds/ subdirectories: historical build logs and archived artifacts. These are usually the largest part of $JENKINS_HOME and may be acceptable to lose, depending on your audit requirements.
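
Before deciding what to exclude, it helps to measure how much disk each category actually uses. The following is a minimal sketch, not a Jenkins tool: the tier names and the classify and size_by_tier helpers are hypothetical, and the mapping simply mirrors the list above.

```python
import os
from collections import defaultdict


def classify(rel_path: str) -> str:
    """Map a path relative to JENKINS_HOME to an illustrative backup tier."""
    parts = rel_path.split("/")
    if parts[0] == "jobs" and "builds" in parts:
        return "build-history"      # large, often acceptable to lose
    if parts[0] in ("workspace", "caches"):
        return "skip"               # scratch data that can be regenerated
    if parts[0] == "plugins":
        return "reinstallable"      # recoverable from a pinned manifest
    if parts[0] in ("secrets", "users", "jobs") or rel_path.endswith(".xml"):
        return "critical"           # config, job definitions, credentials
    return "other"


def size_by_tier(jenkins_home: str) -> dict:
    """Sum file sizes in bytes under jenkins_home, grouped by tier."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(jenkins_home):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, jenkins_home).replace(os.sep, "/")
            try:
                totals[classify(rel)] += os.path.getsize(full)
            except OSError:
                pass  # file vanished mid-walk, e.g. Jenkins is still running
    return dict(totals)
```

Running size_by_tier("/var/lib/jenkins") on a real controller typically shows whether excluding build history is worth the loss of audit data.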

Production pitfall: Many teams back up $JENKINS_HOME with a naive tar while Jenkins is running. The result can be an inconsistent archive that fails on restore, because Jenkins writes to several files continuously, especially the build queue and fingerprint database. Always quiesce Jenkins before archiving, or use a point-in-time snapshot at the volume level (e.g., LVM or EBS) so that every file is captured at the same instant.

The recommended backup workflow at scale uses the ThinBackup plugin or a custom script that calls the Jenkins quietDown endpoint before archiving:

#!/bin/bash
# Safe Jenkins backup script
# Run from cron; requires JENKINS_URL and JENKINS_TOKEN in environment

set -euo pipefail

JENKINS_HOME=/var/lib/jenkins
BACKUP_DIR=/mnt/backup/jenkins
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

# Leave quiet-down mode on any exit, so a failed step under `set -e`
# cannot leave the controller refusing new builds
cancel_quiet_down() {
  curl -sf -X POST "${JENKINS_URL}/cancelQuietDown" \
    --user "backup-bot:${JENKINS_TOKEN}" || true
}
trap cancel_quiet_down EXIT

# 1. Quiesce Jenkins (no new builds start)
curl -sf -X POST "${JENKINS_URL}/quietDown" \
  --user "backup-bot:${JENKINS_TOKEN}"

# Give in-flight builds up to 5 minutes to finish (30 polls, 10 s apart).
# Checks regular executors and the one-off executors used by Pipeline runs.
BUILDING=True
for i in $(seq 1 30); do
  BUILDING=$(curl -sf "${JENKINS_URL}/computer/api/json?depth=2" \
    --user "backup-bot:${JENKINS_TOKEN}" | \
    python3 -c "import sys,json; d=json.load(sys.stdin); \
    print(any(not e.get('idle', True) for c in d['computer'] \
    for e in c.get('executors',[]) + c.get('oneOffExecutors',[])))")
  [ "$BUILDING" = "False" ] && break
  sleep 10
done
[ "$BUILDING" = "False" ] || echo "Warning: builds still running; archiving anyway" >&2

# 2. Archive critical subdirectories only (skip large build logs)
mkdir -p "${BACKUP_DIR}"
tar -czf "${BACKUP_DIR}/jenkins-config-${TIMESTAMP}.tar.gz" \
  --exclude="${JENKINS_HOME}/jobs/*/builds" \
  --exclude="${JENKINS_HOME}/workspace" \
  --exclude="${JENKINS_HOME}/caches" \
  "${JENKINS_HOME}"

# 3. Cancel quiet-down so builds resume immediately
cancel_quiet_down
trap - EXIT

echo "Backup complete: jenkins-config-${TIMESTAMP}.tar.gz"

At large scale, the better pattern is to treat $JENKINS_HOME as a persistent volume on a cloud-native storage tier (EBS, Persistent Disk, Azure Disk) and take daily volume snapshots. A volume snapshot captures a single point in time, is crash-consistent, and is independent of Jenkins internals.

Plugin Management: The Root Cause of Most Outages

Jenkins' plugin ecosystem is its greatest strength and its most dangerous attack surface. Many Jenkins production outages trace back to one of three plugin failure modes: a plugin update that breaks an API another plugin depends on, a plugin that introduces a regression in pipeline execution, or a security vulnerability in an outdated plugin.

Pro practice: Pin every plugin to a specific version in version control. Treat plugin upgrades as code changes that must go through a staging environment. Never click "Update All" on a production Jenkins controller.

The Plugin Installation Manager Tool (PIMT) — jenkins-plugin-cli — lets you declare plugins in a text file and install an exact version set into a Docker image at build time. This is the production standard:

# plugins.txt — pinned plugin manifest (commit this file to git)
# Format: plugin-id:version
workflow-aggregator:596.v8c21c963d92d
git:5.2.1
credentials:1319.v7eb_51b_3a_c97b_
blueocean:1.27.9
job-dsl:1.87
configuration-as-code:1775.v810dc950b_514
kubernetes:4190.v0f7e7e
pipeline-utility-steps:2.16.2
timestamper:1.26

# Dockerfile — bake plugins at image build time
FROM jenkins/jenkins:2.440.3-lts-jdk21

USER root
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

USER jenkins
COPY plugins.txt /usr/share/jenkins/ref/plugins.txt
RUN jenkins-plugin-cli --plugin-file /usr/share/jenkins/ref/plugins.txt \
    --latest false

When a plugin needs updating, update the version pin in plugins.txt, build a new image, deploy to staging, run your pipeline smoke tests, then promote to production. The upgrade is now a code review, not a GUI click.
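
Pinning only helps if every entry is actually pinned, and a floating entry is easy to miss in review. The following is a minimal sketch of a CI check, assuming the plugin-id:version format shown above. The unpinned_plugins helper is hypothetical and treats a missing version, latest, and experimental as floating.

```python
FLOATING = ("", "latest", "experimental")


def unpinned_plugins(manifest_text: str) -> list:
    """Return plugin IDs in a plugins.txt manifest that have no fixed version."""
    bad = []
    for raw in manifest_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        plugin_id, _, version = line.partition(":")
        if version.strip() in FLOATING:
            bad.append(plugin_id.strip())
    return bad
```

Failing the image build when this list is non-empty keeps an unreviewed version from reaching the controller.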

Jenkins Configuration as Code (JCasC)

The Configuration as Code plugin transforms Jenkins' XML-based configuration into human-readable YAML that can be stored in git, reviewed, diffed, and applied automatically on startup. This solves the most persistent Jenkins operational problem: controller state drift — where the production controller has been clicked into a configuration that no one can reproduce.

JCasC GitOps loop: controller configuration is always reproducible from a git commit.

A production JCasC file for a Kubernetes-based Jenkins deployment looks like this:

# jenkins.yaml — stored in git; loaded by JCasC plugin on startup
# Secrets are injected via environment variables; never hardcoded here.
jenkins:
  systemMessage: "Managed by Configuration as Code — do not edit via UI"
  numExecutors: 0                  # controller runs no builds; agents only
  mode: EXCLUSIVE
  scmCheckoutRetryCount: 2

  securityRealm:
    ldap:
      configurations:
        - server: ldaps://ldap.corp.example.com:636
          rootDN: "dc=corp,dc=example,dc=com"
          userSearchBase: "ou=people"
          groupSearchBase: "ou=groups"
          managerDN: "cn=jenkins,ou=service-accounts,dc=corp,dc=example,dc=com"
          managerPasswordSecret: "${LDAP_MANAGER_PASSWORD}"   # env var

  authorizationStrategy:
    roleBased:
      roles:
        global:
          - name: "admin"
            permissions:
              - "Overall/Administer"
            assignments:
              - "jenkins-admins"           # LDAP group
          - name: "developer"
            permissions:
              - "Overall/Read"
              - "Job/Build"
              - "Job/Read"
            assignments:
              - "engineers"

  clouds:
    - kubernetes:
        name: "kubernetes"
        serverUrl: "https://kubernetes.default.svc"
        namespace: "jenkins"
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
        podRetention: "Never"
        templates:
          - name: "default-agent"
            label: "k8s-agent"
            containers:
              - name: "jnlp"
                image: "jenkins/inbound-agent:3248.v65ecb_254c298-1"
                resourceLimitCpu: "1000m"
                resourceLimitMemory: "2Gi"
                resourceRequestCpu: "500m"
                resourceRequestMemory: "1Gi"

unclassified:
  location:
    url: "https://jenkins.corp.example.com/"
    adminAddress: "jenkins-alerts@corp.example.com"

  slackNotifier:
    teamDomain: "mycompany"
    tokenCredentialId: "slack-token"

credentials:
  system:
    domainCredentials:
      - credentials:
          - usernamePassword:
              scope: GLOBAL
              id: "artifactory-bot"
              description: "Artifactory service account"
              username: "jenkins-bot"
              password: "${ARTIFACTORY_PASSWORD}"   # never store plaintext here

Key idea: JCasC separates structure (what kind of auth, what clouds, what job configs) from secrets (passwords, tokens). Structure goes in git. Secrets come from an external vault at runtime via environment variable injection. This pattern means your jenkins.yaml is safe to commit to a private repository.
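
The separation only holds if nobody commits a literal password. A small pre-commit check can catch the obvious cases. The sketch below is deliberately naive: it scans line by line instead of parsing YAML, and the key-name pattern (keys ending in password, secret, token, or privateKey) is an assumption for illustration, not a JCasC rule.

```python
import re

# Keys whose names END in these words are treated as secret-bearing (a guess).
SECRET_KEY = re.compile(r"(password|secret|token|privatekey)$", re.IGNORECASE)
# A value made up entirely of one ${...} reference counts as externalised.
ENV_REF = re.compile(r"^\$\{[^}]+\}$")
KEY_VALUE = re.compile(r"^\s*(?:-\s*)?([\w-]+)\s*:\s*(.+?)\s*$")


def plaintext_secrets(yaml_text: str) -> list:
    """Return (line number, key) pairs whose value looks like a literal secret."""
    findings = []
    for lineno, raw in enumerate(yaml_text.splitlines(), start=1):
        line = raw.split(" #", 1)[0]          # naive trailing-comment strip
        match = KEY_VALUE.match(line)
        if not match:
            continue
        key = match.group(1)
        value = match.group(2).strip("\"'")
        if SECRET_KEY.search(key) and not ENV_REF.match(value):
            findings.append((lineno, key))
    return findings
```

Run against the jenkins.yaml above, this should report nothing: both secret fields use ${...} references, and tokenCredentialId is skipped because it holds a credential ID, not the secret itself.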

High-Availability Considerations

Classic Jenkins has a fundamental HA limitation: the controller is a single point of failure. When it restarts, running freestyle builds abort; Pipeline builds are designed to resume after a restart, but they stall until the controller is back. When it is unavailable, no new builds start. For a 500-engineer org whose development lifecycle depends on CI, controller downtime is a P1 incident.

There are three tiers of HA approach, in increasing order of complexity and cost:

Fast restart (most common): Run Jenkins as a container or systemd service with automatic restart on failure. Store $JENKINS_HOME on a persistent volume. Target RTO under 2 minutes. This covers the large majority of incidents (controller OOM, crash, rolling upgrade).
Active/warm-standby: A second controller instance is kept warm, mounting the same persistent volume in read-only mode. On failure, the volume is re-mounted read-write on the standby. This requires a shared storage tier (AWS EFS, NFS, or a cloud-specific equivalent). In-flight builds are still interrupted, but new builds can resume in under 30 seconds.
Jenkins HA (CloudBees CI): The commercial CloudBees distribution supports a true active-active HA configuration with a distributed build queue and no single-controller SPOF. Large enterprises that cannot tolerate controller downtime typically choose this tier. The open-source Jenkins project does not have this capability.
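
Whichever tier you choose, the RTO target is only meaningful if you measure it during a drill. The sketch below polls a health probe and reports how long recovery took. It assumes that an HTTP 200 from a URL such as the controller's login page means Jenkins is ready. The http_probe and seconds_to_recover names are hypothetical, and the 120-second default budget simply mirrors the 2-minute target above.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or 4xx/5xx response


def seconds_to_recover(probe, budget_s=120.0, interval_s=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns elapsed seconds, or None if another wait would exceed the budget.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start + interval_s > budget_s:
            return None
        sleep(interval_s)
```

In a restart drill, you might call seconds_to_recover(lambda: http_probe(url)) right after killing the controller and record the result. A None return means the RTO target was missed.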

Pro practice: At the point where Jenkins downtime triggers an escalation, the right answer is usually to migrate to a cloud-native CI system (GitHub Actions, Tekton, Argo Workflows) rather than invest further in Jenkins HA. Jenkins HA is an operational investment with diminishing returns. Jenkins excels at flexibility; it does not excel at zero-maintenance availability.

Regardless of HA tier, apply these operational hygiene practices at every scale:

Run the controller with zero executors (numExecutors: 0 in JCasC). The controller process should only orchestrate; all build work goes to agents. This keeps the controller stable and prevents noisy-neighbor build load from impacting the UI and API.
Set build discarders on every job — cap build history by count and/or age. Unbounded build history will fill the disk and slow the UI.
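Monitor the Prometheus plugin's metrics endpoint (/prometheus by default) and alert on controller heap usage above 80%, build queue depth, and disk pressure on $JENKINS_HOME.
Run periodic configuration export using JCasC: curl -X POST $JENKINS_URL/configuration-as-code/export (authenticated as an administrator) and diff the output against your committed jenkins.yaml. The export includes default values and masks secrets, so expect a stable baseline of differences; any new drift usually means someone changed settings in the UI and the configuration in git is stale.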
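
The drift check in the last item can be automated with a plain text diff. The sketch below is one simple approach, not the plugin's own tooling: it compares the committed and exported YAML as text after dropping comments and blank lines. The config_drift helper is hypothetical, and a real setup would probably also filter out the known baseline differences.

```python
import difflib


def config_drift(committed: str, exported: str) -> list:
    """Return unified-diff lines between two YAML texts (empty list if none)."""
    def normalise(text: str) -> list:
        # Ignore comments, blank lines, and trailing whitespace
        return [ln.rstrip() for ln in text.splitlines()
                if ln.strip() and not ln.lstrip().startswith("#")]

    return list(difflib.unified_diff(
        normalise(committed), normalise(exported),
        fromfile="git/jenkins.yaml", tofile="live/export", lineterm=""))
```

A scheduled job could fetch the export, call config_drift, and post any non-empty result to the team channel for review.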

Together, these four disciplines — tested backups, pinned plugins, JCasC-managed configuration, and appropriate HA architecture — transform Jenkins from a fragile shared service into reliable, auditable CI infrastructure that can survive on-call rotations and company growth without heroic interventions.