Secrets Management & PKI

The Secrets Problem


Every production system has secrets: database passwords, API keys, TLS private keys, OAuth client credentials, SSH keys, cloud provider access tokens. The question is never whether your system has secrets — it does — but whether those secrets are managed deliberately or scattered accidentally across every surface your engineers have ever touched.

Many real-world breaches do not start with a zero-day exploit or a sophisticated attack. They start with a secret committed to a Git repository, baked into a Docker image, pasted into a CI environment variable with overly broad access, or left in a shell history file on a shared bastion host. This lesson maps the full attack surface so you understand exactly what you are defending against before you design a solution.

What is Secrets Sprawl?

Secrets sprawl is the condition where credentials exist in many places simultaneously, often without a centralized inventory, rotation schedule, or access audit trail. It happens organically: a developer hard-codes a database password to get something working, copies it to a CI environment variable, pastes it into a Slack DM to a colleague, and then forgets every location it landed.

At scale, sprawl looks like this: a 50-engineer company scans its GitHub history and finds 3,400 secrets across 120 repositories, many of them active credentials for production systems. The numbers are illustrative, but findings of this order are not unusual the first time a tool like trufflehog or git-secrets runs against an unmanaged codebase. Any organisation that has never run one of these scans should assume it is in this state.

Why sprawl is dangerous, not just messy: A secret in one place is a secret you can rotate. A secret in twelve places is a secret you cannot rotate safely — you do not know all the consumers, you cannot update them atomically, and you will inevitably break something, so you delay. Delay means the credential stays live for months or years after a suspected compromise.

The Six Leak Paths

Secrets escape into the wrong hands through a predictable set of vectors. Understanding each one is prerequisite to closing them.

1. Source Code Commits

The most common and most dangerous vector. A developer checks in a .env file, hardcodes a key in a config, or pushes a test file that contains a real production credential. Git remembers everything: even after a git rm and a new commit, the secret lives in the full history, cloneable by anyone with repo access — and if the repo is ever briefly public, indexed by GitHub's secret scanning and by external scrapers within minutes.

# WRONG — do NOT do this
# config/database.py
DB_PASSWORD = "hunter2_prod_2024!"
AWS_SECRET_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# Checking if a secret is in Git history (run this on any repo you inherit)
git log --all --full-history -- "**/*.env" "**/*.pem" "**/credentials*"

# Scan for secrets in full history with trufflehog
trufflehog git file://. --only-verified

The correct approach is to use environment variables or a secrets manager and never let the credential touch the file system in plaintext. But the hard problem is that preventing the commit requires tooling, policy, and culture — not just intent. Even experienced engineers commit secrets under deadline pressure. Pre-commit hooks and CI scanning are not optional at professional scale; they are table stakes.
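As a minimal sketch of the environment-variable approach, the hardcoded assignment in config/database.py could be replaced by a lookup that fails loudly when the value is missing. The helper name require_secret and the exception class are hypothetical, chosen only for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name, environ=os.environ):
    """Return the secret stored in the environment variable `name`.

    There is deliberately no default value: a misconfigured deployment
    should stop at startup rather than run with a baked-in credential.
    """
    value = environ.get(name, "")
    if not value:
        # The message names the variable but never echoes a value.
        raise MissingSecretError(f"required secret {name} is not set")
    return value


if __name__ == "__main__":
    # DB_PASSWORD is expected to be injected at deploy time,
    # not written into config/database.py.
    try:
        db_password = require_secret("DB_PASSWORD")
        print("DB_PASSWORD loaded, length:", len(db_password))
    except MissingSecretError as exc:
        print("startup aborted:", exc)
```

The lookup only moves the credential out of the repository. Where the value comes from, and who else can read it, is still a separate problem.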

2. CI/CD Environment Variables

Environment variables in CI systems (GitHub Actions secrets, GitLab CI variables, Jenkins credentials store) feel safe because they are masked in logs. They are not. The masking is cosmetic: it replaces the literal string in log output, but the plaintext value is still handed to whatever code runs in a step that receives it, and anything defined at job or workflow level reaches every step, including third-party actions or plugins you have pulled in without auditing. Code that runs earlier on the same runner can also tamper with the workspace or tooling that later steps rely on. A malicious or compromised GitHub Action can exfiltrate every secret it can reach with a single outbound HTTP request.

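# GitHub Actions: a compromised third-party action shares the runner with every other step
# "Masked" secrets are only hidden in log output; the step's processes see the plaintext
name: CI
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Runs on the same runner as the Build step below and can tamper with
      # the workspace, PATH, or GITHUB_ENV that the later step relies on
      - uses: some-vendor/action@v1   # pinned? audited? maintained?

      - name: Build
        env:
          DB_PASS: ${{ secrets.DB_PASSWORD }}       # step-level env: only this step receives it
          AWS_KEY: ${{ secrets.AWS_SECRET_KEY }}    # still exposed to anything the earlier action planted
        run: ./build.sh   # hypothetical build command; a step needs `run` or `uses`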

# BETTER: Scope secrets to the minimal step that needs them
# BETTER: Pin third-party actions to a full commit SHA, not a mutable tag
# BETTER: Use OIDC federation to eliminate long-lived credentials entirely
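
The first recommendation, scoping secrets to the one step that needs them, applies to any script that launches child processes, not only to CI configuration. The sketch below is a hypothetical helper, not part of any CI system: the names scoped_env, run_step and the SAFE_BASELINE list are assumptions for illustration, and the baseline suits Linux runners.

```python
import os
import subprocess
import sys

# Non-secret variables a typical tool needs (an assumed, Linux-oriented list).
SAFE_BASELINE = ("PATH", "HOME", "LANG")


def scoped_env(secret_names, environ=os.environ):
    """Build a child environment holding only the baseline variables
    plus the secrets explicitly requested for this one step."""
    wanted = set(SAFE_BASELINE) | set(secret_names)
    return {key: value for key, value in environ.items() if key in wanted}


def run_step(cmd, secret_names=(), environ=os.environ):
    """Run one pipeline step with a minimal environment."""
    return subprocess.run(
        cmd,
        env=scoped_env(secret_names, environ),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Simulate a runner that holds two secrets.
    parent = dict(os.environ, DB_PASS="example-db", AWS_KEY="example-aws")
    show_keys = [sys.executable, "-c", "import os; print(sorted(os.environ))"]
    # The child is given DB_PASS only; AWS_KEY never reaches it.
    result = run_step(show_keys, secret_names=["DB_PASS"], environ=parent)
    print(result.stdout)
```

An allowlist is used rather than a denylist so that a newly added secret stays hidden from every step until someone deliberately grants it.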

3. Container Images

Docker build processes are a classic credential sink. A developer adds an ARG or ENV to pull a private package, run a database migration, or authenticate an API call during build time. That value is then baked into one or more image layers and is trivially extractable by anyone who can pull the image:

# WRONG — credential baked into a layer
FROM node:20-alpine
ARG NPM_TOKEN
ENV NPM_TOKEN=$NPM_TOKEN
RUN echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > ~/.npmrc
RUN npm install

# Even if you delete .npmrc in the next layer, the token is in the ARG/ENV layer
# Extract it with:
docker history --no-trunc my-image:latest
# Or: docker save my-image:latest | tar xv

# CORRECT — use BuildKit secrets (never touches a layer)
# syntax=docker/dockerfile:1
FROM node:20-alpine
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN=$(cat /run/secrets/npm_token) && \
    echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > ~/.npmrc && \
    npm install && \
    rm ~/.npmrc

# Build with:
docker build --secret id=npm_token,env=NPM_TOKEN .
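
A quick way to catch the WRONG pattern before an image is built is to look for ARG or ENV instructions whose names suggest a credential. The following is a rough sketch under that assumption: the name fragments are an illustrative, incomplete list, the function find_risky_instructions is hypothetical, and only the first variable on each instruction line is inspected.

```python
import re

# Name fragments that suggest a credential (illustrative, not exhaustive).
SUSPICIOUS = re.compile(
    r"(TOKEN|SECRET|PASSWORD|PASSWD|API_?KEY|PRIVATE_?KEY)", re.IGNORECASE
)
# Matches "ARG NAME", "ARG NAME=value", "ENV NAME=value" and "ENV NAME value".
INSTRUCTION = re.compile(r"^\s*(ARG|ENV)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def find_risky_instructions(dockerfile_text):
    """Return (line number, instruction, variable name) for each ARG or ENV
    whose variable name looks like it carries a secret."""
    findings = []
    for lineno, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = INSTRUCTION.match(line)
        if match and SUSPICIOUS.search(match.group(2)):
            findings.append((lineno, match.group(1).upper(), match.group(2)))
    return findings


if __name__ == "__main__":
    wrong = "\n".join([
        "FROM node:20-alpine",
        "ARG NPM_TOKEN",
        "ENV NPM_TOKEN=$NPM_TOKEN",
        "RUN npm install",
    ])
    for lineno, instruction, name in find_risky_instructions(wrong):
        print(f"line {lineno}: {instruction} {name} may bake a secret into a layer")
```

A name-based check like this produces false positives and misses oddly named variables, so it complements image scanning rather than replacing it.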

4. Logs and Monitoring Systems

Application logs are another common exfiltration point. Request logs that include full URLs capture API keys passed as query parameters (?api_key=abc123). Error reports include stack traces that dump environment variables. Distributed tracing headers can carry auth tokens. All of this lands in Elasticsearch, Datadog, Splunk, or CloudWatch, systems that typically have much weaker access control than your production secrets store.
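
One partial mitigation is to redact credential-shaped values before a log record leaves the process. The sketch below uses Python's standard logging module; the patterns are assumptions that cover only query-string style parameters and bearer tokens, and the class name RedactingFilter is hypothetical.

```python
import logging
import re

REDACTIONS = [
    # api_key=..., access_token=..., password=... in URLs or form bodies
    (re.compile(r"(?i)(api_key|apikey|token|password|secret)=([^&\s]+)"),
     r"\1=[REDACTED]"),
    # Authorization: Bearer <token>
    (re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/=-]+"), r"\1 [REDACTED]"),
]


def redact(text):
    """Replace credential-shaped substrings with a fixed placeholder."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Rewrite each record's formatted message before a handler emits it."""

    def filter(self, record):
        # Format first so secrets passed as %-style arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    # Attached to the handler so every record it emits is filtered.
    handler.addFilter(RedactingFilter())
    log = logging.getLogger("app")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("GET %s", "/v1/users?api_key=abc123&page=2")
    log.info("Authorization: Bearer eyJhbGciOi.payload.signature")
```

Redaction is a safety net, not a fix: it only catches the shapes it knows about, so keeping credentials out of URLs and error payloads remains the primary control.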

5. Infrastructure State Files

Terraform state files (terraform.tfstate) contain the full plaintext output of every resource created — including generated passwords, private keys, and database connection strings. A state file stored in an S3 bucket without encryption or proper IAM policies is a complete credential dump of your infrastructure. The same applies to Ansible vault files stored with weak passphrases, CloudFormation outputs written to Parameter Store without encryption, and Helm chart values files checked into Git.
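
To see how much a state file gives away, it can help to list which attributes in it look credential-like. This sketch assumes the version 4 state layout (a top-level resources list whose entries carry instances with an attributes map); the function names and the list of sensitive name fragments are hypothetical, and only paths are reported, never values.

```python
import json
import re

# Attribute-name fragments treated as sensitive (an assumed, incomplete list).
SENSITIVE_NAME = re.compile(
    r"(password|secret|private_key|token|connection_string)", re.IGNORECASE
)


def _walk(value, path):
    """Yield (path, leaf) for every non-empty string leaf in nested JSON."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from _walk(child, path + [str(key)])
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from _walk(child, path + [str(index)])
    elif isinstance(value, str) and value:
        yield path, value


def sensitive_paths(state):
    """Return attribute paths whose names look credential-like."""
    hits = []
    for resource in state.get("resources", []):
        address = "{}.{}".format(resource.get("type", "?"), resource.get("name", "?"))
        for instance in resource.get("instances", []):
            for path, _value in _walk(instance.get("attributes") or {}, []):
                if SENSITIVE_NAME.search(".".join(path)):
                    hits.append(address + "." + ".".join(path))
    return hits


if __name__ == "__main__":
    # A made-up fragment in the shape of terraform.tfstate.
    sample = json.loads("""
    {"version": 4, "resources": [
      {"type": "aws_db_instance", "name": "main",
       "instances": [{"attributes": {"username": "admin",
                                     "password": "example-only",
                                     "port": 5432}}]}]}
    """)
    for hit in sensitive_paths(sample):
        print("plaintext in state:", hit)
```

Every path this prints is a value that anyone with read access to the state backend already holds in plaintext.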

6. Shared Secrets Between People

Slack, email, Notion, 1Password shared vaults, and shared bastion host accounts all represent the human vector: secrets passed person-to-person leave a copy in every system they transited through. There is no reliable way to rotate a secret that has been shared over Slack — you do not know who has it, who exported the conversation, or whether it landed in a third-party Slack app's storage.

Figure: The six paths through which secrets escape into attacker hands.

The Real-World Cost: Three Canonical Incidents

These are not hypotheticals. They are the pattern of real breaches:

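Uber (2022): Attacker obtained a contractor's credentials and got past MFA through repeated push prompts and social engineering, then found administrator credentials for the company's privileged access management system hardcoded in a PowerShell script on an internal network share. That single credential reportedly unlocked AWS, Google Workspace, and other internal systems. (The often-quoted figure of 57 million exposed records belongs to Uber's separate 2016 breach, which began with AWS keys found in a private GitHub repository.)
Toyota (2022): Source code for the T-Connect service, published to a public GitHub repository by a subcontractor and left there for nearly five years, contained an access key for a data server. About 296,000 customer email addresses and customer management numbers were exposed. The key stayed valid the whole time because no one knew it was in the repo.
Codecov (2021): Supply chain attack modified the Codecov bash uploader script to exfiltrate environment variables from any CI pipeline that ran it, putting the secrets of thousands of downstream customers at risk. HashiCorp, Twilio, and Rapid7 were among the companies that disclosed impact.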

Git history is permanent until you rewrite it — and rewriting is dangerous. When a secret is committed to a public repo or a repo with more than one collaborator, assume it is compromised and rotate immediately. Running git filter-repo or BFG to remove the secret from history is a destructive rewrite that invalidates every clone and fork. The operational cost of rewriting history usually exceeds the cost of rotating the credential. Rotate first, rewrite never (or only if the credential cannot be rotated).

Quantifying Your Attack Surface: Detection First

Before building a secrets management system, you must understand the current state of your secrets sprawl. Run these tools against your codebase and infrastructure before your next sprint planning session:

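# 1. Scan Git history for secrets (all branches, all time)
# trufflehog v3 is a Go binary; the PyPI package of the same name is the older v2 with a different CLI
brew install trufflehog   # or use the release binaries or Docker image from the project
trufflehog git file://. --only-verified --json | jq '.SourceMetadata.Data.Git'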

# 2. GitHub's own secret scanning (enable in repo Settings > Security)
# Covers 200+ token types; alerts immediately on push

# 3. Detect secrets in a Docker image (all layers)
# Crude check for AWS access key IDs; assumes `docker save` emits uncompressed layer tarballs
docker save myimage:latest | tar -xOf - | strings | grep -E 'AKIA[0-9A-Z]{16}'

# 4. Scan IaC / Terraform files
pip install checkov
checkov -d . --framework secrets   # runs checkov's secrets checks (CKV_SECRET_*) over the directory

# 5. Pre-commit hook to block future commits (run `pre-commit install` once per clone)
pip install detect-secrets pre-commit
detect-secrets scan > .secrets.baseline
cat > .pre-commit-config.yaml <<'EOF'
repos:
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']
EOF
pre-commit install

Enable GitHub Advanced Security secret scanning on every private repo. It runs on every push and scans for 200+ secret types (AWS, GCP, Azure, Stripe, Twilio, etc.), with few false positives for provider-issued token formats. For organisations without GitHub Advanced Security, trufflehog in CI offers comparable coverage. Either way, this scan should run on every PR as a blocking check, not as an afterthought. A PR that introduces a secret should fail CI before it can be merged.
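
A blocking check can be as small as a script that reads the scanner's JSON output and exits non-zero when anything verified turns up. The sketch below assumes one JSON object per line with the fields DetectorName, Verified and SourceMetadata.Data.Git (the last matching the jq filter used earlier); confirm these names against the trufflehog version you run. The file name gate.py is hypothetical.

```python
import json
import sys


def verified_findings(json_lines):
    """Return (detector, file, commit) for each verified finding."""
    findings = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not one JSON object per line
        if not isinstance(record, dict) or not record.get("Verified"):
            continue
        metadata = record.get("SourceMetadata") or {}
        git = (metadata.get("Data") or {}).get("Git") or {}
        findings.append((
            record.get("DetectorName", "unknown"),
            git.get("file", "?"),
            git.get("commit", "?"),
        ))
    return findings


def main(stream=sys.stdin):
    """Print one line per finding; return the process exit code."""
    findings = verified_findings(stream)
    for detector, path, commit in findings:
        # Report the location only; never print the secret itself.
        print(f"BLOCKED: {detector} credential in {path} at commit {commit[:12]}")
    return 1 if findings else 0


if __name__ == "__main__":
    sys.exit(main())
```

In a pipeline this might run as `trufflehog git file://. --json | python gate.py`, with the non-zero exit code failing the job.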

Why the Standard DevOps Stack Is Not Enough

A common mistake is believing that environment variables are "secure enough." They are not; they are just less convenient to read than a plaintext file. On Linux, any process running as the same OS user (or as root) can read another process's environment through /proc/<pid>/environ. In containers, docker inspect <container> dumps them. Kubernetes Secret resources stored in etcd are base64-encoded, not encrypted by default, and readable by anyone whose RBAC role grants get on secrets in that namespace (or across the cluster, if the role is cluster-wide). Base64 is not encryption.
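
The last point is easy to demonstrate. The sketch below decodes a map shaped like the data field of a Kubernetes Secret manifest using nothing but the standard library; the key name and value are made up, and the function name decode_secret_data is hypothetical.

```python
import base64


def decode_secret_data(data):
    """Decode every value in a Secret's `data` map back to plaintext."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


if __name__ == "__main__":
    # The shape of `data` in a Secret manifest; the value is an example only.
    manifest_data = {
        "DB_PASSWORD": base64.b64encode(b"example-password").decode("ascii"),
    }
    print("stored: ", manifest_data)
    print("decoded:", decode_secret_data(manifest_data))
```

No key or passphrase is involved at any point, which is the whole difference between an encoding and encryption.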

The next lesson establishes the principles of a proper secrets management system: centralisation, dynamic credentials, least-privilege access, full audit logging, and automatic rotation. Every one of these principles is a direct answer to one of the leak paths mapped in this lesson.