Secrets Management & PKI

Project: A Secrets Architecture

18 min Lesson 10 of 28

You have studied the individual components: Vault architecture, dynamic secrets, cloud-native stores, Kubernetes External Secrets, PKI automation, and rotation runbooks. This final lesson assembles all of them into a coherent end-to-end design — the kind a Staff Engineer would present in a design review at a company that takes security seriously. The goal is not a toy example. It is a production-grade reference architecture that you can adapt to any organisation running microservices on Kubernetes with a CI/CD pipeline.

What "end-to-end" means here: A secret is born in one authoritative place, travels to exactly the workloads that need it through verified channels, is rotated on schedule without human intervention, and its every access is logged so you can answer "who read this, when, and from where?" — for any secret, at any point in time.

The Three Planes of a Secrets Architecture

Before drawing boxes and arrows, name the planes clearly. Every secret travels across three distinct planes, and confusing them is the source of most design mistakes:

The Authority Plane — where secrets are stored and their lifecycle is governed. This is HashiCorp Vault (or AWS Secrets Manager / GCP Secret Manager for cloud-native shops). It is the single source of truth. Nothing writes secrets anywhere else.
The Distribution Plane — how secrets travel from the authority to consumers: Vault Agent sidecars, External Secrets Operator (ESO), CI OIDC token exchange, or Secrets Store CSI Driver. The distribution plane is infrastructure; application code never calls Vault directly in a well-designed system.
The Consumption Plane — where secrets are used: an application reads a file or environment variable injected by the distribution layer. The app never knows or cares where the secret came from or how it was rotated.

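This separation is the key architectural insight. When the planes are clean, you can swap the distribution mechanism without changing any application code, and you can rotate secrets in the authority plane while the distribution layer handles renewal. Applications that read secrets from files can pick up a rotated value without a restart, provided they re-read the file. Applications that read environment variables only see the new value after the process restarts, so that delivery mode needs a rolling restart triggered from the distribution layer.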
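
To make the consumption plane concrete, here is a minimal Python sketch of an application-side reader that re-reads a mounted secret file whenever it changes. The class name `FileSecret`, the helper `atomic_write` (a stand-in for whatever the distribution layer does when it rotates a file), and the file name are all hypothetical; nothing here talks to Vault.

```python
import os
import tempfile
from pathlib import Path


class FileSecret:
    """Reads a secret from a file and re-reads it when the file changes."""

    def __init__(self, path):
        self._path = Path(path)
        self._stamp = None
        self._value = None

    def get(self):
        stat = self._path.stat()
        # Modification time, size, and inode together detect an in-place edit
        # as well as an atomic replace of the file.
        stamp = (stat.st_mtime_ns, stat.st_size, stat.st_ino)
        if stamp != self._stamp:
            self._value = self._path.read_text().strip()
            self._stamp = stamp
        return self._value


def atomic_write(path, value):
    """Stand-in for the distribution layer: write a temp file, then rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as handle:
        handle.write(value)
    os.replace(tmp, path)


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "db-password")
        atomic_write(path, "first-password")
        secret = FileSecret(path)
        before = secret.get()
        atomic_write(path, "rotated-password")  # simulated rotation
        after = secret.get()                     # no restart needed
        return before, after
```

The application only ever sees a path. Whether the file was written by Vault Agent, the CSI driver, or a test harness is invisible to it, which is the property the three-plane split is meant to protect.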

Reference Architecture Diagram

End-to-end secrets architecture: Authority, Distribution, and Consumption planes with Policy and Detection cross-cutting layers.

Wiring the CI Pipeline: No Long-Lived Credentials

The CI pipeline is the most common source of secrets sprawl in organisations that have not yet invested in secrets management. The target state is: zero long-lived credentials in CI. Every secret the pipeline needs is fetched at runtime for the duration of that specific job, using a short-lived token that proves the job's identity via OIDC. Here is a GitHub Actions pattern that fetches static secrets from Vault's KV engine and short-lived AWS credentials from Vault's AWS secrets engine:

# .github/workflows/deploy.yml
# Goal: zero static secrets in GitHub repository settings.
# CI authenticates to Vault via OIDC JWT, Vault returns short-lived AWS creds.

name: Deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write   # Required: allows GitHub to mint OIDC token for this job
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      VAULT_ADDR: https://vault.internal.example.com

    steps:
      - uses: actions/checkout@v4

      # Step 1 — Exchange GitHub OIDC token for a Vault token
      - name: Authenticate to Vault via OIDC
        id: vault-auth
        uses: hashicorp/vault-action@v3
        with:
          url: ${{ env.VAULT_ADDR }}
          method: jwt
          role: github-deploy          # Vault role bound to this repo + branch
          secrets: |
            secret/data/ci/docker  registry_user | DOCKER_USER ;
            secret/data/ci/docker  registry_pass | DOCKER_PASS ;
            aws/creds/deploy-role   access_key    | AWS_ACCESS_KEY_ID ;
            aws/creds/deploy-role   secret_key    | AWS_SECRET_ACCESS_KEY

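      # AWS creds are now in env for the remaining steps of THIS job. Their TTL is set by
      # the Vault AWS role (15 min assumed here). If deploy-role issues STS credentials,
      # also map the security_token field to AWS_SESSION_TOKEN in the secrets block above.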
      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster prod \
            --service api \
            --force-new-deployment \
            --region us-east-1

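# Corresponding Vault role configuration (Terraform):
# resource "vault_jwt_auth_backend_role" "github_deploy" {
#   backend         = vault_jwt_auth_backend.github.path
#   role_name       = "github-deploy"
#   role_type       = "jwt"
#   # GitHub OIDC tokens carry an aud claim, so the role must bind one. The value below is
#   # an assumption; it must equal the audience the workflow requests (vault-action's
#   # jwtGithubAudience input).
#   bound_audiences = ["https://github.com/myorg"]
#   bound_claims = {
#     repository = "myorg/myrepo"
#     ref        = "refs/heads/main"
#   }
#   user_claim      = "actor"
#   token_policies  = ["ci-deploy-policy"]
#   token_ttl       = 900   # 15 min; the token expires on its own even if the job hangs
# }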
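
The role's bound_claims are what tie a Vault token to one repository and one branch. The Python sketch below is a simplified model of that check, not Vault's implementation: Vault first verifies the token's signature, issuer, audience, and expiry, and can also do glob matching, all of which is omitted here. The function name `claims_match` and the sample claim sets are hypothetical.

```python
def claims_match(bound_claims, token_claims):
    """Return True only if every bound claim appears in the token with an allowed value."""
    for name, allowed in bound_claims.items():
        allowed_values = allowed if isinstance(allowed, list) else [allowed]
        if token_claims.get(name) not in allowed_values:
            return False
    return True


# Same values as the Terraform role above.
BOUND = {"repository": "myorg/myrepo", "ref": "refs/heads/main"}

# Hypothetical claim sets for three different workflow runs.
main_push = {"repository": "myorg/myrepo", "ref": "refs/heads/main", "actor": "alice"}
feature_push = {"repository": "myorg/myrepo", "ref": "refs/heads/feature-x", "actor": "alice"}
fork_push = {"repository": "someone/myrepo", "ref": "refs/heads/main", "actor": "mallory"}
```

Only `main_push` satisfies the role. A job on a feature branch or in a fork receives a valid GitHub OIDC token but cannot exchange it for the deploy credentials.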

Pin vault-action to a full commit SHA, not a tag. Tags are mutable: anyone who compromises the action's repository or a maintainer account can change what @v3 points to after you have reviewed it. Use hashicorp/vault-action@<40-char-SHA>. Apply the same rule to every third-party action you use. Run pinact run to rewrite existing workflows to pinned SHAs, and let Dependabot or Renovate keep those SHAs up to date.
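
As a rough illustration of what a pinning check looks for, here is a small Python sketch that scans workflow text for `uses:` references that are not pinned to a 40-character SHA. It is a line-based heuristic rather than a YAML parser, and it does not replace pinact. The function name `unpinned_actions` is hypothetical, and the SHA in the sample is a placeholder, not a real commit.

```python
import re

USES_LINE = re.compile(r"^\s*-?\s*uses:\s*(?P<ref>\S+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return the `uses:` references that are not pinned to a full commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if not match:
            continue
        ref = match.group("ref").strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _, _, version = ref.partition("@")
        if not FULL_SHA.match(version):
            findings.append(ref)
    return findings


SAMPLE_WORKFLOW = """
steps:
  - uses: actions/checkout@v4
  - name: Authenticate to Vault via OIDC
    uses: hashicorp/vault-action@v3
  - uses: example/pinned-action@0123456789abcdef0123456789abcdef01234567
"""
```

Running `unpinned_actions(SAMPLE_WORKFLOW)` flags the two tag-pinned references and accepts the SHA-pinned one. A check like this could run as a CI step that fails the build when the returned list is non-empty.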

Wiring Kubernetes: External Secrets Operator + CSI Driver

Two patterns are standard in production Kubernetes environments, and they are complementary — not competing. Use External Secrets Operator (ESO) for secrets that need to live as native Kubernetes Secrets (because a Helm chart or operator reads them from the cluster API). Use the Secrets Store CSI Driver for secrets that must never touch etcd — high-value credentials where you need the guarantee that the secret exists only in a tmpfs mount inside the pod, disappearing when the pod terminates.

# --- Pattern A: External Secrets Operator ---
# ExternalSecret syncs Vault KV into a native K8s Secret (for legacy apps)

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 5m           # ESO re-reads Vault every 5 min
  secretStoreRef:
    name: vault-backend          # ClusterSecretStore pointing at Vault
    kind: ClusterSecretStore
  target:
    name: db-credentials         # K8s Secret created/updated by ESO
    creationPolicy: Owner
    template:
      engineVersion: v2
      data:
        DATABASE_URL: "postgresql://{{ .username }}:{{ .password }}@db.prod:5432/app"
  data:
    - secretKey: username
      remoteRef:
        key: secret/data/production/db   # Vault KV v2 path
        property: username
    - secretKey: password
      remoteRef:
        key: secret/data/production/db
        property: password

---
# --- Pattern B: Secrets Store CSI Driver ---
# Mounts Vault secret directly as a tmpfs volume — never touches etcd

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: vault-tls-cert
  namespace: production
spec:
  provider: vault
  parameters:
    vaultAddress: "https://vault.internal.example.com"
    roleName: "api-service"         # Vault K8s auth role
    objects: |
      - objectName: "tls.crt"
        secretPath: "pki/issue/api-service"
        secretKey: "certificate"
        method: "PUT"
        secretArgs:
          common_name: "api.prod.internal"
          ttl: "24h"
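      - objectName: "tls.key"
        secretPath: "pki/issue/api-service"
        secretKey: "private_key"
        method: "PUT"                 # pki/issue is a write endpoint; keep method and args
        secretArgs:                   # identical to tls.crt so the key belongs to that cert
          common_name: "api.prod.internal"   # (confirm request de-duplication for your provider version)
          ttl: "24h"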

# Pod spec — secret appears at /mnt/secrets/ as a tmpfs (ephemeral, in-memory)
# spec:
#   volumes:
#     - name: secrets-store
#       csi:
#         driver: secrets-store.csi.k8s.io
#         readOnly: true
#         volumeAttributes:
#           secretProviderClass: vault-tls-cert
#   containers:
#     - volumeMounts:
#         - name: secrets-store
#           mountPath: /mnt/secrets
#           readOnly: true
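
One caveat about the DATABASE_URL template in Pattern A: if the password contains characters that are reserved in URLs, such as @, / or :, plain string interpolation produces a URL that parses incorrectly. The Python sketch below shows the percent-encoding step that is needed; the function name `build_database_url` and the sample credentials are hypothetical, and the equivalent encoding would have to be applied wherever the URL is actually assembled.

```python
from urllib.parse import quote, unquote, urlsplit


def build_database_url(username, password, host="db.prod", port=5432, database="app"):
    """Assemble a PostgreSQL URL with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{host}:{port}/{database}"


# A generated password with reserved characters (hypothetical value).
PASSWORD = "p@ss/w:rd"

naive_url = f"postgresql://app_user:{PASSWORD}@db.prod:5432/app"
safe_url = build_database_url("app_user", PASSWORD)
```

Parsing `naive_url` with `urlsplit` yields the wrong host because the first @ and / inside the password are read as URL delimiters, while `safe_url` round-trips to the original password. Restricting the character set of generated passwords is another way to avoid the problem.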

Managing Vault Configuration as Code

The configuration of Vault itself — engines, policies, auth methods, roles — must be managed as code in Git, applied via CI, and reviewed in pull requests. Engineers who click through the Vault UI to configure policies introduce the same drift problem that un-Terraformed infrastructure creates. Every Vault policy is a security boundary; it must be peer-reviewed before it changes in production.

# terraform/vault/main.tf — manage Vault config as code

terraform {
  required_providers {
    vault = { source = "hashicorp/vault", version = "~> 4.0" }
  }
}

provider "vault" {
  address = var.vault_addr
  # Auth via AppRole from CI, not a root token
}

# Enable KV v2 secrets engine
resource "vault_mount" "kv" {
  path = "secret"
  type = "kv"
  options = { version = "2" }
}

# Enable Kubernetes auth method
resource "vault_auth_backend" "kubernetes" {
  type = "kubernetes"
}

resource "vault_kubernetes_auth_backend_config" "main" {
  backend            = vault_auth_backend.kubernetes.path
  kubernetes_host    = "https://k8s-api.internal.example.com"
  kubernetes_ca_cert = data.kubernetes_secret.vault_sa_token.data["ca.crt"]
}

# Role: api-service pods in 'production' namespace can read DB creds
resource "vault_kubernetes_auth_backend_role" "api_service" {
  backend                          = vault_auth_backend.kubernetes.path
  role_name                        = "api-service"
  bound_service_account_names      = ["api-service-sa"]
  bound_service_account_namespaces = ["production"]
  token_policies                   = ["api-service-policy"]
  token_ttl                        = 3600   # 1 hour
}

# Policy: least privilege — api-service can ONLY read its own paths
resource "vault_policy" "api_service" {
  name = "api-service-policy"
  policy = <<EOT
path "secret/data/production/db" {
  capabilities = ["read"]
}
path "secret/data/production/cache" {
  capabilities = ["read"]
}
path "pki/issue/api-service" {
  capabilities = ["create", "update"]
}
EOT
}
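
Peer review of policies is easier when the obvious problems are caught mechanically first. The following Python sketch is a rough pre-review lint, assuming policies written in the simple `path "..." { capabilities = [...] }` form used above. It uses regular expressions rather than a real HCL parser, and both the function name `lint_policy` and the choice of "risky" capabilities are assumptions for illustration.

```python
import re

PATH_BLOCK = re.compile(
    r'path\s+"(?P<path>[^"]+)"\s*\{[^}]*?capabilities\s*=\s*\[(?P<caps>[^\]]*)\]',
    re.DOTALL,
)
RISKY_CAPABILITIES = {"sudo", "delete"}  # assumed rule set; tune per organisation


def lint_policy(policy_text):
    """Return human-readable findings for wildcard paths and risky capabilities."""
    findings = []
    for block in PATH_BLOCK.finditer(policy_text):
        path = block.group("path")
        caps = set(re.findall(r'"([^"]+)"', block.group("caps")))
        # Vault policies allow a trailing * glob and + as a single-segment wildcard.
        if "*" in path or "+" in path.split("/"):
            findings.append(f"{path}: wildcard path")
        risky = sorted(caps & RISKY_CAPABILITIES)
        if risky:
            findings.append(f"{path}: risky capabilities {risky}")
    return findings


LESSON_POLICY = '''
path "secret/data/production/db" {
  capabilities = ["read"]
}
path "pki/issue/api-service" {
  capabilities = ["create", "update"]
}
'''

BROAD_POLICY = '''
path "secret/data/*" {
  capabilities = ["read", "delete"]
}
'''
```

The lesson's policy produces no findings, while the broad policy is flagged twice. A lint like this narrows what reviewers have to look at; it does not replace the review itself.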

Never use a root token outside the initial bootstrap. The Vault root token has unrestricted access to every secret and every policy. After the initial cluster configuration, revoke it immediately. All ongoing operations, including CI-driven Terraform applies, must use a scoped AppRole or OIDC token. If you ever need emergency root access again, use vault operator generate-root with a quorum of unseal key holders (recovery key holders if the cluster uses auto-unseal). Treat this process as you would a nuclear launch: it requires at least two people, every step is logged, and the resulting token is revoked immediately after use.

The Detection and Response Layer

A secrets architecture without detection is incomplete. The Vault audit log is your most important security feed: every read, write, auth failure, and policy denial lands there. Forward it to your SIEM on day one and build alerting on these signals:

Auth failures: more than 5 in 60 seconds from a single IP or against a single role. Either brute force or misconfiguration.
Secret read outside business hours on production paths: a human credential being used by an automated process, or credential exfiltration.
Root token used: always an alert, no exceptions. Either a legitimate emergency (which should be announced in advance) or an incident.
Rotation failure: a lease is approaching its TTL and has not been renewed or replaced. The app is about to fail with an auth error.
Certificate expiry within 7 days: cert-manager or Vault PKI renewal automation failed silently; this is your last human-visible warning.
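
The first signal in the list can be expressed as a sliding-window count. The Python sketch below assumes audit events have already been reduced to `(timestamp_seconds, source_ip, succeeded)` tuples in time order. That tuple shape and the function name `auth_failure_alerts` are hypothetical, and in practice this rule would normally live in the SIEM's own query language rather than in application code.

```python
from collections import defaultdict, deque


def auth_failure_alerts(events, threshold=5, window_seconds=60):
    """Return (timestamp, ip) for each burst of more than `threshold` failures
    from one IP within `window_seconds`. Events must be sorted by timestamp."""
    recent = defaultdict(deque)
    alerts = []
    for timestamp, source_ip, succeeded in events:
        if succeeded:
            continue
        failures = recent[source_ip]
        failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while failures and timestamp - failures[0] >= window_seconds:
            failures.popleft()
        if len(failures) > threshold:
            alerts.append((timestamp, source_ip))
            failures.clear()  # avoid re-alerting on the same burst
    return alerts


# Six failures in 25 seconds from one IP, six failures spread over 100 seconds
# from another, and one successful login (all values are made up).
burst = [(t, "10.0.0.9", False) for t in (0, 5, 10, 15, 20, 25)]
slow = [(t, "10.0.0.7", False) for t in (0, 20, 40, 60, 80, 100)]
login = [(30, "10.0.0.5", True)]
EVENTS = sorted(burst + slow + login)
```

With these events, only the burst from 10.0.0.9 raises an alert, at the sixth failure. The same structure keyed by role name instead of IP covers the "single role" half of the signal.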

The 2021 Twitch leak: roughly 125 GB of source code and internal data was posted on 4chan. Twitch attributed the exposure to a server configuration change that allowed improper access by an unauthorized third party. The details of Twitch's internal logging and alerting have not been published, but the general lesson holds: bulk exfiltration is detectable in real time only if someone has configured alerts on it. Audit logs that nobody alerts on are forensics, not detection. A SIEM alert on "unusual volume of Vault reads from non-service IP" is the kind of rule that fires within minutes of an exfiltration starting.

Architecture Decision Checklist

Before signing off on any secrets architecture design, verify every item on this list. Each item guards against a production failure mode covered earlier in this course:

Single source of truth: Exactly one authoritative store per environment (Vault cluster or cloud-native SM). No secrets maintained in two places.
No static secrets in CI: Every CI secret is fetched at job runtime via OIDC. Repository-level secrets contain zero credentials.
Dynamic credentials for databases: No application has a permanent database password. Vault DB engine generates credentials per-service with a TTL and rotates them automatically.
K8s Secrets encryption at rest: etcd encrypted with a KMS provider (AWS KMS, GCP Cloud KMS). ESO-synced secrets have RBAC scoped so that only the target service account can read them.
All TLS certs from PKI automation: cert-manager or Vault Agent issues and renews all internal certs. No manually issued or manually renewed certificate in the cluster.
Vault config in Git: Every Vault policy, role, and engine mount is Terraformed and peer-reviewed. No click-ops in the Vault UI for production.
Auto-unseal configured: Vault uses AWS KMS or GCP Cloud KMS for auto-unseal so a node restart does not require manual operator intervention at 3 AM.
Audit log forwarded to SIEM: All Vault audit events forwarded with alerts defined for the five signals above.
Rotation runbook tested: The last-resort manual rotation runbook (Lesson 9) has been executed in a staging environment within the last 90 days.
Blast radius documented: For every secret, the team can answer: who can read it, what breaks if it is rotated right now, and how long until full rotation propagates?

This architecture is not a theoretical ideal. Organisations that take secrets seriously run something equivalent in production, and the individual components are all documented and openly available. The differentiator is the discipline to keep all three planes clean, to manage configuration as code, and to treat every deviation from least-privilege as a P1 issue rather than a TODO comment.