Secrets Management & PKI

Automated Certificates



Manual TLS certificate management is one of the most reliable sources of production outages at scale. Engineers forget renewal deadlines, rotate the wrong cert, push to a mismatched host, or simply lose track of which service owns which cert. The solution the industry converged on is full automation: the machine requests, renews, and deploys certificates without human intervention. This lesson covers the ACME protocol, Let's Encrypt, and cert-manager, the de facto standard implementation on Kubernetes.

The ACME Protocol

ACME (Automatic Certificate Management Environment, RFC 8555) is the protocol that Let's Encrypt originated and that many public CAs now support. It defines a machine-to-machine workflow in which a client proves control of a domain to a CA and receives a signed certificate in return, with no human dashboard, no manual approval, and no waiting.

The core challenge types ACME uses to prove you own a domain are:

HTTP-01: The CA asks your client to place a token at http://<domain>/.well-known/acme-challenge/<token> and fetches it over port 80. Simple, works for most public domains. Does not work for wildcard certs or for hosts the CA cannot reach from the internet.
DNS-01: The client creates a _acme-challenge.<domain> TXT record. Supports wildcard certs (*.example.com) and hosts that are not reachable from the internet, as long as the DNS zone itself is publicly resolvable. Requires API access to your DNS provider.
TLS-ALPN-01: The CA opens a TLS connection to port 443 and negotiates the acme-tls/1 ALPN protocol. Useful when only port 443 is open. Rarely needed in practice.

Let's Encrypt rate limits matter in production. The main limits are 50 certificates per registered domain per week and 5 certificates per exact set of hostnames (duplicates) per week. If you issue per-pod or per-ephemeral-service certs, you will hit these limits fast. Use wildcard certs or an internal CA for anything high-volume.

cert-manager: The Kubernetes Certificate Operator

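cert-manager is a Kubernetes operator that automates the full certificate lifecycle: requesting, storing as Secrets, monitoring expiry, and renewing ahead of expiration (by default once two thirds of the lifetime has elapsed, which is 30 days before expiry for a 90-day certificate). It integrates with Let's Encrypt, Vault PKI, Venafi, self-signed CAs, and, through external issuers, services such as AWS Private CA, all behind a uniform API of Issuer and ClusterIssuer objects.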

Install it with the official Helm chart, one of the supported installation methods alongside the static kubectl manifests:

# Install cert-manager CRDs and the operator
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.14.5 \
  --set installCRDs=true   # chart v1.15 and later rename this to crds.enabled=true

# Verify all three pods are Running
kubectl get pods -n cert-manager
# NAME                                       READY   STATUS    RESTARTS   AGE
# cert-manager-xxxx                          1/1     Running   0          60s
# cert-manager-cainjector-xxxx               1/1     Running   0          60s
# cert-manager-webhook-xxxx                  1/1     Running   0          60s
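
The same installation can also be captured in a Helm values file, which tends to be easier to review and version than a string of --set flags. The fragment below is a sketch for the v1.14 chart, not a complete production configuration. The file name is hypothetical, the role ARN is the placeholder used elsewhere in this lesson, and the fsGroup setting is an assumption that may or may not be needed on your cluster.

```yaml
# values.yaml (hypothetical file name); pass it with: helm install ... -f values.yaml
installCRDs: true            # v1.14 chart key; charts from v1.15 use crds.enabled
serviceAccount:
  annotations:
    # IRSA: binds the cert-manager ServiceAccount to an IAM role (placeholder ARN)
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/cert-manager-dns01-role
securityContext:
  fsGroup: 1001              # assumption: lets the non-root container read the projected IRSA token
prometheus:
  enabled: true              # expose the controller's metrics endpoint
```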

Issuers: Connecting cert-manager to a CA

A ClusterIssuer is cluster-scoped and can issue certificates for any namespace — the preferred choice for platform teams. A namespaced Issuer is useful when teams need independent CA configurations.
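
As a minimal illustration of the namespaced variant, the sketch below declares a self-signed Issuer that only Certificates in the same namespace can reference. The names team-a and team-a-selfsigned are hypothetical.

```yaml
# issuer-selfsigned.yaml (hypothetical): visible only inside the team-a namespace
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: team-a-selfsigned
  namespace: team-a
spec:
  selfSigned: {}
```

A Certificate in team-a would reference it with issuerRef.kind: Issuer; Certificates in other namespaces cannot use it.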

The most common production setup uses Let's Encrypt with DNS-01 via Route 53. HTTP-01 is simpler but cannot issue wildcard certs and breaks when your ingress is behind a firewall. Here is the full ClusterIssuer manifest for both staging (always test here first) and production:

# cluster-issuer.yaml
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
    - dns01:
        route53:
          region: us-east-1
          # Use IRSA (IAM Roles for Service Accounts) — never hardcode keys
          role: arn:aws:iam::123456789012:role/cert-manager-dns01-role
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - dns01:
        route53:
          region: us-east-1
          role: arn:aws:iam::123456789012:role/cert-manager-dns01-role

Always validate against the staging CA before switching to production. The staging CA issues untrusted but structurally valid certificates — it lets you catch DNS misconfiguration, IAM permission errors, and solver bugs without burning your production rate-limit quota. Only flip to letsencrypt-prod once the staging Certificate reaches Ready=True.
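
If some hostnames can be validated over HTTP-01 while others need DNS-01, a single issuer can carry both solvers and pick one per name. The fragment below is a sketch of the spec.acme.solvers list only; the nginx ingress class and the example.com zone are assumptions carried over from this lesson's examples.

```yaml
# Fragment for spec.acme.solvers of a ClusterIssuer (illustrative only)
solvers:
# DNS-01 for names under example.com, including wildcards
- selector:
    dnsZones:
    - example.com
  dns01:
    route53:
      region: us-east-1
      role: arn:aws:iam::123456789012:role/cert-manager-dns01-role
# HTTP-01 for every name no selector matches, served through the nginx ingress class
- http01:
    ingress:
      ingressClassName: nginx
```

A solver without a selector acts as the default, so the order of entries here does not decide which one is used.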

Requesting a Certificate

Once an Issuer is ready, you declare a Certificate resource. cert-manager creates a CertificateRequest, drives the ACME Order and its Challenges to completion, and stores the resulting TLS keypair in the Kubernetes Secret named in secretName. The ingress controller or workload then reads that Secret.

Figure: cert-manager ACME issuance flow, from the Certificate resource to a TLS Secret consumed by the Ingress.

# certificate.yaml — request a wildcard cert for *.example.com
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: ingress-nginx
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - "example.com"
  - "*.example.com"
  duration: 2160h        # 90 days (Let's Encrypt maximum)
  renewBefore: 720h      # Renew 30 days before expiry

# Check issuance status
kubectl describe certificate wildcard-example-com -n ingress-nginx
# Events should show: Issuing, CertificateRequest created, Order created,
# Challenge pending, Certificate issued successfully

kubectl get secret wildcard-example-com-tls -n ingress-nginx \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

Ingress Annotation Shortcut

For simpler per-service certificates, cert-manager watches Ingress objects annotated with cert-manager.io/cluster-issuer. It automatically creates a Certificate for the listed hosts and stores it in the secret named in tls.secretName. This is useful for self-service tenant namespaces where each team manages their own ingress.

# ingress-with-tls.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: payments
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.payments.example.com
    secretName: api-payments-tls
  rules:
  - host: api.payments.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: payments-api
            port:
              number: 8080
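
For the Ingress above, the Certificate that cert-manager generates looks roughly like the sketch below. It is shown only to make the shortcut less magical; you would not apply it by hand, and the generated object also carries owner references and other metadata omitted here.

```yaml
# Approximate shape of the generated Certificate (created by cert-manager, not by you)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-payments-tls        # named after tls.secretName
  namespace: payments
spec:
  secretName: api-payments-tls
  dnsNames:
  - api.payments.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
    group: cert-manager.io
```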

Renewal Automation and Observability

cert-manager tracks each Certificate resource and starts renewal once the remaining validity falls below renewBefore. No cron jobs, no Ansible plays, no human reminders. However, you must monitor the automation itself: a failed ACME order retries quietly with exponential back-off, and if you do not alert on it, you discover the failure when the cert expires and production traffic drops.

Key metrics to scrape (cert-manager exposes a Prometheus endpoint by default):

certmanager_certificate_expiration_timestamp_seconds: alert when expiry is under 14 days away (indicates renewal failed despite retries)
certmanager_certificate_ready_status: alert when the series with condition="False" has been 1 for more than 10 minutes
certmanager_http_acme_client_request_count: monitor error rates to detect ACME API outages
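
The three alerts above could be written as Prometheus rules along the following lines. This is a sketch that assumes the Prometheus Operator (and therefore the PrometheusRule resource) is installed; the rule name, namespace, severities, and the 5xx threshold on the ACME client metric are illustrative choices, not recommendations from cert-manager.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-alerts        # hypothetical name
  namespace: monitoring            # hypothetical namespace
spec:
  groups:
  - name: cert-manager
    rules:
    - alert: CertificateExpiringSoon
      # Seconds until expiry, compared against 14 days
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days"
    - alert: CertificateNotReady
      expr: certmanager_certificate_ready_status{condition="False"} == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} has not been Ready for 10 minutes"
    - alert: AcmeClientErrors
      # Any sustained rate of 5xx responses from the ACME server
      expr: sum(rate(certmanager_http_acme_client_request_count{status=~"5.."}[5m])) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "cert-manager is receiving 5xx responses from the ACME API"
```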

Run a certificate canary in staging. A common SRE practice is to maintain a dedicated cert in a low-traffic namespace whose sole purpose is to exercise the full issuance pipeline continuously. If staging cert renewal breaks (new network policy blocks port 53, IRSA role drifted, DNS provider API changed), you catch it before it affects production certificates.
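
One way to build such a canary is a Certificate against the staging issuer with an unusually large renewBefore, so that renewal is attempted about a day after every issuance. The sketch below assumes the letsencrypt-staging ClusterIssuer from earlier in this lesson; the namespace, resource names, and hostname are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: issuance-canary
  namespace: cert-canary           # hypothetical low-traffic namespace
spec:
  secretName: issuance-canary-tls
  issuerRef:
    name: letsencrypt-staging      # staging CA, so production quota is untouched
    kind: ClusterIssuer
  dnsNames:
  - canary.example.com             # hypothetical hostname in a zone the solver can write to
  duration: 2160h                  # 90 days
  renewBefore: 2136h               # 89 days: renewal is attempted roughly one day after issuance
```

Alerting on this Certificate's Ready condition then gives an early signal that the pipeline has broken.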

Production Failure Modes

Understanding why automation fails is as important as setting it up. The most common failure modes in production environments are:

DNS propagation lag: cert-manager creates the TXT record and checks that it is visible before asking Let's Encrypt to validate. If the resolvers it queries return stale or cached negative answers (for example with split-horizon DNS), the self-check stalls and the challenge stays pending. Point the check at suitable resolvers with the controller flags --dns01-recursive-nameservers and --dns01-recursive-nameservers-only, and inspect Challenge resources with kubectl describe challenge or cmctl status certificate.
IAM permission drift: IRSA roles with Route 53 permissions get rotated or restricted, breaking DNS-01 solvers silently. Audit and alert on role policy changes.
Webhook unavailability: cert-manager's admission webhook must be reachable whenever cert-manager resources are created or updated. If the webhook or the controller is down during a node failure, no new certificates can be requested or issued. Run multiple replicas with a PodDisruptionBudget, spread across AZs.
Secret not updated in workload: cert-manager rotates the Secret, but Pods that loaded the cert at startup do not pick up the new one automatically (a mounted volume is refreshed eventually, the process's in-memory copy is not). Use Reloader (stakater/Reloader) or build reload logic into your application.
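
Two of these failure modes can be reduced through Helm values alone. The fragment below is a sketch for the jetstack/cert-manager chart; the file name, replica counts, and resolver addresses are assumptions to adapt, and spreading replicas across AZs would additionally need topology spread constraints or anti-affinity, which are left out here.

```yaml
# values-ha.yaml (hypothetical file name) for the jetstack/cert-manager chart
replicaCount: 2                 # controller replicas; leader election keeps one active
podDisruptionBudget:
  enabled: true
webhook:
  replicaCount: 3               # every webhook replica serves admission requests
  podDisruptionBudget:
    enabled: true
cainjector:
  replicaCount: 2
extraArgs:
  # Run the DNS-01 propagation self-check against public recursive resolvers only
  - --dns01-recursive-nameservers-only
  - --dns01-recursive-nameservers=8.8.8.8:53,1.1.1.1:53
```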

Reloader is essential in production for workloads that terminate TLS themselves. cert-manager updates the Kubernetes Secret when it renews a cert, but running Pods usually hold the old cert in memory. Without a controller like stakater/Reloader watching TLS Secrets and rolling the Deployment, your application keeps serving the old certificate until it expires, even though cert-manager renewed it successfully. This is the most common "cert-manager works but cert expired" incident.
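
With Reloader installed, opting a workload in is a single annotation on the Deployment. The manifest below is a sketch: the Secret name, image, and mount path are hypothetical, and it assumes the application reads its certificate from the mounted files at startup.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
  annotations:
    # Reloader triggers a rolling restart whenever this Secret changes
    secret.reloader.stakater.com/reload: "payments-api-internal-tls"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
      - name: api
        image: registry.example.com/payments-api:1.0.0   # placeholder image
        volumeMounts:
        - name: tls
          mountPath: /etc/tls                            # hypothetical path the app reads from
          readOnly: true
      volumes:
      - name: tls
        secret:
          secretName: payments-api-internal-tls          # Secret written by cert-manager
```

Reloader also supports a blanket reloader.stakater.com/auto: "true" annotation, which restarts the workload when any Secret or ConfigMap it references changes.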

Beyond Public CAs: Internal PKI with cert-manager

Not every certificate needs Let's Encrypt. Service-to-service mTLS inside a Kubernetes cluster, internal tooling, and air-gapped environments use an internal CA — typically Vault PKI (covered in lesson 7) or cert-manager's own CA issuer type. The workflow is identical: engineers declare Certificate CRDs, cert-manager issues from the configured backend, and Secrets are rotated automatically. The difference is you control the root of trust, the validity period (often much shorter — 24 hours for internal service certs), and there are no external rate limits.
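
A minimal sketch of the cert-manager CA issuer route follows: a self-signed issuer bootstraps a root certificate, a CA issuer signs from that root, and a workload requests a short-lived leaf. All resource names, the 10-year root lifetime, and the key algorithm are illustrative assumptions; a production setup would more likely keep the root offline or in Vault.

```yaml
# 1. Bootstrap issuer used only to sign the root
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap
spec:
  selfSigned: {}
---
# 2. Root CA certificate, stored as a Secret
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-root-ca
  namespace: cert-manager      # ClusterIssuers read Secrets from this namespace by default
spec:
  isCA: true
  commonName: internal-root-ca
  secretName: internal-root-ca-tls
  duration: 87600h             # 10 years, an illustrative choice
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
---
# 3. CA issuer that signs leaf certificates with the root keypair
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-root-ca-tls
---
# 4. Short-lived service certificate for mTLS
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-api-mtls
  namespace: payments
spec:
  secretName: payments-api-mtls-tls
  dnsNames:
  - payments-api.payments.svc.cluster.local
  duration: 24h
  renewBefore: 8h              # renew with a third of the lifetime remaining
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
```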

Automated certificate management is not a feature you toggle on and forget — it is a living system that requires monitoring, tested runbooks for failure scenarios, and regular validation that the automation is actually firing. When it works correctly, your team stops thinking about TLS expiry entirely. That is the goal.