Automated Certificates
Manual TLS certificate management is one of the most reliable sources of production outages at scale. Engineers forget renewal deadlines, rotate the wrong cert, push to a mismatched host, or simply lose track of which service owns which cert. The solution the industry converged on is full automation: the machine requests, renews, and deploys certificates without human intervention. This lesson covers the ACME protocol, Let's Encrypt, and the gold-standard Kubernetes implementation via cert-manager.
The ACME Protocol
ACME (Automatic Certificate Management Environment, RFC 8555) is the protocol that originated with Let's Encrypt, was standardized by the IETF, and is now supported by many public CAs. It defines a machine-to-machine workflow in which a client proves control of a domain to a CA and receives a signed certificate in return, with no human dashboard and no manual waiting.
The core challenge types ACME uses to prove you own a domain are:
- HTTP-01: The CA asks your client to place a token at http://<domain>/.well-known/acme-challenge/<token>. Simple, works for most public domains. Does not work for wildcard certs or private networks.
- DNS-01: The client creates a _acme-challenge.<domain> TXT record. Supports wildcard certs (*.example.com) and internal domains. Requires API access to your DNS provider.
- TLS-ALPN-01: The CA makes a TLS handshake to port 443 using a special ALPN extension. Useful when only port 443 is open. Rarely needed in practice.
cert-manager: The Kubernetes Certificate Operator
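cert-manager is a Kubernetes operator that automates the full certificate lifecycle: requesting, storing as Secrets, monitoring expiry, and renewing ahead of expiration. By default it renews once two-thirds of the certificate's lifetime has elapsed, which is 30 days before expiry for a 90-day Let's Encrypt certificate. It integrates with Let's Encrypt, Vault PKI, Venafi, AWS Private CA (through an external issuer), and self-signed CAs through a uniform API of Issuer and ClusterIssuer objects.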
Install it with the official Helm chart, a common choice for production installs. The project also publishes static manifests that can be applied with kubectl.
Issuers: Connecting cert-manager to a CA
A ClusterIssuer is cluster-scoped and can issue certificates for any namespace — the preferred choice for platform teams. A namespaced Issuer is useful when teams need independent CA configurations.
The most common production setup uses Let's Encrypt with DNS-01 via Route 53. HTTP-01 is simpler but cannot issue wildcard certs and breaks when your ingress is behind a firewall. Define one ClusterIssuer for staging (always test there first) and one for production.
Switch the issuer reference to letsencrypt-prod once the staging Certificate reaches Ready=True.
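A minimal sketch of a staging and production ClusterIssuer pair for DNS-01 via Route 53. The contact email, region, hosted zone ID, and account-key Secret names are placeholders chosen for illustration, and the sketch assumes cert-manager picks up AWS credentials from its service account (for example through IRSA) rather than from a static access key.

```yaml
# Staging issuer: untrusted certificates, generous rate limits. Test here first.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: platform@example.com            # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-staging-account-key  # ACME account key, created by cert-manager
    solvers:
      - dns01:
          route53:
            region: us-east-1              # assumed region
            hostedZoneID: Z0000000EXAMPLE  # placeholder hosted zone
---
# Production issuer: trusted certificates, strict rate limits.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            hostedZoneID: Z0000000EXAMPLE
```

The two issuers differ only in the ACME directory URL and the account-key Secret, which keeps the staging test representative of production.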
Requesting a Certificate
Once an Issuer is ready, you declare a Certificate resource. cert-manager creates a CertificateRequest, runs the ACME challenge, and stores the resulting TLS keypair in the Kubernetes Secret named in secretName. The Ingress then references that Secret, or the workload mounts it.
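As a rough illustration, a Certificate requesting a wildcard from the staging issuer might look like the following. The resource name, namespace, Secret name, and domains are hypothetical, and the key algorithm is one reasonable choice rather than a requirement.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com          # hypothetical name
  namespace: web             # hypothetical namespace
spec:
  secretName: example-com-tls  # Secret that will hold tls.crt and tls.key
  renewBefore: 720h            # start renewing 30 days before expiry
  dnsNames:
    - example.com
    - "*.example.com"          # wildcard names require a DNS-01 solver
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: letsencrypt-staging  # change to letsencrypt-prod after staging succeeds
    kind: ClusterIssuer
    group: cert-manager.io
```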
Ingress Annotation Shortcut
For simpler per-service certificates, cert-manager watches Ingress objects annotated with cert-manager.io/cluster-issuer. It automatically creates a Certificate for the listed hosts and stores it in the secret named in tls.secretName. This is useful for self-service tenant namespaces where each team manages their own ingress.
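A sketch of an annotated Ingress, assuming an ingress class named nginx, a backend Service called app on port 80, and a ClusterIssuer called letsencrypt-prod. All names and hosts here are illustrative.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: team-a
  annotations:
    # Tells cert-manager which ClusterIssuer to use for the tls hosts below
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx        # assumed ingress class
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls  # cert-manager creates and renews this Secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```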
Renewal Automation and Observability
cert-manager automatically polls each Certificate resource and initiates renewal when the remaining validity falls below renewBefore. No cron jobs, no Ansible plays, no human reminders. However, you must monitor the automation itself — a failed ACME order silently retries with exponential back-off, and if you do not alert on it, you discover the failure when the cert expires and production traffic drops.
Key metrics to scrape (cert-manager exposes a Prometheus endpoint by default):
- certmanager_certificate_expiration_timestamp_seconds: alert when expiry is under 14 days (indicates renewal failed despite retries)
- certmanager_certificate_ready_status: alert when a Certificate has not been Ready for more than 10 minutes (the metric carries a condition label, so match condition="False")
- certmanager_http_acme_client_request_count: monitor error rates to detect ACME API outages
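These alerts could be expressed as a Prometheus rules file along the following lines. The alert names, severities, the 15-minute rate window, and the 10% error threshold are assumptions to tune for your environment. The thresholds of 14 days and 10 minutes come from the list above.

```yaml
groups:
  - name: cert-manager
    rules:
      # Renewal should have happened long before this point
      - alert: CertificateExpiringSoon
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} expires in under 14 days"
      # The Ready condition has been False for a sustained period
      - alert: CertificateNotReady
        expr: certmanager_certificate_ready_status{condition="False"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} is not Ready"
      # Share of ACME requests returning 5xx, as a rough outage signal
      - alert: AcmeServerErrors
        expr: |
          sum(rate(certmanager_http_acme_client_request_count{status=~"5.."}[15m]))
            / sum(rate(certmanager_http_acme_client_request_count[15m])) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of ACME requests are failing with server errors"
```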
Production Failure Modes
Understanding why automation fails is as important as setting it up. The most common failure modes in production environments are:
- DNS propagation lag: cert-manager creates the TXT record and checks that it is visible before asking Let's Encrypt to verify. If your DNS TTL is high, your resolver caches negative responses, or the cluster resolver sees a different view of the zone, that check stalls or the challenge fails. Point the check at suitable resolvers with the controller's --dns01-recursive-nameservers and --dns01-recursive-nameservers-only flags, and use cmctl to inspect challenge status.
- IAM permission drift: IRSA roles with Route 53 permissions get rotated or restricted, breaking DNS-01 solvers silently. Audit and alert on role policy changes.
- Webhook unavailability: cert-manager's admission webhook must be reachable for cert issuance. If cert-manager itself is down during a node failure, no new certificates can be issued. Run cert-manager with a PodDisruptionBudget and across multiple AZs.
- Secret not updated in workload: cert-manager rotates the Secret, but processes in Pods that mount it as a volume do not automatically reload the new certificate. Use reloader (stakater/Reloader) or build reload logic into your application.
Without stakater/Reloader watching TLS Secrets and rolling the Deployment, or equivalent reload logic in the application, your application will keep serving the old certificate until it expires, even after cert-manager successfully renewed it. This is the most common "cert-manager works but cert expired" incident.
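A minimal sketch of a Deployment wired up for Reloader, assuming Reloader is already installed in the cluster. The Deployment name, image, mount path, and Secret name are hypothetical. The annotation asks Reloader to trigger a rolling restart whenever the named Secret changes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: team-a
  annotations:
    # Reloader rolls this Deployment when the listed Secret is updated
    secret.reloader.stakater.com/reload: "app-example-com-tls"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0.0  # placeholder image
          volumeMounts:
            - name: tls
              mountPath: /etc/tls                # app reads tls.crt and tls.key here
              readOnly: true
      volumes:
        - name: tls
          secret:
            secretName: app-example-com-tls      # Secret managed by cert-manager
```

A rolling restart is the blunt option. Applications that can watch the mounted files and reload the keypair in place avoid the restart entirely.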
Beyond Public CAs: Internal PKI with cert-manager
Not every certificate needs Let's Encrypt. Service-to-service mTLS inside a Kubernetes cluster, internal tooling, and air-gapped environments use an internal CA, typically Vault PKI (covered in lesson 7) or cert-manager's own CA issuer type. The workflow is identical: engineers declare Certificate resources, cert-manager issues from the configured backend, and Secrets are rotated automatically. The difference is that you control the root of trust and the validity period (often much shorter, such as 24 hours for internal service certs), and there are no external rate limits.
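A sketch of the internal variant using cert-manager's CA issuer type, assuming a CA keypair has already been stored in a Secret called internal-ca-keypair in the cert-manager namespace. The issuer, Certificate, namespace, and service names are all hypothetical, and the 8-hour renewal window is just one plausible setting for a 24-hour certificate.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-keypair  # assumed Secret holding the CA's tls.crt and tls.key
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-mtls
  namespace: payments
spec:
  secretName: payments-mtls-tls
  duration: 24h        # short-lived internal certificate
  renewBefore: 8h      # renew with a third of the lifetime remaining
  dnsNames:
    - payments.payments.svc.cluster.local
  usages:
    - server auth
    - client auth      # both usages so the cert works on either side of mTLS
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
    group: cert-manager.io
```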
Automated certificate management is not a feature you toggle on and forget — it is a living system that requires monitoring, tested runbooks for failure scenarios, and regular validation that the automation is actually firing. When it works correctly, your team stops thinking about TLS expiry entirely. That is the goal.