PKI Fundamentals
Every TLS connection you've ever made — to GitHub, to your cloud console, between microservices — relies on a trust infrastructure called Public Key Infrastructure (PKI). PKI is not just a certificate file sitting on a server; it is a chain of cryptographic proof linking your certificate all the way back to a root authority that the client already trusts. Senior DevOps engineers must understand this chain deeply, because misconfigurations at any link cause cascading production failures: browsers refuse connections, mutual-TLS (mTLS) authentication breaks, and automated rotation fails silently.
Certificate Authorities and the Chain of Trust
A Certificate Authority (CA) is an entity whose job is to sign certificates — cryptographically binding a public key to an identity. There are three tiers in a production PKI:
- Root CA: the ultimate trust anchor. Its certificate is self-signed. OS and browser vendors ship a curated list of trusted root CA certificates (the trust store). The root CA private key is kept completely offline (often in an HSM in a physically secured facility). It signs only Intermediate CA certificates, nothing else.
- Intermediate (Subordinate) CA: an online CA whose certificate is signed by the root. All day-to-day certificate issuance happens here. If an intermediate is compromised, it can be revoked without touching the root.
- Leaf (End-Entity) Certificate: the certificate installed on a server, device, or user, signed by the intermediate. This is the server certificate that openssl s_client reports when it connects to your service.
When a client (browser, curl, gRPC client) validates a certificate, it walks this chain: leaf → intermediate → root. If it can build an unbroken chain to a root it already trusts, the certificate is accepted. This is called chain building or path validation. At each step the client checks, at minimum, three things: the signature is valid, the current time falls within the certificate's validity period (it is neither expired nor not yet valid), and the certificate has not been revoked.
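To see path validation in action, here is a minimal sketch that builds a throwaway root → intermediate → leaf hierarchy and asks openssl verify to walk it, first with the intermediate supplied and then without it. It assumes OpenSSL 1.1.1 or later and the stock openssl.cnf (whose v3_ca section marks req -x509 certificates as CA:TRUE); every file name and subject is a hypothetical placeholder.

```shell
#!/bin/sh
# Sketch: build a throwaway root -> intermediate -> leaf hierarchy, then let
# openssl verify walk it. Assumes OpenSSL 1.1.1+ and the stock openssl.cnf;
# every file name and subject below is a hypothetical placeholder.
set -eu

# Root CA: self-signed (the stock config's v3_ca section marks it CA:TRUE).
openssl req -x509 -newkey rsa:2048 -nodes -keyout root.key -out root.crt \
  -days 3650 -subj "/CN=Demo Root CA"

# Intermediate CA: its CSR is signed by the root and it is marked as a CA
# (pathlen:0 means it may sign leaves but not further CAs).
printf 'basicConstraints=critical,CA:TRUE,pathlen:0\nkeyUsage=critical,keyCertSign,cRLSign\n' > intermediate.ext
openssl req -new -newkey rsa:2048 -nodes -keyout intermediate.key \
  -out intermediate.csr -subj "/CN=Demo Intermediate CA"
openssl x509 -req -in intermediate.csr -CA root.crt -CAkey root.key \
  -CAcreateserial -out intermediate.crt -days 1825 -extfile intermediate.ext

# Leaf: signed by the intermediate, never directly by the root.
printf 'basicConstraints=CA:FALSE\nsubjectAltName=DNS:api.example.com\n' > leaf.ext
openssl req -new -newkey rsa:2048 -nodes -keyout leaf.key -out leaf.csr \
  -subj "/CN=api.example.com"
openssl x509 -req -in leaf.csr -CA intermediate.crt -CAkey intermediate.key \
  -CAcreateserial -out leaf.crt -days 90 -extfile leaf.ext

# Root is the trust anchor; the intermediate is supplied as an untrusted
# link, the way a server sends it in the handshake. Prints "leaf.crt: OK".
openssl verify -CAfile root.crt -untrusted intermediate.crt leaf.crt

# Same check without the intermediate: the chain cannot be built.
if ! openssl verify -CAfile root.crt leaf.crt; then
  echo "chain incomplete: no path from leaf.crt to a trusted root"
fi
```

The second verify fails with "unable to get local issuer certificate", the same error a client reports when a server omits its intermediate.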
Subject Alternative Names (SANs)
Historically, the Common Name (CN) field was the only place to specify which hostname a certificate covered. Modern browsers require Subject Alternative Names (SANs) and ignore the CN for hostname matching; RFC 6125 likewise deprecates CN-based matching in favor of SANs. A SAN extension can hold:
- DNS: exact hostnames (api.example.com) or wildcards (*.example.com). A wildcard matches exactly one label, so *.example.com covers api.example.com but not example.com or a.b.example.com.
- IP: addresses, used for inter-pod mTLS where pod IPs are the identity
- URI: used by SPIFFE (Secure Production Identity Framework for Everyone), e.g. spiffe://cluster.local/ns/default/sa/payment-svc
- Email: addresses, for S/MIME certificates
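To inspect SANs on any certificate from the command line, run openssl x509 -in cert.pem -noout -ext subjectAltName (OpenSSL 1.1.1 or later); openssl x509 -noout -text shows them along with every other field.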
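As a sketch, assuming OpenSSL 1.1.1 or later, the script below issues a throwaway self-signed certificate carrying DNS, IP, and URI SANs, prints only its SAN extension, and uses -checkhost to show the one-label wildcard rule. The hostnames, IP address, and SPIFFE ID are placeholders.

```shell
#!/bin/sh
# Sketch: issue a throwaway cert carrying several SAN types, then inspect it.
# Assumes OpenSSL 1.1.1+ (for -addext and -ext); all names are placeholders.
set -eu

openssl req -x509 -newkey rsa:2048 -nodes -keyout san-demo.key -out san-demo.crt -days 1 \
  -subj "/CN=api.example.com" \
  -addext "subjectAltName=DNS:api.example.com,DNS:*.example.com,IP:10.0.0.12,URI:spiffe://cluster.local/ns/default/sa/payment-svc"

# Print only the SAN extension (IP entries are shown as "IP Address:...").
openssl x509 -in san-demo.crt -noout -ext subjectAltName

# Wildcards cover exactly one label: the first two names match, the last two do not.
for host in api.example.com shop.example.com example.com a.b.example.com; do
  openssl x509 -in san-demo.crt -noout -checkhost "$host"
done

# For a live endpoint (not run here):
#   openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null \
#     | openssl x509 -noout -ext subjectAltName
```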
Certificate Lifecycle
Every certificate has a hard expiry date. The lifecycle phases are:
- Generation: create a private key and a Certificate Signing Request (CSR). The CSR contains your public key and identity claims (CN, SANs). The private key never leaves your system.
- Issuance: the CA validates that the requester controls the identities in the CSR (public ACME CAs such as Let's Encrypt use DNS-01 or HTTP-01 challenges; private CAs apply internal policy), signs it, and returns the certificate. The validity period is set here.
- Deployment: the certificate and private key are loaded where TLS terminates (Nginx, a Kubernetes Secret, or a certificate issued by Vault's PKI engine). The intermediate certificate(s) must be served alongside the leaf; the root does not need to be sent, because the client must already have it in its trust store.
- Renewal: start renewal at about two-thirds of the validity period. For 90-day Let's Encrypt certs that means day 60; for 24-hour Vault-issued mTLS certs it means roughly every 16 hours, so rotation must be fully automated.
- Revocation: if a private key is compromised, the certificate is revoked via a CRL (Certificate Revocation List) or OCSP (Online Certificate Status Protocol). Revocation checking is notoriously unreliable in browsers, which is why short-lived certificates are often the better answer.
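The two-thirds renewal rule maps onto openssl x509 -checkend, which exits non-zero if a certificate expires within a given number of seconds. A minimal sketch, assuming a 90-day certificate (generated here as a throwaway self-signed demo.crt, a hypothetical name) and OpenSSL on the PATH:

```shell
#!/bin/sh
# Sketch: decide whether a certificate has entered its renewal window using
# the "renew at ~two-thirds of validity" rule. demo.crt/demo.key are placeholders.
set -eu

VALIDITY_DAYS=90                                              # assumption: a 90-day cert
RENEW_AT_DAYS=$(( VALIDITY_DAYS * 2 / 3 ))                    # day 60
RENEW_WHEN_REMAINING=$(( (VALIDITY_DAYS - RENEW_AT_DAYS) * 86400 ))  # 30 days, in seconds

# Throwaway certificate so the example is self-contained.
openssl req -x509 -newkey rsa:2048 -nodes -keyout demo.key -out demo.crt \
  -days "$VALIDITY_DAYS" -subj "/CN=demo.example.com"

# -checkend N exits 0 if the cert is still valid N seconds from now, non-zero otherwise.
if openssl x509 -in demo.crt -noout -checkend "$RENEW_WHEN_REMAINING" >/dev/null; then
  echo "outside renewal window: renew on day $RENEW_AT_DAYS of $VALIDITY_DAYS"
else
  echo "inside renewal window: renew now"
fi

# The same rule for a 24-hour certificate.
echo "24-hour certificate: renew after about $(( 24 * 2 / 3 )) hours"
```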
Generating a CSR and Self-Signed Cert
For internal services or testing, you often need to generate your own CA and issue certificates against it. The entire workflow, from CA key to signed leaf, can be done with the openssl CLI.
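A minimal sketch of that workflow, assuming OpenSSL 1.1.1 or later and the stock openssl.cnf; the CA name, file names, and hostnames are placeholders, and the 90-day leaf lifetime is an arbitrary choice.

```shell
#!/bin/sh
# Sketch of a private-CA workflow with openssl. Assumes OpenSSL 1.1.1+ and the
# stock openssl.cnf (whose v3_ca section marks req -x509 certificates CA:TRUE).
# File names, subjects, and hostnames are placeholders.
set -eu

# 1. CA: a private key plus a self-signed certificate valid for 10 years.
openssl req -x509 -newkey rsa:4096 -nodes -keyout ca.key -out ca.crt \
  -days 3650 -subj "/CN=Example Internal CA"

# 2. Service key and a CSR carrying its identity claims (CN + SANs).
openssl req -new -newkey rsa:2048 -nodes -keyout server.key -out server.csr \
  -subj "/CN=api.example.com" \
  -addext "subjectAltName=DNS:api.example.com,DNS:api-internal.example.com"

# 3. Sign the CSR with the CA. openssl x509 -req does not copy CSR extensions
#    by default, so the SANs and leaf constraints are supplied via -extfile.
cat > server.ext <<'EOF'
basicConstraints=CA:FALSE
keyUsage=critical,digitalSignature,keyEncipherment
extendedKeyUsage=serverAuth
subjectAltName=DNS:api.example.com,DNS:api-internal.example.com
EOF
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out server.crt -days 90 -sha256 -extfile server.ext

# 4. Check the result: the chain verifies and the SANs made it into the cert.
openssl verify -CAfile ca.crt server.crt
openssl x509 -in server.crt -noout -ext subjectAltName
```

On OpenSSL 3.0 and later, openssl x509 -req -copy_extensions copy can take the SANs from the CSR instead, but copying requested extensions lets a CSR ask for anything (including CA:TRUE), so an explicit extension file is the safer default for a CA.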
Common Production Failure Modes
Understanding failure modes is what separates a senior engineer from a junior one. These are the real incidents you will encounter:
- Incomplete chain served: the server sends only the leaf, not the intermediate. Desktop browsers cache intermediates and seem fine; mobile browsers, curl, and service-to-service clients fail with "unable to get local issuer certificate". Always concatenate leaf + intermediate into the bundle you serve, and use openssl s_client -connect host:443 -showcerts to verify that the full chain is sent.
- SAN mismatch: a certificate for api.example.com is deployed on a backend, but the reverse proxy in front of it connects using the name api-internal.example.com. TLS validation fails because that name is not in the SAN list. Always include all names (internal aliases, load balancer names, and pod DNS names) in the SAN list at issuance time.
- Clock skew: a certificate whose validity starts at 14:00:00 UTC is rejected as "not yet valid" by any client whose clock reads 13:59:50 UTC. NTP synchronization is a PKI prerequisite.
- Expiry surprise: certificates expire at 03:00 AM and no one notices until traffic drops and on-call is paged at 03:05 AM. The fix: monitor expiry with the Prometheus blackbox_exporter (its probe_ssl_earliest_cert_expiry metric reports the earliest expiry in the served chain as a Unix timestamp) and alert at 30 days and again at 7 days before expiry.
- Root not in trust store: an internal private CA root was never distributed to all services and container base images, so new services fail mTLS. Manage trust store distribution via configuration management (Ansible, Chef) or bake it into your base Docker image.
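The SAN-mismatch failure can be caught before deployment. This sketch, assuming OpenSSL 1.1.1 or later, checks a certificate with openssl x509 -checkhost against every name clients are expected to use; the REQUIRED_NAMES list and the demo certificate are hypothetical.

```shell
#!/bin/sh
# Sketch of a pre-deployment SAN check: every name clients use must be covered.
# Assumes OpenSSL 1.1.1+; api.crt and REQUIRED_NAMES are hypothetical.
set -eu

# Demo certificate that (deliberately) covers only the public name.
openssl req -x509 -newkey rsa:2048 -nodes -keyout api.key -out api.crt -days 90 \
  -subj "/CN=api.example.com" -addext "subjectAltName=DNS:api.example.com"

# Public name plus the internal alias the reverse proxy connects with.
REQUIRED_NAMES="api.example.com api-internal.example.com"

missing=0
for name in $REQUIRED_NAMES; do
  # -checkhost prints "Hostname <name> does match certificate" or "... does NOT match ...".
  result=$(openssl x509 -in api.crt -noout -checkhost "$name")
  case "$result" in
    *"does NOT match"*)
      echo "MISSING SAN: $name"
      missing=$((missing + 1))
      ;;
  esac
done

# A real CI gate would exit non-zero when missing > 0; the demo only reports.
echo "names missing from certificate: $missing"
```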
Monitoring Certificate Expiry in Production
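As a complement to blackbox_exporter alerting, a cron-style script can apply the same 30-day and 7-day thresholds to certificate files on disk. A minimal sketch using openssl x509 -checkend; the demo certificate, its 20-day lifetime, and the OK/WARNING/CRITICAL labels are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of a file-based expiry check with 30-day warning / 7-day critical
# thresholds. expiring.crt and the status labels are illustrative only.
set -eu

WARN_SECONDS=$(( 30 * 86400 ))   # 30 days
CRIT_SECONDS=$(( 7 * 86400 ))    # 7 days

# Demo: a certificate with 20 days left, so it lands in the WARNING band.
CERT=expiring.crt
openssl req -x509 -newkey rsa:2048 -nodes -keyout expiring.key -out "$CERT" \
  -days 20 -subj "/CN=expiring.example.com"

# $1: path to a PEM certificate. -checkend N exits non-zero if the
# certificate expires (or has already expired) within N seconds.
expiry_status() {
  if ! openssl x509 -in "$1" -noout -checkend "$CRIT_SECONDS" >/dev/null; then
    echo "CRITICAL"
  elif ! openssl x509 -in "$1" -noout -checkend "$WARN_SECONDS" >/dev/null; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

echo "$CERT: $(expiry_status "$CERT") ($(openssl x509 -in "$CERT" -noout -enddate))"
```

To check a live endpoint instead of a file, the served certificate can be piped in from s_client, e.g. openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null | openssl x509 -noout -checkend 604800.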
PKI mastery unlocks everything in the next lessons: Vault's PKI secrets engine can replace your static openssl workflow entirely, issuing certificates programmatically with TTLs as short as one hour, and automatically revoking them when a service is decommissioned. The key concepts from this lesson — the three-tier hierarchy, SAN semantics, lifecycle phases, and chain building — are the foundation you need to configure that system correctly.