Networking Essentials for DevOps

TLS & Certificates

Transport Layer Security (TLS) is the protocol that puts the padlock on every HTTPS URL. It sits between TCP (which you know from lesson 1) and the application layer, providing three guarantees simultaneously: confidentiality (no one can read the traffic), integrity (no one can tamper with it undetected), and authentication (you are actually talking to the server you think you are). Understanding TLS deeply is non-negotiable for any DevOps engineer — you will debug certificate errors, rotate expiring certs, and tune handshake performance regularly.

The TLS Handshake, Step by Step

The handshake is the negotiation that happens before a single byte of application data flows. Modern deployments almost exclusively use TLS 1.3, which is both faster and more secure than its predecessors. Here is exactly what happens:

Figure: TLS 1.3 handshake sequence (1-RTT; 0-RTT possible on resumption)
  • Client → Server: ClientHello (supported ciphers, random, key share)
  • Server → Client: ServerHello + Certificate + CertificateVerify + Finished
  • Client → Server: Finished (client confirms handshake)
  • Client ↔ Server: Encrypted Application Data (HTTP/2, gRPC, etc.)
TLS 1.3 handshake: one round-trip before encrypted data flows. TLS 1.2 needed two.

In TLS 1.3 the client sends its key share speculatively in the very first message. The server can therefore compute the shared secret immediately and send its certificate and a Finished message all in one flight. The client verifies the certificate, sends its own Finished, and encrypted data flows. That is one round-trip, half the handshake latency of TLS 1.2.
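To put rough numbers on the round-trip savings, here is a back-of-the-envelope sketch in POSIX shell. It assumes a 50 ms network round-trip (an arbitrary figure chosen for illustration), counts one round-trip for the TCP handshake and one for the first request/response, and ignores server processing time. The function name time_to_first_response is made up for this example.

```shell
# time_to_first_response RTT_MS TLS_ROUND_TRIPS
# Total = TCP handshake (1 RTT) + TLS handshake round trips + request/response (1 RTT).
time_to_first_response() {
    echo $(( $1 * (1 + $2 + 1) ))
}

RTT_MS=50   # assumed round-trip time, for illustration only

echo "TLS 1.2 full handshake:   $(time_to_first_response "$RTT_MS" 2) ms"
echo "TLS 1.3 full handshake:   $(time_to_first_response "$RTT_MS" 1) ms"
echo "TLS 1.3 0-RTT resumption: $(time_to_first_response "$RTT_MS" 0) ms"
```

Under these assumptions the TLS handshake itself is halved, while the time to the first response drops from four round-trips to three, because the TCP handshake and the request still cost one round-trip each.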

Key exchange in certificate-based TLS 1.3 handshakes always uses ephemeral Diffie-Hellman, in practice almost always the elliptic-curve variant (ECDHE). There is no option to use RSA key exchange, which means forward secrecy is mandatory: an attacker who compromises the server's private key today cannot decrypt yesterday's recorded traffic.

Certificate Chains and the Chain of Trust

A certificate on its own proves nothing unless a browser or OS already trusts the entity that signed it. The trust model works as a chain:

  • Root CA — self-signed, embedded in operating systems and browsers. Examples: ISRG Root X1 (Let's Encrypt's root), DigiCert Global Root CA.
  • Intermediate CA — signed by the root, used for day-to-day issuance. Root CAs are kept offline; intermediates do the actual signing.
  • End-entity (leaf) certificate — your server's certificate, signed by the intermediate.

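When your server sends its certificate, it must also send the intermediate(s). Some browsers can recover from a missing intermediate (by fetching it via the certificate's AIA extension or using a cached copy), but most non-browser clients, such as curl, Python, and Java, will not: they just fail. A classic production incident is deploying a cert without bundling the intermediate chain, so the site loads in a browser while API clients break. Verify with: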

    # Inspect the leaf certificate served by a live host
    openssl s_client -connect api.example.com:443 -showcerts </dev/null 2>/dev/null | \
      openssl x509 -noout -text | grep -A2 "Issuer\|Subject\|Not After"

    # Check what the server actually sends (subject and issuer of every cert in the chain)
    openssl s_client -connect api.example.com:443 -showcerts </dev/null 2>/dev/null | \
      awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/' | \
      openssl crl2pkcs7 -nocrl -certfile /dev/stdin | \
      openssl pkcs7 -print_certs -noout

    # Verify a local cert file against a CA bundle
    # (pass the intermediates with -untrusted if they are not in the bundle)
    openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt /etc/nginx/ssl/server.crt
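To see why a missing intermediate breaks validation, here is a toy model of chain building in POSIX shell. It is only a sketch: each certificate is reduced to a subject|issuer pair, no signatures or validity dates are checked, and every name in it (Example Root CA, Example Intermediate CA, issuer_of, chain_ok) is invented for illustration.

```shell
# The only trust anchor this toy client knows about.
TRUSTED_ROOT="Example Root CA"

# What a correctly configured server sends: leaf plus intermediate.
FULL_CHAIN='api.example.com|Example Intermediate CA
Example Intermediate CA|Example Root CA'

# What a misconfigured server sends: the leaf alone.
LEAF_ONLY='api.example.com|Example Intermediate CA'

# issuer_of SUBJECT CERTS: print the issuer of SUBJECT, or nothing if no such cert was sent.
issuer_of() {
    printf '%s\n' "$2" | while IFS='|' read -r subject issuer; do
        if [ "$subject" = "$1" ]; then
            printf '%s\n' "$issuer"
            break
        fi
    done
}

# chain_ok LEAF_SUBJECT CERTS: walk issuer links until a trusted root is reached.
chain_ok() {
    current=$1
    hops=0
    while [ "$hops" -lt 10 ]; do                 # guard against issuer loops
        issuer=$(issuer_of "$current" "$2")
        [ -n "$issuer" ] || return 1             # the cert for $current was never sent
        [ "$issuer" = "$TRUSTED_ROOT" ] && return 0
        current=$issuer
        hops=$(( hops + 1 ))
    done
    return 1
}

if chain_ok api.example.com "$FULL_CHAIN"; then echo "full chain: trusted"; fi
if ! chain_ok api.example.com "$LEAF_ONLY"; then echo "leaf only: cannot reach a trusted root"; fi
```

With the leaf alone, the walk stops at the intermediate because the client has no certificate for it, which is the same gap that a real verifier reports as an incomplete chain.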

SAN Certificates (Subject Alternative Names)

The CN (Common Name) field of a certificate is legacy and browsers no longer use it for hostname validation — they only check the Subject Alternative Name (SAN) extension. A single cert can protect many hostnames via multiple SANs:

  • DNS:example.com
  • DNS:www.example.com
  • DNS:api.example.com
  • DNS:*.staging.example.com (wildcard — one level only)

Wildcard SANs (*.example.com) cover exactly one subdomain level and do not match the apex (example.com) or deeper subdomains (a.b.example.com). For microservices with many unique hostnames, a single multi-SAN cert is far simpler to manage than dozens of individual certs.
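The one-level rule can be sketched as a small POSIX shell function. This is a simplified illustration, not the full matching algorithm a TLS library implements: it assumes lowercase ASCII hostnames and a wildcard that makes up the entire leftmost label. The name san_matches is hypothetical.

```shell
# san_matches PATTERN HOSTNAME
# Exit status 0 if HOSTNAME is covered by the SAN entry PATTERN, 1 otherwise.
san_matches() {
    pattern=$1
    host=$2
    case $pattern in
        '*.'*)
            suffix=${pattern#\*}              # "*.example.com" -> ".example.com"
            case $host in
                *"$suffix")
                    label=${host%"$suffix"}   # the part the "*" has to match
                    case $label in
                        ''|*.*) return 1 ;;   # empty, or more than one label deep
                        *)      return 0 ;;   # exactly one label
                    esac
                    ;;
                *) return 1 ;;                # apex or unrelated name
            esac
            ;;
        *)
            [ "$pattern" = "$host" ]          # non-wildcard entries must match exactly
            ;;
    esac
}

for h in api.example.com example.com a.b.example.com; do
    if san_matches '*.example.com' "$h"; then
        echo "$h: covered"
    else
        echo "$h: not covered"
    fi
done
```

Only api.example.com is reported as covered; the apex and the two-level name are not, which is why certs commonly list both example.com and *.example.com.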

Let's Encrypt and Automated Certificate Management (ACME)

Let's Encrypt issues free, 90-day DV (domain-validated) certificates via the ACME protocol. The short lifetime is intentional — it forces automation and limits damage from key compromise. In production, you never manually download a cert; you run an ACME client that handles issuance and renewal automatically.

The two common validation methods:

  • HTTP-01: the ACME client serves a token at http://yourdomain/.well-known/acme-challenge/<token>, and the CA fetches it over port 80. Works for any host reachable on port 80. Cannot be used for wildcards.
  • DNS-01: the ACME client creates a _acme-challenge.yourdomain TXT record, and the CA looks it up. Works for wildcards, internal hosts, and hosts without port-80 exposure. Requires DNS API access.

    # certbot: issue a cert for an nginx host (HTTP-01, auto-configures nginx)
    certbot --nginx -d example.com -d www.example.com

    # certbot: wildcard via DNS-01 (requires a DNS plugin, e.g. Route 53)
    certbot certonly \
      --dns-route53 \
      -d example.com \
      -d "*.example.com"

    # Renew all certs (run from a daily cron or systemd timer)
    certbot renew --quiet --deploy-hook "systemctl reload nginx"

    # List current certs and their expiry
    certbot certificates

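For intuition about what an ACME client actually publishes for the two challenge types, here is a minimal sketch based on the ACME specification (RFC 8555). The token and thumbprint below are made-up placeholders: a real client receives the token from the CA and derives the thumbprint from its own account key. The function names are hypothetical, and the DNS-01 part assumes the openssl command is installed.

```shell
# Placeholder inputs, for illustration only.
TOKEN="example-token"
THUMBPRINT="example-thumbprint"

# Key authorization: the token and the account-key thumbprint joined by a dot.
key_authorization() {
    printf '%s.%s' "$1" "$2"
}

# DNS-01 TXT value: unpadded base64url of the SHA-256 digest of the key authorization.
dns01_txt_value() {
    printf '%s' "$1" \
        | openssl dgst -sha256 -binary \
        | openssl base64 -A \
        | tr '+/' '-_' \
        | tr -d '='
}

KEY_AUTH=$(key_authorization "$TOKEN" "$THUMBPRINT")

echo "HTTP-01: serve this body at /.well-known/acme-challenge/$TOKEN"
echo "  $KEY_AUTH"

if command -v openssl >/dev/null 2>&1; then
    echo "DNS-01: publish this TXT value at _acme-challenge.<yourdomain>"
    echo "  $(dns01_txt_value "$KEY_AUTH")"
fi
```

In practice certbot or another ACME client computes and publishes these values for you; the sketch only shows why DNS-01 needs write access to your DNS zone while HTTP-01 needs a reachable web server.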
Renew at 60 days, not 89. Let's Encrypt certs live 90 days. Certbot's default is to renew once a certificate is within 30 days of expiry, that is, at about day 60 of its life. Keep that margin rather than shrinking it: it leaves room for automation failures, DNS propagation delays, and rate limits. Always set up monitoring on expiry separately from the renewal automation, because automation can fail silently.
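The renewal-window arithmetic can be sketched in a few lines of POSIX shell. This is an illustration of the rule, not how certbot implements it; the function name should_renew and the use of epoch seconds as inputs are assumptions made for the example.

```shell
# should_renew NOT_AFTER_EPOCH NOW_EPOCH [WINDOW_DAYS]
# Exit status 0 when the cert is within WINDOW_DAYS (default 30) of expiry.
should_renew() {
    not_after=$1
    now=$2
    window_days=${3:-30}
    remaining=$(( not_after - now ))
    [ "$remaining" -le $(( window_days * 86400 )) ]
}

# Worked example: a 90-day cert issued at t=0, checked on several days of its life.
DAY=86400
NOT_AFTER=$(( 90 * DAY ))

for day in 30 59 60 89; do
    if should_renew "$NOT_AFTER" $(( day * DAY )); then
        echo "day $day: renew"
    else
        echo "day $day: wait"
    fi
done
```

With the default 30-day window, renewal first triggers on day 60. For an independent expiry monitor against a real file, openssl x509 -checkend 2592000 -noout -in <cert> exits non-zero when the certificate expires within the next 30 days.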

Terminating TLS in Production

TLS is typically terminated at the edge (a load balancer, API gateway, or reverse proxy) so that backend services communicate over plain HTTP on a private network. This is called TLS offloading. Services that require encryption all the way to the backend (payment processors, healthcare) re-encrypt traffic behind the edge, often with mTLS (mutual TLS): both the client and server present certificates, giving strong identity on both ends. Service meshes (Istio, Linkerd) can automate mTLS between the pods in a cluster.

    # nginx: production TLS configuration (TLS 1.2 min, 1.3 preferred)
    server {
        listen 443 ssl;
        server_name api.example.com;

        ssl_certificate     /etc/nginx/ssl/api.example.com.crt;  # leaf + intermediates
        ssl_certificate_key /etc/nginx/ssl/api.example.com.key;

        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256;
        ssl_prefer_server_ciphers off;  # every listed cipher is strong, so let the client choose

        ssl_session_cache shared:SSL:10m;
        ssl_session_timeout 1d;
        ssl_session_tickets off;        # avoid long-lived ticket keys that weaken forward secrecy

        ssl_stapling on;                # OCSP stapling saves clients a separate OCSP lookup
        ssl_stapling_verify on;
        resolver 8.8.8.8 valid=60s;

        add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
    }

Diagnosing Certificate Failures

The most common certificate errors in production and how to triage them:

  • CERTIFICATE_VERIFY_FAILED / ERR_CERT_AUTHORITY_INVALID: most often the server did not send the full chain; less often the cert is self-signed or issued by a CA the client does not trust. Check with openssl s_client -showcerts.
  • ERR_CERT_DATE_INVALID / certificate has expired: usually renewal automation broke. Check certbot certificates, cron logs, and whether the reload hook ran. Also rule out a wrong clock on the client.
  • ERR_CERT_COMMON_NAME_INVALID / hostname mismatch: the hostname you're connecting to is not in the cert's SANs. Inspect with openssl x509 -noout -ext subjectAltName (the -ext option needs OpenSSL 1.1.1 or later).
  • SSL_ERROR_RX_RECORD_TOO_LONG: the server is responding with plain HTTP on a port where the client expected TLS. A classic misconfiguration: an https:// URL pointed at port 80, or a 443 listener missing the ssl flag.

    # One-liner triage: check expiry and SANs of a live host
    echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null \
      | openssl x509 -noout -dates -ext subjectAltName

    # Check OCSP status of the served certificate
    openssl s_client -servername api.example.com -connect api.example.com:443 -status </dev/null 2>/dev/null \
      | grep -i -A 10 "OCSP response"

    # curl verbose: see TLS version, cipher, and cert details in one command
    curl -vvI https://api.example.com 2>&1 | grep -E "SSL|TLS|issuer|expire|subject"

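When triaging from the command line, the "Verify return code" line printed by openssl s_client maps fairly directly onto the failure classes above. The sketch below pulls that code out of captured output and suggests a likely cause. The function names are hypothetical, the sample line stands in for real s_client output, and the suggested causes are starting points rather than a definitive diagnosis. Note that s_client only reports a hostname mismatch (code 62) when it is run with -verify_hostname.

```shell
# extract_verify_code: read s_client output on stdin, print the first verify code found.
extract_verify_code() {
    sed -n 's/.*Verify return code: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# explain_verify_code CODE: print a likely cause for an OpenSSL verify code.
explain_verify_code() {
    case $1 in
        0)     echo "ok: chain verified" ;;
        10)    echo "expired: check renewal automation and whether the reload hook ran" ;;
        18)    echo "self-signed leaf: the server is not serving a CA-issued cert" ;;
        19)    echo "untrusted root: the chain ends in a CA missing from the trust store" ;;
        20|21) echo "issuer not found: the server may be missing intermediates, or the CA is untrusted" ;;
        62)    echo "hostname mismatch: the name is not in the cert's SANs" ;;
        *)     echo "unrecognized: look up code $1 in the openssl verify documentation" ;;
    esac
}

# Sample line in the format s_client prints; replace with real captured output.
sample='    Verify return code: 21 (unable to verify the first certificate)'
code=$(printf '%s\n' "$sample" | extract_verify_code)
explain_verify_code "$code"
```

Against a live host, the same functions could be fed from the s_client command shown earlier by piping its output into extract_verify_code.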
Do not disable certificate verification in production tooling. Flags like curl -k, Python's verify=False, or PYTHONHTTPSVERIFY=0 are acceptable only in isolated local debugging. In staging or CI pipelines they mask real certificate problems that will break production. If your internal PKI is not trusted, install the CA cert into the system trust store — do not disable verification.
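One way to enforce this is a crude lint step in CI that fails when verification-disabling flags show up in scripts. The sketch below is an assumption about how such a check might look, not an established tool: the function name finds_insecure_tls is made up, and the patterns are deliberately simple, so they will miss variants such as combined short flags (curl -sk).

```shell
# finds_insecure_tls: read text on stdin, print offending lines with line numbers.
# Exit status 0 if anything was found, 1 if the input looks clean.
finds_insecure_tls() {
    grep -nE \
        -e 'curl[^|;&]*[[:space:]](-k|--insecure)([[:space:]]|$)' \
        -e 'verify[[:space:]]*=[[:space:]]*False' \
        -e 'PYTHONHTTPSVERIFY=0'
}

# Example: scan a small inline script.
script='curl -s https://api.example.com/health
curl -k https://internal.example.com/status'

if printf '%s\n' "$script" | finds_insecure_tls; then
    echo "certificate verification is disabled somewhere above"
fi
```

In a pipeline this would typically run over the repository's scripts and fail the build on a match, with an explicit allowlist for the rare local-debugging helper.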

TLS is the foundation of every secure service you will run. Get comfortable with openssl s_client as your first-response tool; it tells you more about a TLS connection in five seconds than most GUI tools will in five minutes.