Web Servers & Reverse Proxies

Tuning & Troubleshooting

A correctly installed Nginx serves requests. A correctly tuned Nginx can serve several times as many on the same hardware, and when something does go wrong, its logs usually tell you why. This lesson covers three interlocking topics that every production engineer should own: worker process and connection limits, file descriptor headroom, and systematic log analysis.

Worker Processes and Worker Connections

Nginx uses a master + worker architecture. The master process owns the listening sockets and reloads config; worker processes do the actual I/O — reading requests, calling upstreams, sending responses. Workers are single-threaded and event-driven, so they can juggle thousands of simultaneous connections each, without threads or locks.

The two directives that define your server's top-end capacity live in nginx.conf at the top level and in the events block:

# /etc/nginx/nginx.conf

# One worker per logical CPU core is the canonical baseline.
# 'auto' lets Nginx detect the core count at startup.
worker_processes auto;

events {
    # Maximum simultaneous connections per worker.
    # Total theoretical capacity = worker_processes * worker_connections.
    # On a 4-core box with worker_connections 1024: up to 4,096 connections.
    # This count includes connections to upstreams, so when proxying the
    # number of clients served is roughly half of that.
    worker_connections 1024;

    # Use epoll on Linux (the kernel's scalable I/O event interface).
    # Nginx selects this automatically, but being explicit documents intent.
    use epoll;

    # Let a worker accept all pending connections each time it is woken,
    # instead of one connection at a time. This can help under bursty
    # traffic; it is off by default, so benchmark before and after enabling.
    multi_accept on;
}

How large should worker_connections be? Each open connection consumes roughly one file descriptor and a small amount of memory (8–16 KB depending on buffer sizes). A safe ceiling is: worker_connections = (available_file_descriptors / worker_processes) * 0.9, where available_file_descriptors is the total budget you are willing to give Nginx. With a budget of 65,535 FDs and 4 workers that gives about 14,700 per worker, far above the 1024 used in the example above. Raise it to 4096 or 8192 for traffic-heavy front-ends.
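
The sizing rule above is plain arithmetic, so it can be sketched as two small shell functions. The function names are hypothetical helpers, not part of Nginx, and the FD budget is whatever total you decide Nginx may use.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nginx) mirroring the sizing rule above.

# max_worker_connections FD_BUDGET WORKERS
# Prints floor(FD_BUDGET / WORKERS * 0.9) using integer arithmetic.
max_worker_connections() {
    fd_budget=$1
    workers=$2
    echo $(( fd_budget * 9 / (workers * 10) ))
}

# total_capacity WORKERS WORKER_CONNECTIONS
# Prints the theoretical connection ceiling across all workers.
total_capacity() {
    echo $(( $1 * $2 ))
}

max_worker_connections 65535 4   # prints 14745
total_capacity 4 1024            # prints 4096
```

Treat the result as an upper bound to round down from, not a value to copy into nginx.conf verbatim.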

File Descriptors: The Hidden Bottleneck

Every open socket, file, or pipe counts against the operating system's file descriptor (FD) limit. Under load, an under-configured server will log "Too many open files" and start rejecting connections, a hard failure that looks like random timeouts to users.

There are two limits you must raise in concert:

  • OS-level (system-wide): /proc/sys/fs/file-max — the absolute ceiling for all processes on the machine.
  • Process-level (per-process): the ulimit -n of the Nginx worker process, controlled in nginx.conf via worker_rlimit_nofile.
# 1. Raise the system-wide limit persistently in /etc/sysctl.conf
fs.file-max = 200000

# Apply immediately (no reboot needed)
sysctl -p

# 2. Raise the per-process limit inside nginx.conf (main context)
worker_rlimit_nofile 65535;

# After changing nginx.conf, reload (no downtime):
nginx -t && systemctl reload nginx

# 3. Verify the limit a running worker actually has.
# pgrep can return several worker PIDs, so inspect only the first one.
grep 'open files' /proc/$(pgrep -f 'nginx: worker' | head -n 1)/limits

systemd caps can silently override your settings. If you manage Nginx with systemd (standard on Ubuntu 16.04+ and CentOS 7+), the unit has its own LimitNOFILE. Check with systemctl show nginx | grep LimitNOFILE. To raise it, create a drop-in: run systemctl edit nginx, add [Service] / LimitNOFILE=65535, then systemctl daemon-reload && systemctl restart nginx. A restart is needed here because systemd applies resource limits when it starts the service process; a reload keeps the existing master process and its old limit. Set both the systemd limit and worker_rlimit_nofile: they are independent knobs.
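
As a rough illustration of keeping the two knobs aligned, the sketch below compares a LimitNOFILE line, in the KEY=VALUE form that systemctl show prints, against a worker_rlimit_nofile value. The function name limits_aligned is hypothetical, and both inputs are passed in by hand rather than read from a live system.

```shell
#!/bin/sh
# limits_aligned SYSTEMD_LINE NGINX_LIMIT   (hypothetical helper)
# SYSTEMD_LINE is one line in the form printed by
#   systemctl show nginx --property=LimitNOFILE
# for example "LimitNOFILE=65535". NGINX_LIMIT is your worker_rlimit_nofile.
limits_aligned() {
    systemd_limit=${1#LimitNOFILE=}   # strip the "LimitNOFILE=" prefix
    nginx_limit=$2
    case "$systemd_limit" in
        ''|*[!0-9]*)
            # Non-numeric value: nothing to compare numerically.
            echo "unknown: systemd=$systemd_limit"
            return
            ;;
    esac
    if [ "$systemd_limit" -ge "$nginx_limit" ]; then
        echo "ok: systemd=$systemd_limit nginx=$nginx_limit"
    else
        echo "mismatch: systemd=$systemd_limit is below nginx=$nginx_limit"
    fi
}

limits_aligned "LimitNOFILE=1024" 65535
```

On a real host the first argument could come from "$(systemctl show nginx --property=LimitNOFILE)".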

Reading Nginx Access Logs

The access log is your real-time window into production traffic. Every request Nginx handled appears here. The default combined log format captures the IP, timestamp, HTTP method, URI, status code, bytes sent, referer, and user-agent — enough to reconstruct exactly what happened.

# Live-tail the access log (essential during an incident)
tail -f /var/log/nginx/access.log

# Count HTTP 5xx errors in the previous full minute (useful in a shell alert loop).
# 'date -d' is GNU date syntax.
awk -v d="$(date -d '1 minute ago' +'%d/%b/%Y:%H:%M')" 'index($4, d) && $9 ~ /^5/' /var/log/nginx/access.log | wc -l

# Top 10 slowest requests (requires an rt=$request_time field in the log format)
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^rt=/) print substr($i, 4), $7 }' /var/log/nginx/access.log | sort -rn | head -10

# Find which IPs are hammering the server in the most recent 10,000 requests
tail -n 10000 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20

Add $request_time and $upstream_response_time to your log format. The default combined format does not include these. Without them you cannot distinguish a slow backend (both values high) from time spent elsewhere, such as Nginx itself or a slow client connection (high $request_time, low $upstream_response_time). Add a custom log format in nginx.conf and use it on all production server blocks:
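# In the http {} context:
log_format timed_combined '$remote_addr - $remote_user [$time_local] '
    '"$request" $status $body_bytes_sent '
    '"$http_referer" "$http_user_agent" '
    'rt=$request_time urt=$upstream_response_time';

# In each server {} block (or once in http {} to apply everywhere):
access_log /var/log/nginx/access.log timed_combined;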
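
With the rt= and urt= fields in place, the "Nginx or backend?" question can be answered mechanically. The following is a minimal sketch, assuming log lines in the timed_combined layout above; classify_slow is a hypothetical helper name and the 1-second threshold and sample lines are arbitrary illustrations.

```shell
#!/bin/sh
# classify_slow THRESHOLD_SECONDS   (hypothetical helper)
# Reads timed_combined log lines on stdin and labels every request whose
# rt= value exceeds the threshold.
classify_slow() {
    awk -v limit="$1" '
    {
        rt = ""; urt = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^rt=/)  rt  = substr($i, 4)
            if ($i ~ /^urt=/) urt = substr($i, 5)
        }
        if (rt == "" || rt + 0 <= limit + 0) next
        # urt is "-" when no upstream was contacted for the request.
        if (urt != "-" && urt + 0 > limit + 0) where = "backend"
        else where = "nginx-or-client"
        print where, $7, "rt=" rt, "urt=" urt
    }'
}

classify_slow 1 <<'EOF'
203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /api/report HTTP/1.1" 200 512 "-" "curl/8.0" rt=2.310 urt=2.305
203.0.113.6 - - [10/Oct/2024:13:55:37 +0000] "GET /big.iso HTTP/1.1" 200 987654 "-" "curl/8.0" rt=4.002 urt=-
203.0.113.7 - - [10/Oct/2024:13:55:38 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/8.0" rt=0.001 urt=0.001
EOF
```

The first sample line is labelled backend, the second nginx-or-client, and the third is skipped as fast. When a request tries several upstreams, $upstream_response_time holds a comma-separated list; this sketch only looks at the first value.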

Reading Nginx Error Logs

Error logs capture everything Nginx could not handle gracefully: upstream timeouts, worker crashes, permission errors, config mistakes picked up at runtime, and resource exhaustion. The default level is error; you can temporarily increase verbosity to warn or info for debugging, but never leave debug on in production: it logs many lines per request and can fill a disk quickly. The debug level also only produces output if Nginx was built with --with-debug.

# Watch errors in real time
tail -f /var/log/nginx/error.log

# Common patterns to grep for during an incident:
grep 'connect() failed' /var/log/nginx/error.log        # upstream refused or unreachable
grep -i 'too many open files' /var/log/nginx/error.log  # FD limit hit (logged with a capital T)
grep 'worker_connections' /var/log/nginx/error.log      # connection cap hit
grep 'upstream timed out' /var/log/nginx/error.log      # backend latency spike
grep 'no live upstreams' /var/log/nginx/error.log       # all backend servers failed

# Temporarily raise log verbosity for one vhost (a reload is enough, no restart):
# Set 'error_log /var/log/nginx/debug.log debug;' inside the server{} block,
# then reload. Revert immediately after capturing the trace.
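
During an incident it can be quicker to count all of these patterns in one pass than to grep for each by hand. A minimal sketch, where triage_errors is a hypothetical helper name and the pattern list is simply the one used above:

```shell
#!/bin/sh
# triage_errors FILE   (hypothetical helper)
# Prints "<count> <pattern>" for each incident pattern listed above.
triage_errors() {
    for pattern in 'connect() failed' 'too many open files' \
                   'worker_connections' 'upstream timed out' \
                   'no live upstreams'; do
        # -F: treat the pattern as a fixed string; -i: ignore case;
        # -c: print the number of matching lines (0 when none match).
        count=$(grep -F -i -c -- "$pattern" "$1" || true)
        printf '%s %s\n' "$count" "$pattern"
    done
}
```

Typical use would be triage_errors /var/log/nginx/error.log; a single non-zero line usually tells you which section of this lesson to revisit first.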

Putting It All Together: A Tuning Checklist

When preparing a server for production or diagnosing a performance regression, run through this sequence:

  1. Run nproc and confirm worker_processes auto; is set — or pin it explicitly.
  2. Check worker_connections: for a front-end reverse proxy serving 10 k+ concurrent users, 4096–8192 is a good starting point.
  3. Verify worker_rlimit_nofile matches or exceeds worker_connections * 2 (a proxied request holds one descriptor for the client connection plus one for the upstream).
  4. Confirm the systemd LimitNOFILE is aligned.
  5. Add $request_time and $upstream_response_time to your log format before you go live — retrofitting logs after an incident is too late.
  6. Set up log rotation with logrotate (included in most distros) to prevent access logs from consuming all disk space over weeks of traffic.
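
Step 3 of the checklist lends itself to a quick automated check. The sketch below reads Nginx config text from a file and compares the two directives; directive_value and check_fd_headroom are hypothetical helper names, and the parser only handles simple "name value;" directives that sit on their own line.

```shell
#!/bin/sh
# directive_value NAME   (hypothetical helper)
# Reads Nginx config text on stdin and prints the value of the first
# simple "NAME value;" directive whose name starts the line.
directive_value() {
    awk -v name="$1" '$1 == name { sub(/;.*$/, "", $2); print $2; exit }'
}

# check_fd_headroom CONFIG_FILE   (hypothetical helper)
# Checks that worker_rlimit_nofile >= 2 * worker_connections.
check_fd_headroom() {
    conns=$(directive_value worker_connections < "$1")
    nofile=$(directive_value worker_rlimit_nofile < "$1")
    if [ -z "$conns" ] || [ -z "$nofile" ]; then
        echo "unknown: directive missing"
    elif [ "$nofile" -ge $(( conns * 2 )) ]; then
        echo "ok: $nofile >= 2 * $conns"
    else
        echo "too low: $nofile < 2 * $conns"
    fi
}
```

One way to use it is to save a config dump first, for example nginx -T > /tmp/nginx-dump.conf 2>/dev/null, and then run check_fd_headroom /tmp/nginx-dump.conf.
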
Use nginx -T (capital T) to dump the full parsed config. This is the single most effective way to confirm that an include, a server block, or a directive is actually present in the configuration Nginx will load. Like nginx -t, it tests the files on disk, so if you have edited them since the last reload, the dump can differ from what the running workers are using.