Databases in Production

Connection Management & Pooling

18 min Lesson 6 of 30

Connection Management & Pooling

Every PostgreSQL or MySQL connection is a heavyweight OS process (or thread). At peak traffic, naive application designs open a new connection per request — and at scale, that turns into a production outage faster than almost any other database anti-pattern. Understanding how connections are actually managed, and why a pooler belongs between your application and your database, is a non-negotiable skill for any engineer running databases at production scale.

What max_connections Actually Means

In PostgreSQL, max_connections is a hard cap on the total number of backend processes the server will accept — including your application, replication standbys, background workers, and your own admin sessions. The default is 100. Each connection consumes roughly 5–10 MB of RAM for its backend process plus shared memory overhead. On a 16 GB instance, allowing 500 connections consumes 2.5–5 GB before a single query executes.

The critical insight: most connections are idle most of the time. A typical web application with 200 concurrent users does not need 200 active database connections — it needs at most a fraction of that, because database operations last milliseconds. But without a pooler, most frameworks open and hold one connection per thread or goroutine, which means 200 threads means 200 connections.

MySQL uses max_connections the same way, but its per-connection cost differs because MySQL uses a thread model rather than a forked-process model. The memory math is still critical — MySQL allocates per-thread buffers (sort_buffer_size, join_buffer_size, etc.) on demand, so 1,000 connections executing large sorts can exhaust RAM abruptly.

The "too many connections" cascade: When a database hits max_connections, new connection attempts are rejected with an error — not queued. Your application now throws exceptions for every request. The exceptions generate retries, which generate more connection attempts, which are also rejected. This positive-feedback loop can hold a service down for minutes even after the root cause is resolved. A connection pooler prevents this by queuing new requests at the pooler level instead of hammering the database.

The Pooler Tier: PgBouncer and ProxySQL

PgBouncer is the de-facto standard connection pooler for PostgreSQL. It is a single-threaded C process that sits between your application and PostgreSQL. PgBouncer maintains a small number of long-lived server connections and multiplexes thousands of short-lived client connections across them. It is lightweight enough to run on the same host as the database without meaningful CPU impact.

PgBouncer operates in three modes:

Session pooling: A server connection is assigned to a client for the duration of the client session. This is the safest mode and compatible with all PostgreSQL features (including SET, LISTEN, and prepared statements), but provides the least multiplexing benefit.
Transaction pooling: A server connection is assigned only for the duration of a transaction, then returned to the pool. This is the recommended production mode for most workloads — it allows hundreds of client connections to share a small server pool. Restrictions: prepared statements and advisory locks must be managed at the application layer, or you use PgBouncer 1.21+ with its server-side prepared-statement tracking.
Statement pooling: A server connection is returned after each statement. Rarely used because it breaks multi-statement transactions.

ProxySQL is the equivalent for MySQL and MariaDB, but with a significantly broader feature set: query routing rules, read/write splitting, query rewriting, multiplexed connections, and rich observability via its internal stats tables. ProxySQL is commonly used in front of Galera clusters and MySQL Group Replication to route writes to the primary and reads to replicas transparently.

PgBouncer in transaction mode: 500 application client connections multiplexed onto 25 real PostgreSQL server connections.

Configuring PgBouncer for Production

The minimal production PgBouncer configuration for a PostgreSQL primary. This file lives at /etc/pgbouncer/pgbouncer.ini:

[databases]
# Route appdb on port 5432 of pgbouncer to the real DB host
appdb = host=pg-primary.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432

# Use auth_query instead of a flat file in modern setups
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Transaction pooling: the recommended mode for most web workloads
pool_mode = transaction

# Max client connections PgBouncer will accept
max_client_conn = 2000

# Server connections per database/user pair
default_pool_size = 25

# Connections to keep warm (avoid cold-start latency spikes)
min_pool_size = 5

# Reserve pool for admin/emergency use
reserve_pool_size = 5
reserve_pool_timeout = 3

# Kill idle server connections after 10 min (avoids stale TCP)
server_idle_timeout = 600

# Kill client connections idle for 5 min (prevents ghost connections)
client_idle_timeout = 300

# Timeouts to prevent slow clients from holding server conns
server_connect_timeout = 10
query_timeout = 30
query_wait_timeout = 10

# Log only slow queries (reduce noise)
log_connections = 0
log_disconnections = 0
log_pooler_errors = 1

# Admin interface — restrict to localhost
admin_users = pgbouncer_admin
stats_users = stats_user

The userlist file: Generate MD5 or SCRAM hashes for PgBouncer with echo "md5$(echo -n 'passwordusername' | md5sum | awk '{print $1}')". For SCRAM, use psql -c "SELECT rolpassword FROM pg_authid WHERE rolname = 'appuser';" on the database and copy the SCRAM-SHA-256$... string directly into userlist.txt. In production, use auth_query to have PgBouncer query the database for credentials directly, eliminating the sync problem entirely.

Configuring ProxySQL for MySQL

ProxySQL is configured via its admin interface (a MySQL-protocol connection on port 6032) rather than a config file, which makes it suitable for programmatic management and Terraform. The key tables are mysql_servers, mysql_users, and mysql_query_rules.

-- Connect to ProxySQL admin (port 6032, not 3306)
-- mysql -u admin -padmin -h 127.0.0.1 -P 6032

-- Add backend servers
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight, max_connections)
VALUES
  (10, 'mysql-primary.internal', 3306, 1000, 200),   -- writes group
  (20, 'mysql-replica1.internal', 3306, 500, 400),   -- reads group
  (20, 'mysql-replica2.internal', 3306, 500, 400);

-- Add application user (must exist on the MySQL backend too)
INSERT INTO mysql_users (username, password, default_hostgroup, transaction_persistent)
VALUES ('appuser', 'hashed_password_here', 10, 1);
-- transaction_persistent=1: keep the whole transaction on one backend

-- Route writes to primary (hostgroup 10), reads to replicas (hostgroup 20)
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
  (1, 1, '^SELECT', 20, 1),    -- SELECTs go to read group
  (2, 1, '.*', 10, 1);         -- everything else goes to write group

-- Apply and persist
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL USERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

-- Monitor connection pool status
SELECT hostgroup, srv_host, status, ConnUsed, ConnFree, ConnOK, ConnERR
FROM stats_mysql_connection_pool;

Pool Sizing: The Formula and the Reality

The most-cited pool sizing rule comes from the HikariCP documentation, which itself derives from database benchmarking research: pool size = (core count × 2) + effective spindle count. For a typical 8-core application server with SSD storage, that yields a pool of around 17 connections. This number shocks engineers who assume bigger is better.

The logic: a CPU can only execute one thread at a time per core. If you have more database connections than CPU cores on the database host, you are simply serializing work on the database side with more overhead (context switching, memory pressure) than fewer connections would incur. Saturating the database CPU is the failure mode, not the goal.

In practice, size your pool by starting with the formula and validating against the following signals in production:

pg_stat_activity active count — how many connections are genuinely executing queries (state = 'active') vs idle?
PgBouncer cl_waiting — clients waiting for a server connection. If this is non-zero, your pool is undersized for peak load.
Database CPU utilization — if it is under 60% at peak, you have headroom; if it is above 80%, adding more pool connections will make latency worse, not better.
Query latency p99 — should not degrade as pool size increases. If it does, you are thrashing the database.

RDS/Aurora connection limits: AWS caps max_connections based on instance RAM using the formula LEAST({DBInstanceClassMemory/9531392}, 5000). A db.t3.micro (1 GB RAM) gets approximately 87 connections. With 3 application servers each holding a pool of 30, you have already used all of them. RDS Proxy (built on PgBouncer/ProxySQL architecture) exists precisely for this problem and is the recommended solution for Lambda or ECS workloads that open many short-lived connections.

Production Failure Modes and How to Detect Them

Three connection-related failure modes recur across production incidents at every scale:

1. Connection leaks: Application code acquires a connection and fails to release it (uncaught exception, missing finally block, ORM bug). Connections accumulate until the pool is exhausted. Detect with: SELECT count(*), state FROM pg_stat_activity GROUP BY state; — a growing idle count with no corresponding load is a leak.

2. Long-running idle-in-transaction: A connection starts a transaction, does some work, then hangs (waiting for user input, blocked on an external API call). The transaction holds locks. Other queries queue behind it. The pool looks "full" even though most connections are idle. Detect with:

-- Find connections idle in transaction for more than 30 seconds
SELECT pid, usename, application_name, state, wait_event_type, wait_event,
       now() - state_change AS idle_duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '30 seconds'
ORDER BY idle_duration DESC;

-- Terminate a specific offending backend (replace PID)
SELECT pg_terminate_backend(12345);

-- PgBouncer: set this to kill idle-in-transaction connections automatically
-- In pgbouncer.ini:
--   idle_transaction_timeout = 30

3. Pool starvation under burst traffic: A sudden spike in request rate causes all pool connections to be in use simultaneously. New requests queue at the pooler. If the queue exceeds query_wait_timeout, PgBouncer returns an error to the client. This is correct behaviour — the pooler protects the database — but the application must handle this error gracefully (exponential backoff, circuit breaker) rather than crashing.

Kubernetes sidecar pattern: In Kubernetes, deploy PgBouncer as a sidecar in each application pod rather than as a shared service. This simplifies auth (each pod connects to the database using its own credentials via the sidecar) and eliminates the pooler as a single point of failure. The trade-off is that the database sees more distinct connection sources. For PostgreSQL with RDS, use RDS Proxy instead — it is the AWS-managed equivalent and integrates with IAM auth.

Connection management is not glamorous infrastructure. It rarely comes up in architecture reviews. But pool exhaustion is one of the most common causes of database-induced outages at scale — and a correctly tuned pooler in front of your database is one of the highest-leverage reliability improvements you can make with a single afternoon of work.