Database Observability
Database Observability
An on-call engineer who cannot answer "which query is killing us right now?" in under 60 seconds is flying blind. Database observability is not just about dashboards — it is a disciplined set of signals, collection pipelines, and alert rules that let you go from "something is wrong" to "this specific query, holding this lock, on this replica, with this lag" in minutes rather than hours. This lesson covers the four pillars that matter most in production: slow-query telemetry, lock analysis, replication-lag tracking, and query-stats aggregation.
Slow Query Telemetry
Every production database engine has a built-in mechanism to log queries that exceed a configurable threshold. Enabling this correctly is the single highest-return observability investment you can make.
PostgreSQL — log_min_duration_statement
Set log_min_duration_statement = 1000 (milliseconds) in postgresql.conf or via ALTER SYSTEM to log any statement taking longer than one second. In high-traffic environments start at 500 ms and lower as you build confidence. The log line includes duration, query text, bind parameters (if log_min_duration_statement is combined with log_parameters), and the query plan can be captured by pairing with auto_explain.
The shared_blks_read column reveals cache-miss pressure — a query with high reads relative to hits is pulling data from disk, a red flag regardless of its wall-clock time. Reset the stats between deploys with SELECT pg_stat_statements_reset(); to get a clean baseline.
MySQL / MariaDB — Slow Query Log
Set slow_query_log = ON, long_query_time = 1, and log_queries_not_using_indexes = ON. Parse aggregated reports with pt-query-digest (Percona Toolkit) — it normalizes query fingerprints, computes percentiles, and emits structured output that plugs directly into your observability stack.
pg_stat_statements histograms exposed by postgres_exporter.
Lock Analysis
Locks are the silent killer in multi-writer databases. A transaction holding a row lock while waiting for user input, or an accidental table-level lock from a schema migration, can queue up hundreds of waiting connections in seconds.
lock_timeout in PostgreSQL is 0 (infinite wait). In MySQL the default innodb_lock_wait_timeout is 50 seconds. Both values are dangerously permissive in high-throughput systems. Set lock_timeout = 5s in PostgreSQL at the application level and instrument your connection pool to alert when lock-wait errors spike — they are a leading indicator of a deeper contention problem.
Replication Lag Metrics
Replication lag is the seconds (or bytes) by which a replica trails the primary. It is a critical SLO signal: reads from a lagged replica return stale data, and a replica that cannot keep up will never catch up under sustained write load. Big-tech teams alert on lag at two thresholds — a warning at 10 seconds and a page at 60 seconds — and treat any lag trend above zero as a performance bug.
PostgreSQL replication lag is measured in both bytes (LSN difference) and wall-clock seconds. The authoritative source is pg_stat_replication on the primary:
For MySQL, the canonical lag metric is Seconds_Behind_Master from SHOW SLAVE STATUS — exposed by mysqld_exporter as mysql_slave_status_seconds_behind_master. Treat this value critically: it resets to 0 when the IO thread stops, giving a false green. Always monitor both the lag value and the IO/SQL thread running status.
Query Stats Aggregation with pg_stat_statements and Performance Schema
Ad-hoc slow-query analysis catches individual incidents. Query stats aggregation catches systemic patterns — the query that is fast individually but accounts for 40% of total DB CPU because it runs 50,000 times per minute. This is the difference between firefighting and capacity planning.
PostgreSQL's pg_stat_statements normalizes query fingerprints and accumulates per-query counters across the lifetime of the server. MySQL's Performance Schema events_statements_summary_by_digest serves the same purpose. Both should be queried on a schedule (every 1–5 minutes) and their deltas shipped to your metrics backend.
postgres_exporter (or mysqld_exporter) scrapes engine metrics every 15 seconds. Grafana dashboards surface the four golden signals adapted for databases: error rate (connection failures, deadlocks), latency (p99 query time), traffic (queries/sec), and saturation (connection pool fill, replication lag). Loki ingests slow-query log lines for full-text search. PagerDuty routes critical alerts. This four-layer stack — scrape, store, visualize, alert — is the baseline at every major cloud company.
Building a Minimal Observability Baseline in 30 Minutes
If you are starting from zero on a new cluster, this is the order of operations that delivers 80% of production value fastest:
- Enable
pg_stat_statements(PostgreSQL) or Performance Schema digests (MySQL) — these are safe for production and are typically already on in managed services. - Deploy
postgres_exporterormysqld_exporteras a sidecar or systemd service; point Prometheus at it with a 15s scrape interval. - Import the community Grafana dashboards (ID 9628 for PostgreSQL, ID 7362 for MySQL) as a starting point, then customize thresholds for your traffic profile.
- Set
log_min_duration_statement = 500(Postgres) or enable the slow query log (MySQL); ship logs to Loki or CloudWatch Logs Insights. - Write three alert rules: replication lag critical, connection saturation above 80%, and slow-query rate above your baseline.
EXPLAIN ANALYZE on every new query in a pull request and fails the build if the plan cost exceeds a threshold. For most teams a simpler version works: ship pg_stat_statements deltas to Datadog or Grafana Cloud, and set anomaly-detection alerts on per-fingerprint execution time. The first time you catch a bad index removal in code review rather than a 3 AM page, you will understand why this investment pays off.
Database observability is not a one-time setup — it is a feedback loop. Every incident should end with a new alert rule or a new dashboard panel so the same failure mode is caught automatically next time. Teams that treat observability as a living system catch problems in minutes; teams that treat it as a checkbox catch them in hours or not at all.