Cloud Fundamentals: AWS Core Services

RDS & Managed Databases

18 min Lesson 6 of 30

RDS & Managed Databases

Amazon RDS (Relational Database Service) offloads the undifferentiated heavy lifting of database operations — OS patching, storage provisioning, replication setup, backups, and failover — so your team can concentrate on schema design and query performance. But "managed" does not mean "magic." To run RDS safely at production scale you need to understand its internals: which engine to pick, how Multi-AZ failover actually works, when read replicas help (and when they do not), and how parameter groups let you tune the engine without SSH access.

Supported Engines

RDS supports six database engines. Each engine on RDS is the genuine upstream open-source or commercial binary — not a fork — so your existing expertise, drivers, and ORMs transfer without modification.

  • MySQL — the default choice for web applications. RDS MySQL supports versions 5.7 and 8.0. Choose 8.0 for window functions, CTEs, and better InnoDB defaults.
  • PostgreSQL — preferred at big-tech for its strict SQL compliance, JSONB indexing, logical replication, and extensions. RDS supports pg 13–16. Use pg_stat_statements and auto_explain via parameter groups to diagnose slow queries without shell access.
  • MariaDB — a MySQL-compatible fork. Its thread-pool connection model makes it attractive for high-concurrency workloads that overwhelm MySQL single-thread-per-connection design.
  • Oracle and SQL Server — available for enterprise license-portability scenarios (Bring Your Own License or License Included). Avoid for greenfield work; the managed feature set lags behind the open-source engines.
  • Amazon Aurora — AWS cloud-native engine with MySQL- and PostgreSQL-compatible endpoints. It replaces the on-disk storage layer with a distributed, self-healing 6-copy storage volume spread across three AZs. Aurora lives under the RDS console but is a distinct product. At scale, Aurora storage I/O eliminates replication lag for reads and reduces recovery time from minutes to under 30 seconds.
For new production services at scale, default to Aurora PostgreSQL. It combines the richest SQL feature set (PostgreSQL) with Aurora storage architecture: no binlog-based replication, no storage-limit headaches, near-zero failover time. Fall back to standard RDS PostgreSQL only when you need extensions Aurora does not support, for example TimescaleDB.

Multi-AZ vs. Read Replicas

These two features are frequently confused because both involve a second database instance. They solve entirely different problems.

Multi-AZ is a high-availability mechanism. AWS provisions a synchronous standby replica in a different Availability Zone. Every write to the primary is synchronously committed to the standby before the write returns success to the application. If the primary fails — hardware fault, AZ outage, OS crash — RDS updates the DNS CNAME for your endpoint to point to the standby within 60–120 seconds. Your application reconnects to the same hostname and resumes. The standby is invisible to your application during normal operation; it is never used for reads. You pay for it in full to have it idle and ready. This is the correct trade-off for production: reliability over cost.

Read Replicas are a read-scaling mechanism. RDS uses asynchronous replication (binary log for MySQL/MariaDB, streaming replication for PostgreSQL) to ship writes from the primary to one or more replicas. Each replica gets its own DNS endpoint and can serve SELECT queries, offloading the primary. Replication is asynchronous, so replicas can lag behind the primary by milliseconds to seconds under heavy write load. Reads from a replica are eventually consistent, not strongly consistent. Read replicas are not automatic failover targets by default — you must manually promote one and update your connection strings. You can create up to 5 read replicas per RDS instance (15 for Aurora).

Multi-AZ Standby vs Read Replicas AZ-A (us-east-1a) AZ-B (us-east-1b) Application single DB endpoint Primary Reads + Writes Multi-AZ Standby No app traffic (idle) SYNC R/W (normal op) failover path Read Replicas — async, read scale-out Read Replica 1 SELECT only · AZ-A Read Replica 2 SELECT only · AZ-B ASYNC read traffic Legend Synchronous (Multi-AZ) Asynchronous (replica) Failover (auto DNS flip) Primary / App Read Replica Multi-AZ Standby
Multi-AZ standby (synchronous, no traffic) coexisting with two read replicas (asynchronous, SELECT traffic) on the same primary.
A common production mistake: routing reporting queries to a read replica and assuming the data is strongly consistent. It is not. A write committed one second ago on the primary may not yet be visible on the replica. Any code path that requires read-after-write consistency — for example, showing a record immediately after creating it — must always read from the primary endpoint.

Automated Backups and Snapshots

RDS provides two distinct backup mechanisms. Understanding the difference is critical for defining your RTO and RPO.

Automated Backups are enabled by default (retention 1–35 days). RDS takes a daily storage-volume snapshot during the configured backup window and continuously archives transaction logs to S3. This combination enables point-in-time recovery (PITR): you can restore your database to any second within the retention window. Restoration creates a new DB instance; it does not restore in place.

Manual Snapshots are created explicitly via console, CLI, or automation. They persist until you delete them, surviving even instance deletion. Use them before schema migrations, major version upgrades, or any destructive operation. They do not enable PITR — they are a full volume snapshot at a single point in time.

# Create a manual snapshot before a risky migration aws rds create-db-snapshot \ --db-instance-identifier prod-postgres \ --db-snapshot-identifier prod-postgres-pre-migration-2025-07-01 # List all snapshots for an instance aws rds describe-db-snapshots \ --db-instance-identifier prod-postgres \ --query 'DBSnapshots[*].{ID:DBSnapshotIdentifier,Status:Status,Time:SnapshotCreateTime}' \ --output table # Point-in-time restore — always creates a NEW instance, not in-place aws rds restore-db-instance-to-point-in-time \ --source-db-instance-identifier prod-postgres \ --target-db-instance-identifier prod-postgres-restored \ --restore-time 2025-07-01T03:00:00Z \ --db-instance-class db.r7g.xlarge \ --multi-az # After restore: validate data, then cut over via Route 53 CNAME # or update the endpoint in AWS Parameter Store — never hardcode endpoints
Set your automated backup retention to at least 7 days in production, and test PITR quarterly. Many teams discover during an actual incident that their restore procedure is broken or takes far longer than their RTO allows. Restoration drills are as important as the backups themselves.

Parameter Groups: Engine Tuning Without Shell Access

A parameter group is a named collection of engine configuration variables (the equivalent of my.cnf or postgresql.conf) that you attach to an RDS instance or cluster. Because you have no OS access on RDS, parameter groups are the only way to tune the database engine.

Parameters fall into two categories. Static parameters require an instance reboot to take effect (for example, max_connections, shared_buffers). Dynamic parameters apply immediately without a reboot (for example, log_min_duration_statement, work_mem). RDS ships a default parameter group for every engine version; the default group is immutable — you must create a custom group to change anything.

# Create a custom parameter group for PostgreSQL 16 aws rds create-db-parameter-group \ --db-parameter-group-name prod-pg16-custom \ --db-parameter-group-family postgres16 \ --description "Production PostgreSQL 16 tuning" # Apply tuning parameters (dynamic — no reboot required) aws rds modify-db-parameter-group \ --db-parameter-group-name prod-pg16-custom \ --parameters \ "ParameterName=log_min_duration_statement,ParameterValue=1000,ApplyMethod=immediate" \ "ParameterName=work_mem,ParameterValue=65536,ApplyMethod=immediate" \ "ParameterName=random_page_cost,ParameterValue=1.1,ApplyMethod=immediate" # Apply static parameter (requires reboot) aws rds modify-db-parameter-group \ --db-parameter-group-name prod-pg16-custom \ --parameters \ "ParameterName=shared_buffers,ParameterValue={DBInstanceClassMemory/4},ApplyMethod=pending-reboot" # Attach the parameter group to the instance aws rds modify-db-instance \ --db-instance-identifier prod-postgres \ --db-parameter-group-name prod-pg16-custom \ --apply-immediately # Verify the pending reboot parameters aws rds describe-db-parameters \ --db-parameter-group-name prod-pg16-custom \ --source user \ --query 'Parameters[*].{Name:ParameterName,Value:ParameterValue,Apply:ApplyType}'

Key parameters worth tuning for PostgreSQL production workloads:

  • shared_buffers — RDS formula {DBInstanceClassMemory/4} sets this to 25% of RAM, the standard starting point. Static; requires reboot.
  • work_mem — memory per sort/hash operation. Default 4 MB is usually too low for analytical queries. Raise cautiously: this multiplies per query per connection. Dynamic.
  • log_min_duration_statement — log any query exceeding N milliseconds. Set to 1000 initially, lower to 200 once you have solved the obvious slow queries. Dynamic.
  • random_page_cost — set to 1.1 on SSD-backed gp3/io1 storage (the default 4.0 was designed for spinning disks and causes the planner to over-prefer sequential scans on SSDs). Dynamic.
  • max_connections — static. On large instances, use PgBouncer or RDS Proxy in front of RDS instead of raising this value; connection overhead is significant in PostgreSQL.
Parameter group changes to static parameters mark the instance as "pending reboot." The instance continues running the old values until you reboot it. Schedule the reboot during a maintenance window and verify the change took effect with SHOW shared_buffers; after the reboot — do not assume the change applied.

Storage: gp3 vs. io1 vs. io2

RDS storage performance directly affects database latency. Choose based on your IOPS and throughput requirements, not instance size alone.

  • gp3 — general-purpose SSD. 3,000 IOPS and 125 MB/s baseline, upgradeable to 16,000 IOPS and 1,000 MB/s independently of storage size. The right default for most production workloads.
  • io2 — provisioned IOPS SSD. Up to 256,000 IOPS with 99.999% durability SLA. Use for demanding OLTP workloads where latency must be sub-millisecond under sustained load.
  • Aurora — storage is managed separately; you do not provision IOPS. Aurora's distributed storage auto-scales to 128 TiB and handles its own I/O. This is one of Aurora's strongest advantages over RDS.
For cost-efficiency: migrate existing RDS instances from gp2 to gp3. On gp2, IOPS scale linearly with storage size (3 IOPS/GiB), meaning teams over-provision storage just to hit an IOPS target. With gp3 you provision IOPS independently — typically 40–50% cheaper for the same performance. Run aws rds describe-db-instances --query 'DBInstances[*].{ID:DBInstanceIdentifier,Storage:StorageType}' to audit your fleet.

Provisioning with the CLI: End-to-End Example

The following creates a production-grade Multi-AZ PostgreSQL instance with a custom parameter group and automated backups already configured.

# 1. Create a dedicated DB subnet group (spans at least two AZs) aws rds create-db-subnet-group \ --db-subnet-group-name prod-db-subnets \ --db-subnet-group-description "Private subnets for RDS" \ --subnet-ids subnet-0abc1234 subnet-0def5678 # 2. Create the instance aws rds create-db-instance \ --db-instance-identifier prod-postgres \ --db-instance-class db.r7g.xlarge \ --engine postgres \ --engine-version 16.2 \ --master-username dbadmin \ --master-user-password "$(aws secretsmanager get-secret-value \ --secret-id prod/rds/master-password --query SecretString --output text)" \ --allocated-storage 200 \ --storage-type gp3 \ --iops 6000 \ --storage-encrypted \ --kms-key-id alias/prod-rds \ --multi-az \ --db-subnet-group-name prod-db-subnets \ --vpc-security-group-ids sg-0123456789abcdef0 \ --db-parameter-group-name prod-pg16-custom \ --backup-retention-period 14 \ --preferred-backup-window "02:00-03:00" \ --preferred-maintenance-window "sun:04:00-sun:05:00" \ --enable-performance-insights \ --performance-insights-retention-period 7 \ --deletion-protection \ --no-publicly-accessible \ --tags Key=Environment,Value=prod Key=Team,Value=platform # 3. Create a read replica in a second region for disaster recovery aws rds create-db-instance-read-replica \ --db-instance-identifier prod-postgres-replica-us-west-2 \ --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-postgres \ --db-instance-class db.r7g.large \ --region us-west-2 \ --storage-encrypted \ --kms-key-id alias/prod-rds-west

Important flags to note: --storage-encrypted enables AES-256 encryption at rest (mandatory for compliance); --deletion-protection prevents accidental instance deletion via console or CLI; --no-publicly-accessible ensures the instance has no public IP and is reachable only within your VPC; --enable-performance-insights gives you a free 7-day query performance dashboard that surfaces wait events and top SQL.

Never embed database passwords in CLI commands or application config files. Store credentials in AWS Secrets Manager and rotate them automatically. Use RDS native password rotation with Secrets Manager — AWS handles the rotation Lambda, the new password generation, and the atomic credential swap without downtime.