Cloud Fundamentals: AWS Core Services

EBS & Instance Storage

Storage is where many production incidents involving EC2 begin. Picking the wrong volume type costs you either money or performance; skipping encryption costs you compliance; neglecting snapshots costs you your data. This lesson covers the dimensions of AWS block storage that a production engineer is responsible for, and the failure modes that separate experienced teams from novices.

EBS Volume Types

Elastic Block Store (EBS) provides network-attached, persistent block devices. Each volume lives in a single Availability Zone and is automatically replicated within that AZ. The five volume types covered here are:

gp3 (General Purpose SSD): The default for almost everything. Baseline 3,000 IOPS and 125 MB/s throughput, independently configurable up to 16,000 IOPS and 1,000 MB/s. Cost-optimized: IOPS and throughput are decoupled from size (unlike gp2). Use this for OS volumes, application servers, databases under moderate load, and CI/CD runners.
gp2 (General Purpose SSD, legacy): Baseline IOPS scale with size at 3 IOPS/GiB (minimum 100, maximum 16,000). Volumes whose baseline is below 3,000 IOPS can burst to 3,000 by spending credits from a burst bucket. Migrating existing gp2 volumes to gp3 is a standard cost-reduction exercise at scale: you get the same or better performance for roughly 20% less.
io2 Block Express (Provisioned IOPS SSD): Up to 256,000 IOPS and 4,000 MB/s per volume. Sub-millisecond latency. Suited to Oracle, SQL Server, and high-throughput PostgreSQL at scale. With Multi-Attach enabled, a single io2 volume can be attached to up to 16 Nitro instances in the same Availability Zone simultaneously, which cluster-aware applications rely on for shared storage in high-availability databases.
st1 (Throughput Optimized HDD): Sequential workloads: Kafka log segments, data lake ingestion, Hadoop. Up to 500 MB/s throughput at a fraction of SSD cost. IOPS ceiling is low; random access is slow. Cannot be used as a boot volume.
sc1 (Cold HDD): Archive. Lowest cost per GB on EBS. Max 250 MB/s. Use for rarely accessed data that must stay block-level (compliance retention, cold backups).

Figure: EBS volume families and their production use cases.
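To make the gp2 sizing rule concrete, here is a minimal sketch that computes a gp2 volume's baseline IOPS (3 IOPS per GiB, floor of 100, ceiling of 16,000) and wraps an in-place conversion to gp3. The function names are hypothetical; `aws ec2 modify-volume` is the real CLI call. The sketch assumes you want the gp3 volume to keep at least the IOPS the gp2 volume had.

```shell
# Baseline IOPS of a gp2 volume: 3 IOPS per GiB, clamped to 100..16000.
gp2_baseline_iops() {
  size_gib=$1
  iops=$((size_gib * 3))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  if [ "$iops" -gt 16000 ]; then iops=16000; fi
  echo "$iops"
}

# Convert a volume to gp3 in place (hypothetical helper).
# gp3 includes 3000 IOPS, so request more only when the gp2 baseline was higher.
migrate_gp2_to_gp3() {
  volume_id=$1
  size_gib=$2
  iops=$(gp2_baseline_iops "$size_gib")
  if [ "$iops" -lt 3000 ]; then iops=3000; fi
  aws ec2 modify-volume --volume-id "$volume_id" --volume-type gp3 --iops "$iops"
}
```

For example, `migrate_gp2_to_gp3 vol-0abc123def456 2000` would request 6,000 IOPS. Larger gp2 volumes can deliver more than gp3's 125 MB/s baseline throughput, so check the old volume's observed throughput and add `--throughput` if needed.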

Instance Store: Ephemeral NVMe

Instance store volumes are NVMe SSDs physically attached to the host that runs the instance. They deliver the highest raw throughput on EC2: some instance types (i4i.metal) expose 30 TB of NVMe with millions of IOPS at sub-100 microsecond latency. The cost: data survives a reboot, but all of it is lost when the instance stops, hibernates, or terminates, or when the underlying disk fails. The data never follows the instance to a new host.

Legitimate production uses: Kafka broker log segments (replicated at the application layer), Cassandra SSTables (replicated across nodes), Elasticsearch warm data, distributed shuffle buffers in Spark. The pattern is always the same: the application layer handles durability; instance store handles speed.

Never place a primary database volume, a message queue spool that is the system of record, or any data without an external replica on instance store. Teams that ignore this rule discover the limitation at 3 AM during an AWS host maintenance event.
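On Nitro instances both EBS and instance store appear as NVMe devices, so it is easy to format the wrong one. The sketch below tells them apart by the controller model string. The function names are hypothetical, and it assumes the model strings that Nitro instances commonly report ("Amazon Elastic Block Store" and "Amazon EC2 NVMe Instance Storage") and a Linux sysfs layout under `/sys/class/nvme`.

```shell
# Map an NVMe controller model string to the kind of storage behind it.
classify_nvme_model() {
  case "$1" in
    *"Elastic Block Store"*) echo "ebs" ;;
    *"Instance Storage"*)    echo "instance-store" ;;
    *)                       echo "unknown" ;;
  esac
}

# Print "<controller> <kind>" for every NVMe controller the kernel exposes.
list_nvme_kinds() {
  for model_file in /sys/class/nvme/nvme*/model; do
    [ -r "$model_file" ] || continue
    ctrl=$(basename "$(dirname "$model_file")")
    echo "$ctrl $(classify_nvme_model "$(cat "$model_file")")"
  done
}
```

Running `list_nvme_kinds` on the instance before creating a filesystem is a cheap guard against putting durable data on an ephemeral device.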

Snapshots

EBS snapshots are incremental, point-in-time backups stored in S3 (managed by AWS — not in your bucket). The first snapshot copies the entire volume; subsequent snapshots copy only changed blocks. Deletion is safe: AWS tracks block references across the chain and never removes a block still referenced by another snapshot.

# Create a snapshot with a descriptive tag
aws ec2 create-snapshot \
  --volume-id vol-0abc123def456 \
  --description "pre-deploy-$(date +%Y%m%d-%H%M)" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Env,Value=prod},{Key=Service,Value=api-db}]'

# List snapshots owned by your account
aws ec2 describe-snapshots --owner-ids self \
  --filters "Name=tag:Service,Values=api-db" \
  --query 'Snapshots[*].[SnapshotId,StartTime,State,VolumeSize]' \
  --output table

# Restore: create a volume from a snapshot in a specific AZ
aws ec2 create-volume \
  --snapshot-id snap-0xyz789 \
  --availability-zone us-east-1a \
  --volume-type gp3 \
  --iops 6000 \
  --throughput 500 \
  --encrypted \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=restored-api-db}]'

For automated lifecycle, use Data Lifecycle Manager (DLM). Define a policy that targets volumes by tag, creates daily snapshots, retains the last 14, and copies to a secondary region for disaster recovery. This replaces manual cron jobs and is the production standard.
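A policy like the one described can be sketched as follows. The tag `Backup=daily`, the 03:00 UTC start time, the target region `us-west-2`, and the helper names are all assumptions for illustration; `aws dlm create-lifecycle-policy` is the real CLI call and needs an IAM role that DLM can assume.

```shell
# Write DLM policy details to the given path: daily snapshots of volumes
# tagged Backup=daily (assumed tag), keep the last 14, and copy each
# snapshot to us-west-2 (assumed DR region) for 14 days.
write_dlm_policy() {
  cat > "$1" <<'EOF'
{
  "ResourceTypes": ["VOLUME"],
  "TargetTags": [{"Key": "Backup", "Value": "daily"}],
  "Schedules": [{
    "Name": "daily-14d",
    "CopyTags": true,
    "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
    "RetainRule": {"Count": 14},
    "CrossRegionCopyRules": [{
      "TargetRegion": "us-west-2",
      "Encrypted": true,
      "RetainRule": {"Interval": 14, "IntervalUnit": "DAYS"}
    }]
  }]
}
EOF
}

# Create the policy from the file written above (hypothetical helper).
create_dlm_policy() {
  role_arn=$1
  policy_file=$2
  aws dlm create-lifecycle-policy \
    --description "Daily EBS snapshots, 14-day retention" \
    --state ENABLED \
    --execution-role-arn "$role_arn" \
    --policy-details "file://$policy_file"
}
```

Keeping the policy details in a file makes the retention and copy rules reviewable in version control before they are applied.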

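Before every AMI bake or major deployment, snapshot your root volume. The snapshot takes seconds to initiate (even on multi-TB volumes), captures the volume as of that moment, and completes in the background. It is often the fastest rollback path available, but a volume restored from a snapshot loads its blocks lazily, so expect higher first-read latency unless Fast Snapshot Restore is enabled.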

Encryption

EBS encryption uses AES-256 with AWS KMS keys. When enabled, all data at rest on the volume, all data in transit between the volume and the instance, and all snapshots derived from the volume are encrypted. Encryption is transparent to the OS — no application changes required.

# Enable account-wide encryption by default (do this in every new account/region)
aws ec2 enable-ebs-encryption-by-default --region us-east-1

# Verify the setting
aws ec2 get-ebs-encryption-by-default --region us-east-1

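# Encrypt an existing unencrypted volume:
# 1. Snapshot the volume and wait for the snapshot to complete
aws ec2 create-snapshot \
  --volume-id vol-0abc123def456 \
  --description "pre-encryption snapshot"

# 2. Copy the snapshot with encryption enabled
#    (snap-0abc123 stands in for the snapshot ID returned by step 1)
aws ec2 copy-snapshot \
  --source-region us-east-1 \
  --source-snapshot-id snap-0abc123 \
  --description "encrypted copy" \
  --encrypted \
  --kms-key-id alias/aws/ebs

# 3. Create a new volume from the encrypted copy in the instance's AZ,
#    then stop the instance, detach the old volume, and attach the new one
#    (snap-0def456 stands in for the snapshot ID returned by step 2)
aws ec2 create-volume \
  --snapshot-id snap-0def456 \
  --availability-zone us-east-1a \
  --volume-type gp3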

Use a Customer Managed Key (CMK) in KMS rather than the AWS-managed key (alias/aws/ebs) in any environment where you need: key rotation control, cross-account snapshot sharing, fine-grained IAM on key usage, or audit trails in CloudTrail per key. The CMK incurs $1/month plus API call costs — trivial against the compliance value.
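A minimal sketch of setting up such a key and making it the account's EBS default in the current region follows. The helper name and the alias passed to it are hypothetical; the four `aws` subcommands are real CLI calls. It assumes the caller has KMS and EC2 permissions and that the default key policy is acceptable, which a production setup would normally tighten.

```shell
# Create a customer managed key, give it an alias, turn on automatic
# rotation, and make it the default key for new EBS volumes in this region.
# Prints the new key ID on success.
setup_ebs_cmk() {
  alias_name=$1   # for example alias/ebs-prod (hypothetical name)
  key_id=$(aws kms create-key \
    --description "EBS encryption key" \
    --query 'KeyMetadata.KeyId' --output text) || return 1
  aws kms create-alias --alias-name "$alias_name" --target-key-id "$key_id" || return 1
  aws kms enable-key-rotation --key-id "$key_id" || return 1
  aws ec2 modify-ebs-default-kms-key-id --kms-key-id "$key_id" >/dev/null || return 1
  echo "$key_id"
}
```

The default-key setting is per region, so the last step has to be repeated wherever volumes are created.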

Enabling encryption by default only affects new volumes. Existing unencrypted volumes remain unencrypted. Audit your account regularly with AWS Config rule encrypted-volumes. In regulated environments (PCI-DSS, HIPAA), unencrypted EBS is a finding.
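For a quick manual audit alongside the Config rule, the following sketch lists unencrypted volumes in every region enabled for the account. The function name is hypothetical; it relies on the `encrypted` filter of `aws ec2 describe-volumes` and assumes the caller may describe regions and volumes.

```shell
# Print "<region> <volume-id>" for every unencrypted EBS volume.
list_unencrypted_volumes() {
  for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
    aws ec2 describe-volumes --region "$region" \
      --filters Name=encrypted,Values=false \
      --query 'Volumes[].VolumeId' --output text |
      tr '\t' '\n' |
      while read -r volume_id; do
        # Skip the empty line produced when a region has no matches.
        if [ -n "$volume_id" ]; then echo "$region $volume_id"; fi
      done
  done
}
```

An empty result means every volume the caller can see is encrypted; anything listed is a candidate for the snapshot, copy, and swap procedure.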

Performance Tuning

EBS performance has two ceilings: the volume limit and the instance's EBS-optimized bandwidth limit. A gp3 at 16,000 IOPS is wasted on a t3.medium, whose EBS bandwidth bursts to 2,085 Mbps (about 260 MB/s) but sustains a much lower baseline of roughly 347 Mbps and 2,000 IOPS. Always match volume configuration to the instance's EBS baseline bandwidth.
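The two-ceiling rule is simple arithmetic: convert the instance limit from megabits to megabytes per second and take the smaller of the two numbers. A minimal sketch with hypothetical function names, using integer math:

```shell
# Convert an instance EBS bandwidth figure from Mbps to MB/s (8 bits per byte).
mbps_to_mbytes() {
  echo $(( $1 / 8 ))
}

# Effective throughput ceiling in MB/s: min(volume MB/s, instance MB/s).
effective_throughput() {
  volume_mbs=$1
  instance_mbs=$(mbps_to_mbytes "$2")
  if [ "$volume_mbs" -lt "$instance_mbs" ]; then
    echo "$volume_mbs"
  else
    echo "$instance_mbs"
  fi
}
```

`effective_throughput 1000 2085` prints 260: a gp3 volume provisioned at 1,000 MB/s behind a 2,085 Mbps instance limit can deliver at most about 260 MB/s, so the extra provisioned throughput is paid for but never used.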

# Check instance EBS bandwidth limits
aws ec2 describe-instance-types \
  --instance-types r6i.2xlarge \
  --query 'InstanceTypes[*].EbsInfo' \
  --output json

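# Benchmark volume IOPS with fio (run on the instance)
# --ioengine=libaio makes --iodepth effective; the default synchronous
# engine issues one I/O at a time per job regardless of iodepth.
# --readonly guards against accidental writes to the raw device.
sudo fio \
  --filename=/dev/nvme1n1 \
  --readonly \
  --ioengine=libaio \
  --direct=1 \
  --rw=randread \
  --bs=4k \
  --numjobs=32 \
  --iodepth=256 \
  --runtime=60 \
  --time_based \
  --group_reporting \
  --name=4k-rand-read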

# Monitor EBS CloudWatch metrics
# VolumeReadOps is a count, so request Sum; divide each 300-second
# datapoint by 300 to get average read IOPS. (date -d is GNU date syntax.)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-0abc123def456 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 \
  --statistics Sum

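Key CloudWatch metrics to alarm on: VolumeQueueLength (a queue that stays well above what the workload needs indicates saturation; a common rule of thumb for SSD volumes is about 1 per 1,000 provisioned IOPS), BurstBalance on gp2 (below 20% means you should migrate to gp3 or increase the volume size), and VolumeIdleTime to identify over-provisioned volumes whose provisioned IOPS or throughput can be reduced.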
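As one example, the BurstBalance alarm could be created along these lines. The helper name, the alarm naming scheme, and the choice of three consecutive 5-minute periods are assumptions; `aws cloudwatch put-metric-alarm` and its flags are the real CLI interface.

```shell
# Alarm when a gp2 volume's BurstBalance averages below 20% for 15 minutes.
create_burst_balance_alarm() {
  volume_id=$1
  sns_topic_arn=$2   # notification target, supplied by the caller
  aws cloudwatch put-metric-alarm \
    --alarm-name "ebs-burst-balance-low-$volume_id" \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions "Name=VolumeId,Value=$volume_id" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 3 \
    --threshold 20 \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$sns_topic_arn"
}
```

Requiring several consecutive periods avoids paging on a single short dip while still catching a bucket that is draining steadily.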

For databases on NVMe-backed volumes, set the Linux I/O scheduler to none (deadline/noop is legacy advice) and increase the read-ahead value for sequential workloads. EBS optimization is enabled by default on all current-generation instances, but verify this when working with older instance types brought forward in a migration.
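Checking and applying these settings might look like the sketch below. The function names are hypothetical, and the device name and read-ahead value are supplied by the caller; it assumes a Linux host where the scheduler is exposed at `/sys/block/<dev>/queue/scheduler` and `blockdev` is installed.

```shell
# Extract the active scheduler from a sysfs line such as
# "[none] mq-deadline kyber" (the active one is in brackets).
active_scheduler() {
  echo "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Set the scheduler to none and the read-ahead (in 512-byte sectors)
# for one block device, for example: tune_block_device nvme1n1 4096
tune_block_device() {
  dev=$1
  readahead_sectors=$2
  echo none | sudo tee "/sys/block/$dev/queue/scheduler" >/dev/null
  sudo blockdev --setra "$readahead_sectors" "/dev/$dev"
}
```

To inspect a device, run `active_scheduler "$(cat /sys/block/nvme1n1/queue/scheduler)"`. Both settings are lost at reboot unless they are also applied by a udev rule or a boot-time unit.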

The two highest-leverage actions for most teams: run aws ec2 enable-ebs-encryption-by-default in every region of every account, and create a DLM policy that snapshots tagged volumes daily with 14-day retention. These two controls, applied at account creation time, prevent the most common storage-related incidents and audit findings.
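Because the encryption default is a per-region setting, the first control is usually applied in a loop. A minimal sketch with a hypothetical function name; it covers only the regions that `aws ec2 describe-regions` returns for the account, which by default are the enabled ones.

```shell
# Enable EBS encryption by default in every enabled region and print
# "<region> <status>" so the result can be checked or logged.
enable_encryption_all_regions() {
  for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
    aws ec2 enable-ebs-encryption-by-default --region "$region" >/dev/null || return 1
    status=$(aws ec2 get-ebs-encryption-by-default --region "$region" \
      --query 'EbsEncryptionByDefault' --output text)
    echo "$region $status"
  done
}
```

Running this once per account, for instance from an account-bootstrap script, also covers regions that nobody on the team uses deliberately.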