EBS & Instance Storage
Storage is where most production incidents involving EC2 begin. Picking the wrong volume type costs you either money or performance; skipping encryption costs you compliance; neglecting snapshots costs you your data. This lesson covers every dimension of AWS block storage that a production engineer is responsible for — and the failure modes that separate experienced teams from novices.
EBS Volume Types
Elastic Block Store (EBS) provides network-attached, persistent block devices. Each volume lives in a single Availability Zone and is automatically replicated within that AZ. The five volume types that matter in production are:
- gp3 (General Purpose SSD): The default for almost everything. Baseline 3,000 IOPS and 125 MB/s throughput, independently configurable up to 16,000 IOPS and 1,000 MB/s. Cost-optimized: IOPS and throughput are decoupled from size (unlike gp2). Use this for OS volumes, application servers, databases under moderate load, and CI/CD runners.
- gp2 (General Purpose SSD, legacy): Baseline scales with size (3 IOPS/GB); smaller volumes burst to 3,000 IOPS from a credit bucket. Migrating existing gp2 volumes to gp3 is a standard cost-reduction exercise at scale: you get the same or better performance for roughly 20% less.
- io2 Block Express (Provisioned IOPS SSD): Up to 256,000 IOPS and 4,000 MB/s per volume. Sub-millisecond latency. The usual choice for Oracle, SQL Server, and high-throughput PostgreSQL at scale. A single io2 Multi-Attach volume can be attached to up to 16 Nitro instances simultaneously, which is critical for cluster-aware storage in high-availability databases.
- st1 (Throughput Optimized HDD): Sequential workloads such as Kafka log segments, data lake ingestion, and Hadoop. Up to 500 MB/s throughput at a fraction of SSD cost. The IOPS ceiling is low and random access is slow. Never use for OS volumes.
- sc1 (Cold HDD): Archive. Lowest cost per GB on EBS. Max 250 MB/s. Use for rarely accessed data that must stay block-level (compliance retention, cold backups).
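To make the gp2 versus gp3 comparison concrete, here is a minimal sketch of the baseline arithmetic. It assumes the figures quoted above (3 IOPS per GB for gp2, a 3,000 IOPS baseline for gp3) plus the gp2 floor of 100 IOPS and ceiling of 16,000 IOPS; the function names are illustrative, not part of any AWS SDK.

```python
GP2_IOPS_PER_GIB = 3        # baseline scaling quoted in the list above
GP2_MIN_IOPS = 100          # gp2 floor for very small volumes
GP2_MAX_IOPS = 16_000       # gp2 ceiling
GP3_BASELINE_IOPS = 3_000   # gp3 baseline, independent of size


def gp2_baseline_iops(size_gib):
    """Baseline IOPS a gp2 volume sustains once its burst credits are spent."""
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GIB * size_gib, GP2_MAX_IOPS))


def gp3_beats_gp2_baseline(size_gib):
    """True when an unmodified gp3 volume has a higher baseline than gp2 of the same size."""
    return GP3_BASELINE_IOPS > gp2_baseline_iops(size_gib)


if __name__ == "__main__":
    for size in (50, 500, 1000, 2000):
        print(size, gp2_baseline_iops(size), gp3_beats_gp2_baseline(size))
```

One consequence worth noting: a gp2 volume larger than 1,000 GB already has a baseline above 3,000 IOPS, so migrating it to gp3 without provisioning extra IOPS would lower its sustained performance.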
Instance Store: Ephemeral NVMe
Instance store volumes are NVMe SSDs physically attached to the host that runs your instance. They deliver the highest raw throughput on EC2: some instance types (i4i.metal) expose 30 TB of NVMe with millions of IOPS at sub-100 microsecond latency. The cost: data survives a reboot, but all of it is lost when the instance stops, hibernates, or is terminated, or when the underlying disk fails. A stop/start cycle can move the instance to a different host, and the contents of the old disks never follow it.
Legitimate production uses: Kafka broker log segments (replicated at the application layer), Cassandra SSTables (replicated across nodes), Elasticsearch warm data, and distributed shuffle buffers in Spark. The pattern is always the same: the application layer handles durability; instance store handles speed.
Snapshots
EBS snapshots are incremental, point-in-time backups stored in S3 (managed by AWS — not in your bucket). The first snapshot copies the entire volume; subsequent snapshots copy only changed blocks. Deletion is safe: AWS tracks block references across the chain and never removes a block still referenced by another snapshot.
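The incremental behaviour and the safe-deletion guarantee can be illustrated with a toy model. This is only a sketch of the idea of reference-counted blocks, not how AWS implements snapshots internally; `SnapshotStore` and its methods are hypothetical names, and a volume is modelled as a dict of block index to bytes.

```python
class SnapshotStore:
    """Toy model of incremental snapshots with reference-counted blocks."""

    def __init__(self):
        self.blocks = {}      # block_id -> stored bytes
        self.refs = {}        # block_id -> number of snapshots referencing it
        self.snapshots = {}   # snapshot name -> {block index: block_id}

    def create(self, name, volume, parent=None):
        """Snapshot `volume`; return how many blocks had to be copied."""
        parent_map = self.snapshots.get(parent, {})
        mapping, copied = {}, 0
        for index, data in volume.items():
            block_id = parent_map.get(index)
            if block_id is None or self.blocks[block_id] != data:
                # Block is new or changed since the parent: store a fresh copy.
                block_id = f"{name}:{index}"
                self.blocks[block_id] = data
                copied += 1
            self.refs[block_id] = self.refs.get(block_id, 0) + 1
            mapping[index] = block_id
        self.snapshots[name] = mapping
        return copied

    def delete(self, name):
        """Drop a snapshot; free only blocks no other snapshot references."""
        for block_id in self.snapshots.pop(name).values():
            self.refs[block_id] -= 1
            if self.refs[block_id] == 0:
                del self.refs[block_id]
                del self.blocks[block_id]

    def restore(self, name):
        """Rebuild the full volume contents from a single snapshot."""
        return {i: self.blocks[b] for i, b in self.snapshots[name].items()}


if __name__ == "__main__":
    store = SnapshotStore()
    volume = {0: b"boot", 1: b"data-v1", 2: b"logs"}
    print(store.create("snap-1", volume))            # full copy of every block
    volume[1] = b"data-v2"
    print(store.create("snap-2", volume, "snap-1"))  # only the changed block
    store.delete("snap-1")
    print(store.restore("snap-2") == volume)         # snap-2 is still complete
```

Deleting the first snapshot frees only the block that the second snapshot replaced, which is why removing an old snapshot in a chain does not break the newer ones.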
For automated lifecycle, use Data Lifecycle Manager (DLM). Define a policy that targets volumes by tag, creates daily snapshots, retains the last 14, and copies to a secondary region for disaster recovery. This replaces manual cron jobs and is the production standard.
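As a rough sketch, the policy described above could be expressed as the `PolicyDetails` document passed to DLM's create-lifecycle-policy call. The field names follow the DLM API structure as best understood here and should be checked against the current API reference; the tag key and value, the snapshot time, the schedule name, and the target region are all hypothetical placeholders.

```python
import json


def build_dlm_policy_details(tag_key, tag_value, retain_count=14, dr_region="us-west-2"):
    """Build a PolicyDetails-shaped dict: daily snapshots, keep N, copy to a DR region."""
    return {
        "ResourceTypes": ["VOLUME"],
        # Volumes are selected by tag, as described in the text.
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [
            {
                "Name": "daily-snapshots",  # hypothetical schedule name
                "CopyTags": True,
                # One snapshot every 24 hours; 03:00 UTC is an arbitrary choice.
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                "RetainRule": {"Count": retain_count},
                "CrossRegionCopyRules": [
                    {
                        "TargetRegion": dr_region,  # assumed DR region
                        "Encrypted": True,
                        "CopyTags": True,
                        "RetainRule": {"Interval": 14, "IntervalUnit": "DAYS"},
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_dlm_policy_details("Backup", "daily"), indent=2))
```

Keeping the policy as data makes it easy to review in version control before handing it to whichever tool (CLI, SDK, or infrastructure-as-code) actually creates it.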
Encryption
EBS encryption uses AES-256 with AWS KMS keys. When enabled, all data at rest on the volume, all data in transit between the volume and the instance, and all snapshots derived from the volume are encrypted. Encryption is transparent to the OS — no application changes required.
Use a Customer Managed Key (CMK) in KMS rather than the AWS-managed key (alias/aws/ebs) in any environment where you need: key rotation control, cross-account snapshot sharing, fine-grained IAM on key usage, or audit trails in CloudTrail per key. The CMK incurs $1/month plus API call costs — trivial against the compliance value.
Detect drift with the AWS Config managed rule encrypted-volumes. In regulated environments (PCI-DSS, HIPAA), an unencrypted EBS volume is an audit finding.
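A small audit helper shows what such a check looks for. This sketch assumes volume records shaped like the entries in the EC2 DescribeVolumes response (`VolumeId`, `Encrypted`, `KmsKeyId`); the function name, the volume IDs, and the key ARNs are made up for illustration, and no AWS call is made here.

```python
def audit_volume_encryption(volumes, approved_key_arns=frozenset()):
    """Split volumes into unencrypted ones and ones encrypted with a non-approved key."""
    unencrypted, wrong_key = [], []
    for vol in volumes:
        if not vol.get("Encrypted", False):
            unencrypted.append(vol["VolumeId"])
        elif approved_key_arns and vol.get("KmsKeyId") not in approved_key_arns:
            # Encrypted, but not with one of the customer managed keys we expect.
            wrong_key.append(vol["VolumeId"])
    return {"unencrypted": unencrypted, "wrong_key": wrong_key}


if __name__ == "__main__":
    # Hypothetical sample data; real records would come from DescribeVolumes.
    cmk = "arn:aws:kms:us-east-1:111122223333:key/example-cmk"
    sample = [
        {"VolumeId": "vol-aaa", "Encrypted": True, "KmsKeyId": cmk},
        {"VolumeId": "vol-bbb", "Encrypted": False},
        {"VolumeId": "vol-ccc", "Encrypted": True,
         "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/other"},
    ]
    print(audit_volume_encryption(sample, approved_key_arns={cmk}))
```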
Performance Tuning
EBS performance has two ceilings: the volume limit and the instance's EBS-optimized bandwidth limit. A gp3 provisioned at 16,000 IOPS and 1,000 MB/s is wasted on a t3.medium, whose EBS bandwidth only bursts to 2,085 Mbps (about 260 MB/s) and whose sustained baseline is far lower. Always match volume configuration to the instance's baseline EBS bandwidth and IOPS limits.
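The two-ceiling rule reduces to taking the minimum of the two limits after converting units. The sketch below uses the 2,085 Mbps figure from the text and a gp3 volume provisioned at 1,000 MB/s; the function names are illustrative.

```python
def mbps_to_mb_per_s(megabits_per_second):
    """Convert an instance bandwidth figure (megabits/s) to megabytes/s."""
    return megabits_per_second / 8


def effective_throughput_mb_s(volume_mb_s, instance_ebs_mbps):
    """Throughput actually achievable: the lower of volume and instance limits."""
    return min(volume_mb_s, mbps_to_mb_per_s(instance_ebs_mbps))


if __name__ == "__main__":
    # gp3 provisioned at 1,000 MB/s behind a 2,085 Mbps instance limit.
    print(effective_throughput_mb_s(1000, 2085))  # 260.625
```

In this example roughly three quarters of the provisioned volume throughput can never be used, which is money spent on nothing.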
Key CloudWatch metrics to alarm on: VolumeQueueLength (a sustained value well above roughly 1 per 1,000 provisioned IOPS indicates saturation), BurstBalance on gp2 (below 20% means you need to migrate to gp3 or resize), and VolumeIdleTime to identify over-provisioned volumes that can be downsized.
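The alarm logic can be sketched as a pure function over metric values already fetched from CloudWatch. It assumes the common rule of thumb of about one outstanding request per 1,000 provisioned IOPS for queue length and the 20% BurstBalance threshold from the text; the function names and alarm labels are hypothetical.

```python
def queue_length_threshold(provisioned_iops, per_iops=1000):
    """Queue depth above which a volume is treated as saturated (rule of thumb)."""
    return provisioned_iops / per_iops


def storage_alarms(metrics, provisioned_iops):
    """Return alarm labels for a dict of averaged CloudWatch metric values."""
    alarms = []
    if metrics.get("VolumeQueueLength", 0) > queue_length_threshold(provisioned_iops):
        alarms.append("queue-saturated")
    # BurstBalance is a percentage and only reported for burstable volumes such as gp2.
    if metrics.get("BurstBalance", 100) < 20:
        alarms.append("burst-credits-low")
    return alarms


if __name__ == "__main__":
    print(storage_alarms({"VolumeQueueLength": 7.5, "BurstBalance": 12}, 3000))
```

In practice these thresholds would be configured on CloudWatch alarms with a sustained evaluation period rather than checked against single data points.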
For databases, always set the Linux I/O scheduler to none (deadline/noop is legacy advice) and increase the read-ahead value for sequential workloads. EBS-optimized is enabled by default on all current-generation instances but verify this when working with older instance types brought forward in a migration.
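To verify the scheduler setting on a running host, one option is to read it from sysfs. The sketch below parses the contents of `/sys/block/<device>/queue/scheduler`, where Linux marks the active scheduler with square brackets; the device name `nvme0n1` is an assumption and will differ per instance.

```python
from pathlib import Path


def active_scheduler(scheduler_line):
    """Return the bracketed (active) entry from e.g. 'mq-deadline kyber [none]'."""
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some devices expose a single unbracketed value such as 'none'.
    return scheduler_line.strip()


def read_scheduler(device):
    """Read the active I/O scheduler for a block device from sysfs (Linux only)."""
    return active_scheduler(Path(f"/sys/block/{device}/queue/scheduler").read_text())


if __name__ == "__main__":
    print(active_scheduler("[none] mq-deadline kyber"))  # none
    sysfs = Path("/sys/block/nvme0n1/queue/scheduler")   # hypothetical device name
    if sysfs.exists():
        print(read_scheduler("nvme0n1"))
```

The read-ahead value lives alongside it in `read_ahead_kb` under the same `queue` directory and can be inspected the same way.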
Run aws ec2 enable-ebs-encryption-by-default in every region of every account, and create a DLM policy that snapshots tagged volumes daily with 14-day retention. These two controls, applied at account creation time, prevent the most common storage-related incidents and audit findings.