Event-Driven Architecture Operations
Event-driven architecture (EDA) shifts the reliability contract from synchronous request-response to asynchronous message delivery. In a serverless context this means your Lambda function is no longer the server — the event broker is. That distinction has profound operational consequences. A synchronous HTTP call fails immediately and visibly; a failed event can sit silently in a queue, retry invisibly, duplicate, or arrive out of order. Understanding how to operate these systems with production discipline — dead-letter queues, retry policies, idempotency contracts, and ordering guarantees — is the difference between an EDA that scales elegantly and one that corrupts data under load.
Dead-Letter Queues: Your Operational Safety Net
A dead-letter queue (DLQ) is the destination for events that exhausted their retry budget without successful processing. It is not an error log — it is a recoverable queue of unprocessed work. At Amazon scale, DLQs are non-negotiable: every async Lambda event source (SQS, SNS, EventBridge, Kinesis, DynamoDB Streams) must have a configured DLQ, and the DLQ itself must be monitored. An alarm-free DLQ that silently fills means orders are not fulfilling, payments are not posting, and inventory counts are drifting.
Configure a DLQ on an SQS event source mapping and wire an alarm in one Terraform block:
The ReportBatchItemFailures response type is critical at production scale. Without it, if a batch of 10 SQS messages is processed and message 7 fails, Lambda reports the entire batch as failed and all 10 messages become visible again. With partial batch failure reporting, only message 7 is returned to the queue:
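A minimal sketch of a handler that reports partial batch failures, assuming the event source mapping has ReportBatchItemFailures enabled. The `process_record` function and its `fail` flag are hypothetical stand-ins for real business logic; the `batchItemFailures` / `itemIdentifier` response shape is the one Lambda expects for SQS.

```python
import json


def process_record(body: dict) -> None:
    """Hypothetical business logic: raises to signal a processing failure."""
    if body.get("fail"):
        raise ValueError("simulated processing failure")


def handler(event, context=None):
    """Process each SQS record independently and report only the failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Only this message ID is returned to the queue for retry.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": failures}
```

In a batch of 10 where only message 7 raises, the response contains a single entry for message 7 and the other nine are deleted from the queue.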
Retry Policies: Designing for Inevitable Failure
Every async event invocation in AWS has a configurable retry policy. The defaults are designed for correctness, not cost — understanding them is essential before they fire in production.
- Async Lambda invocations (SNS, S3, EventBridge): AWS retries twice with exponential backoff before sending to the Lambda destination or DLQ. Total retry window is up to 6 hours. Use aws lambda put-function-event-invoke-config to tune this.
- SQS: controlled by the queue's maxReceiveCount (redrive policy). Each failed receive increments the count. A message stuck in processing for longer than the visibility timeout increments the count without the Lambda reporting failure: a silent retry burn.
- Kinesis / DynamoDB Streams: retries until success or data expiry (default 24 hours for Kinesis, 24 hours for DynamoDB Streams). A poison-pill record that always fails will block the shard for the full retention period. Configure bisectBatchOnFunctionError and destinationConfig.onFailure to isolate and route failures.
- EventBridge Pipes: configurable retry attempts (0–185) and maximum age (60 seconds–24 hours). Enrichment failures do not retry; filter mismatches are dropped silently.
Tune the async Lambda event invoke config for any function triggered by SNS or S3:
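One way to do this from Python is sketched below, assuming boto3 is available and AWS credentials are configured. The function name, the ARN, and the chosen values are placeholders; `build_event_invoke_config` and `apply_event_invoke_config` are hypothetical helper names. The keyword arguments mirror the parameters of the Lambda `put_function_event_invoke_config` API call.

```python
def build_event_invoke_config(function_name, max_retries=1,
                              max_age_seconds=3600, on_failure_arn=None):
    """Build validated kwargs for put_function_event_invoke_config."""
    if not 0 <= max_retries <= 2:
        raise ValueError("async invocations allow 0 to 2 retries")
    if not 60 <= max_age_seconds <= 21600:
        raise ValueError("maximum event age must be 60 to 21600 seconds")
    kwargs = {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_age_seconds,
    }
    if on_failure_arn:
        # Events that exhaust retries or age out are sent here.
        kwargs["DestinationConfig"] = {"OnFailure": {"Destination": on_failure_arn}}
    return kwargs


def apply_event_invoke_config(kwargs):
    """Send the config to AWS (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder can be used without it
    return boto3.client("lambda").put_function_event_invoke_config(**kwargs)
```

Keeping the builder separate from the AWS call makes the limits easy to unit-test before anything is applied to a real function.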
For retries you implement yourself, use exponential backoff with full jitter: before each attempt, sleep for random.uniform(0, min(cap, base * 2 ** attempt)) seconds. Without jitter, a thundering herd of retries after a downstream outage re-saturates the recovering service simultaneously. This is a well-documented failure mode at Netflix, Amazon, and Google.
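The jitter formula can be wrapped in a small retry helper. This is a sketch under assumed defaults: the `base`, `cap`, and `max_attempts` values are illustrative, and `retry_with_jitter` is a hypothetical helper name, not part of any AWS SDK.

```python
import random
import time


def full_jitter_delay(attempt, base=0.1, cap=20.0):
    """Random delay in [0, min(cap, base * 2 ** attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def retry_with_jitter(fn, max_attempts=5, base=0.1, cap=20.0, sleep=time.sleep):
    """Call fn until it succeeds, sleeping a jittered delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
```

The `sleep` parameter is injectable so tests can record delays instead of actually waiting.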
Idempotency: The Contract That Makes Retries Safe
Retries are only safe if your handlers are idempotent: processing the same event twice produces the same side effects as processing it once. At production scale this is not optional — AWS itself documents "at-least-once delivery" for every async event source. The question is not if your handler receives a duplicate; it is when.
There are three standard idempotency patterns used in production EDA systems:
- Idempotency key in the datastore: write the event ID to a DynamoDB table with a conditional put. If the item already exists, skip processing. This is the most reliable pattern and survives Lambda restarts.
- AWS Lambda Powertools idempotency: a DynamoDB-backed decorator that handles the conditional write, in-progress locking, and TTL expiry with three lines of code.
- Idempotent operations at the target: upserts (INSERT ... ON CONFLICT DO UPDATE in PostgreSQL; UpdateItem with a conditional expression in DynamoDB) are naturally idempotent for the record update itself, though side effects (sending an email, publishing a downstream event) still need the key pattern.
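The first pattern in the list above can be sketched without AWS access. Here an in-memory class stands in for the DynamoDB table (an assumption made so the example runs locally); against DynamoDB, `put_if_absent` would correspond to a PutItem call with the condition expression `attribute_not_exists(id)`. The names `InMemoryIdempotencyStore` and `handle_once` are hypothetical.

```python
class InMemoryIdempotencyStore:
    """Local stand-in for a DynamoDB table with a conditional put."""

    def __init__(self):
        self._seen = set()

    def put_if_absent(self, key):
        """Return True if the key was newly recorded, False if it already existed."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def handle_once(event, store, side_effect):
    """Run side_effect for an event ID at most once per store."""
    if not store.put_if_absent(event["id"]):
        return "skipped"  # duplicate delivery: the work was already claimed
    side_effect(event)
    return "processed"
```

This simple version records the key before doing the work, so a crash between the two steps would leave the event marked as handled. Tracking an in-progress state, as the Powertools utility does, is one way to close that gap.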
The DynamoDB table for the idempotency store needs a TTL attribute and a hash key of id (the SHA-256 of the event key). Provision it with on-demand capacity — bursts of retries cause bursty writes, and provisioned capacity here will throttle and defeat the purpose.
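A sketch of how an item for that table might be built. The `id` attribute follows the text; the TTL attribute name `expires_at`, the 24-hour default, and the JSON canonicalisation of the event key are assumptions for illustration.

```python
import hashlib
import json
import time


def idempotency_item(event_key, ttl_seconds=86400, now=None):
    """Build an idempotency record: SHA-256 id plus an epoch-seconds TTL."""
    # Sort keys so the same logical event key always hashes to the same id.
    canonical = json.dumps(event_key, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    now = int(time.time()) if now is None else now
    return {"id": digest, "expires_at": now + ttl_seconds}
```

DynamoDB TTL expects the expiry as a Unix epoch timestamp in seconds, which is why the item stores an integer rather than an ISO date string.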
Ordering: When Sequence Matters
EDA systems have a spectrum of ordering guarantees. Choosing the wrong event source for a use case that requires ordering is a design defect that appears only under load.
For SQS FIFO, the message group ID is the ordering unit — all messages with the same group ID are processed in strict FIFO order by a single Lambda concurrency slot. This means a slow or failing message in one group does not block other groups, but concurrency within a group is always 1. For an order management system this is exactly right: order ID as the group ID ensures all state transitions for a given order (placed → paid → fulfilled → shipped) are processed sequentially without blocking unrelated orders.
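As an illustration of the order-ID-as-group-ID choice, the helper below builds the arguments for an SQS `send_message` call. The queue URL, the payload shape, and the `sequence` argument used for deduplication are assumptions; `order_event_message` is a hypothetical helper name.

```python
import json


def order_event_message(queue_url, order_id, transition, sequence):
    """Build send_message kwargs for one order state transition on a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"order_id": order_id, "transition": transition}),
        # All transitions for one order share a group, so they stay in order.
        "MessageGroupId": order_id,
        # Distinct per transition, so a producer retry is deduplicated by SQS.
        "MessageDeduplicationId": f"{order_id}:{sequence}",
    }
```

With boto3 the result could be passed as `sqs.send_message(**order_event_message(...))`. Messages for different orders carry different group IDs and can be processed concurrently.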
Production Failure Modes in EDA Operations
Three failure patterns appear at scale that do not surface in staging environments:
- The poison-pill event on Kinesis: a malformed record that always throws an exception will block its shard until the record ages out of the stream's retention period (24 hours–365 days). Configure bisectBatchOnFunctionError: true and a destinationConfig.onFailure (SQS DLQ) on the Kinesis event source mapping, together with a finite retry limit, because by default Lambda keeps retrying a failing batch until its records expire. The bisect option cuts the failing batch in half recursively until the single poison record is isolated and routed to the DLQ, and shard processing resumes for the rest of the stream.
- Clock-skew ordering failures: events published from multiple producers with wall-clock timestamps are not reliably ordered by timestamp in SQS Standard or EventBridge, because clock skew between producers can be 100 ms or more and SQS does not re-sort. For events where order matters, use a monotonic sequence number from the source database (e.g. Postgres xmin, DynamoDB stream sequence number), not a client-generated timestamp.
- Idempotency table becoming a hot partition: if all events for a high-traffic entity map to the same DynamoDB partition key (e.g. a global idempotency table with a flat hash key), you will hit the 1,000 write/second per-partition limit. Shard the idempotency table by prefixing the key with a 0–9 digit. Derive the digit deterministically from the event ID (for example, a hash modulo 10) rather than drawing it at random, so that a retried event still maps to the same item and the duplicate check keeps working. Use a global secondary index if you need point-in-time lookup.
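The key-sharding idea from the last item can be sketched as follows. Note one deliberate choice: this sketch derives the shard digit from a hash of the event ID instead of picking it at random, so that a duplicate delivery of the same event produces the same key and is still caught by the conditional write. The `shard#entity` key layout and the function name are assumptions.

```python
import hashlib


def sharded_partition_key(entity_id, event_id, shards=10):
    """Spread one entity's events across `shards` partition keys, deterministically."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer to pick a stable shard.
    shard = int.from_bytes(digest[:8], "big") % shards
    return f"{shard}#{entity_id}"
```

Because the shard depends only on the event ID, writes for a busy entity fan out across up to ten partition keys while each individual event always resolves to exactly one of them.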
Operating event-driven serverless systems at scale demands a different mental model than operating synchronous APIs. Failures are deferred and silent; retries are automatic and hidden; duplicates are expected, not exceptional. The engineers who run EDA systems well treat the DLQ not as an alarm but as a first-class operational surface — they have runbooks for every DLQ, metrics for retry rates and idempotency cache hit rates, and they practice DLQ redrive in game days before they need it in an incident.