Serverless & Event-Driven Operations

Lambda in Production

AWS Lambda is not a toy. Netflix, Amazon itself, Nordstrom, and hundreds of other companies route billions of production invocations through it every day. But moving from a working demo to a function that survives real traffic requires understanding four non-negotiable dimensions: runtimes, memory and CPU allocation, timeout strategy, and concurrency models. Getting any one of these wrong at scale means silent performance degradation, runaway costs, or hard outages at 2 AM.

Runtimes: What Runs Where

Lambda supports two categories of runtimes. Managed runtimes (Node.js 20/22, Python 3.12/3.13, Java 21, .NET 8, Ruby 3.3) are maintained by AWS, which applies patches and minor-version updates automatically. Custom runtimes built on the OS-only provided.al2023 base let you bring any binary that implements the Lambda Runtime API bootstrap contract. Go belongs to this second category: it no longer has a managed runtime of its own and is deployed as a compiled bootstrap binary on provided.al2023.
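To make the bootstrap contract concrete, here is a minimal sketch of one iteration of a custom runtime loop, written in Python purely for illustration. The endpoint paths and the Lambda-Runtime-Aws-Request-Id header come from the Runtime API; the function names (run_once, serve_forever), the single-argument handler, and the injectable urlopen parameter are assumptions made for this sketch, not part of any AWS library.

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"  # Runtime API version used in the endpoint paths


def _base_url(api_host):
    return f"http://{api_host}/{API_VERSION}/runtime/invocation"


def run_once(api_host, handler, urlopen=urllib.request.urlopen):
    """Fetch one event, run the handler, and post the result or the error.

    `handler` takes only the event here; a real runtime would also build a
    context object. `urlopen` is injectable so the loop can be tested offline.
    """
    # Long-poll for the next invocation; the request ID arrives in a header.
    with urlopen(f"{_base_url(api_host)}/next") as resp:
        request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
        event = json.loads(resp.read())

    try:
        body = json.dumps(handler(event)).encode()
        target = f"{_base_url(api_host)}/{request_id}/response"
    except Exception as exc:  # report handler failures instead of crashing
        body = json.dumps(
            {"errorMessage": str(exc), "errorType": type(exc).__name__}
        ).encode()
        target = f"{_base_url(api_host)}/{request_id}/error"

    request = urllib.request.Request(target, data=body, method="POST")
    with urlopen(request):
        pass
    return request_id


def serve_forever(handler):
    """Entry point a bootstrap script would call inside Lambda."""
    api_host = os.environ["AWS_LAMBDA_RUNTIME_API"]  # set by the Lambda service
    while True:
        run_once(api_host, handler)
```

A production runtime would also propagate the deadline and trace headers and report initialization errors; this sketch covers only the fetch, handle, respond cycle.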

For production, the runtime choice drives two things: cold-start latency and your team's operational familiarity. Concurrency limits are the same for every runtime. As rough guides, Python and Node.js cold-start in ~100–400 ms; Java lands around 500–900 ms with class-data sharing (CDS) but without SnapStart; Java 21 with SnapStart can reach sub-100 ms by restoring from a snapshot taken after JVM initialization. Go (via provided.al2023) cold-starts in ~10–50 ms, which makes it a common choice for latency-critical functions.

SnapStart (Java 21): When you publish a new version with SnapStart enabled, Lambda initializes the function, takes a Firecracker microVM snapshot, and restores from that snapshot on cold starts. Encryption is automatic. The trade-off: every restored environment starts from the same memory image, so initialization code must be snapshot-safe. Do not open database connections, seed random number generators, or generate unique IDs in static initializers, because that state would be stale or shared across every environment restored from the snapshot. Re-establish it after restore instead.

In every runtime, the execution environment is a Firecracker microVM with an Amazon Linux 2023 root. The environment is reused across warm invocations, but you must treat it as ephemeral: anything written to /tmp (512 MB by default, up to 10 GB) may persist for the lifetime of the environment but disappears when the environment is recycled. Do not use /tmp as a durable store.

Memory, CPU, and the Right Sizing Problem

Lambda's resource model is deliberately simple: you configure one number — memory — from 128 MB to 10,240 MB in 1 MB increments. CPU is allocated proportionally: 1,769 MB of memory gives you exactly one full vCPU; 3,538 MB gives you two; 10,240 MB gives you six. There is no independent CPU knob.

This creates a right-sizing problem that trips up many teams. Increasing memory from 512 MB to 1,024 MB doubles the price of each millisecond of execution (the GB-second rate itself is fixed), but it also roughly halves wall-clock duration for CPU-bound workloads. The result is often the same or lower total cost with p99 latency cut roughly in half. The only way to find the optimum is to measure.
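The arithmetic behind that trade-off is easy to check. The sketch below computes the duration charge for one invocation as memory (GB) × duration (s) × rate. The rate constant is an assumed x86 on-demand price and the durations are made-up illustrative numbers; per-request charges are ignored. Substitute your region's pricing and your own measurements.

```python
import math

# Assumed on-demand x86 rate in USD per GB-second; check current regional pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
    """Duration charge for a single invocation (excludes the per-request fee)."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price


# Illustrative CPU-bound function: doubling memory halves the duration.
small = invocation_cost(512, 800)    # 0.4 GB-seconds
large = invocation_cost(1024, 400)   # also 0.4 GB-seconds

print(f"512 MB  @ 800 ms: ${small:.10f}")
print(f"1024 MB @ 400 ms: ${large:.10f}")
print("same cost:", math.isclose(small, large))
```

If the larger setting more than halves the duration, it is cheaper as well as faster; if the function is I/O-bound and the duration barely moves, the larger setting simply costs more.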

AWS Lambda Power Tuning: Run the open-source Lambda Power Tuning Step Functions state machine against your function. It fires N invocations at each memory level you specify between 128 MB and 10 GB and plots cost against speed, so you can pick the cheapest, fastest, or best-balanced setting. Run it against realistic payloads, not synthetic ones. At companies like Capital One and Expedia, this tool consistently finds a 20–40% cost reduction with equal or better latency.

For I/O-bound functions (waiting on DynamoDB, S3, downstream APIs), the proportional CPU benefit largely disappears, and 512–1,024 MB is typically sufficient. For CPU-bound workloads — image processing, ML inference, cryptographic operations, zip/gzip in-memory — scaling memory to 3,008 MB or higher often makes economic sense.

[Figure: Lambda memory-to-CPU allocation and invocation flow. Panels: Memory Config, Firecracker MicroVM, Response, Concurrency Plane, Timeout.]
Lambda execution environment: memory drives CPU allocation; concurrency and timeout are independent control planes.

Timeouts: A Contract, Not a Safety Net

Lambda timeouts range from 1 second to 15 minutes. Teams routinely set them to the maximum "just in case," which is one of the most expensive mistakes in serverless operations. A Lambda function that hangs waiting on a downstream service that is down will hold its concurrency slot for the full 15 minutes, blocking all other invocations from that reserved pool and silently accumulating GB-second charges.

The correct mental model is that a timeout is a contract: it defines the worst-case acceptable duration for the function's business logic. Set it to roughly 2–3× your p99 measured duration in production. If your DynamoDB read typically completes in 12 ms and p99 is 80 ms, a 2-second timeout is generous. If your function coordinates external calls, use the AWS SDK's built-in connectTimeout and socketTimeout (or equivalent per-SDK call timeout) to fail fast inside the function — do not rely on the Lambda timeout as the only circuit breaker.
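One way to fail fast inside the function is to derive each downstream call's timeout from the time the invocation has left. The sketch below uses context.get_remaining_time_in_millis(), which the Python runtime's context object provides; the helper name, the 2-second cap, and the 500 ms reserve are assumptions chosen for illustration.

```python
def call_timeout_seconds(context, per_call_cap_s=2.0, reserve_ms=500):
    """Timeout for the next downstream call, in seconds.

    Never exceeds `per_call_cap_s`, and always leaves `reserve_ms` before the
    Lambda timeout so the handler can log, clean up, and return an error.
    """
    remaining_ms = context.get_remaining_time_in_millis() - reserve_ms
    if remaining_ms <= 0:
        # Refuse to start work that the Lambda timeout would cut off anyway.
        raise TimeoutError("not enough time left to start a downstream call")
    return min(per_call_cap_s, remaining_ms / 1000.0)
```

The returned value would be passed to whatever timeout setting your client exposes, for example the connect and read timeouts of the SDK or HTTP library in use, so a dead dependency surfaces as a handled error well before the Lambda timeout fires.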

Async invocations and timeouts: For functions triggered asynchronously (SNS, S3 events, EventBridge), a timeout counts as a function error, so Lambda retries it. By default Lambda makes up to three attempts in total (the initial invocation plus two retries; the retry count is configurable from 0 to 2), and each retry can burn another full timeout's worth of duration. Once the attempts are exhausted, the event is discarded. Always pair async functions with a Dead Letter Queue (DLQ) or an onFailure destination so timed-out events do not vanish silently.

For long-running orchestration workloads (ETL, ML batch scoring, document processing), the right answer is usually Step Functions rather than a single 15-minute Lambda. Standard workflows can run for up to one year, while Express workflows are limited to five minutes. Each Lambda step stays short, retries are explicit, and state is visible in the console.

Concurrency Models

Lambda concurrency is the number of in-flight invocations at any moment. Every AWS account starts with a regional limit of 1,000 concurrent executions (soft limit; service quota increase requests are routinely approved to 10,000+). Three levers control how that pool is allocated:

  1. Unreserved concurrency: The default. All functions in a region share the pool. A spike on one function can starve another: the classic "noisy neighbor" problem in an account shared by many functions.
  2. Reserved concurrency: Assigns a hard cap to a specific function (e.g., ReservedConcurrentExecutions: 50). Invocations beyond the cap receive a throttle error (HTTP 429). Use it to protect downstream systems from being overwhelmed and to guarantee capacity for critical functions.
  3. Provisioned concurrency: Pre-initializes N execution environments so they are warm and ready to serve requests without cold-start latency. You pay for provisioned environments whether they are invoked or not: a per-GB-second charge for the configured concurrency, plus a reduced duration rate for the invocations they serve. It is therefore economical only if traffic is sustained enough to keep those environments busy. Use Application Auto Scaling to scale provisioned concurrency with a scheduled action or a target-tracking policy tied to the ProvisionedConcurrencyUtilization metric.

The concurrency scaling rate: Lambda does not scale instantly. Under the older burst model, a function could jump to an initial burst of 500–3,000 concurrent executions depending on region (3,000 in us-east-1) and then add 500 environments per minute. Since late 2023, AWS documents a per-function scaling rate of 1,000 additional concurrent executions every 10 seconds, up to the account concurrency limit. If your function needs to absorb a sudden spike of 10,000 RPS with a 100 ms average duration, you need about 10,000 × 0.1 s = 1,000 concurrent environments almost at once. That fits within the scaling rate, but it consumes the entire default 1,000-execution regional limit, so every other function in the account is throttled unless the quota has been raised. Design event sources (SQS batch size, Kinesis shard count) to match your expected burst curve, and use provisioned concurrency to pre-stage capacity before a known traffic event (a product launch, a cron-triggered batch).
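The sizing arithmetic in this paragraph is Little's law: concurrency equals arrival rate times average duration. The sketch below computes that figure and estimates how long a stepwise scaling model takes to reach it. The scaling parameters are deliberately arguments rather than constants, because published rates differ by region and have changed over time; the defaults are illustrative assumptions, not authoritative limits.

```python
import math


def required_concurrency(rps, avg_duration_ms):
    """Little's law: in-flight invocations = arrival rate x average duration."""
    return math.ceil(rps * avg_duration_ms / 1000)


def seconds_to_reach(target, initial_burst=1000, step=1000, interval_s=10):
    """Rough time for a stepwise scaling model to reach `target` concurrency.

    Assumes `initial_burst` environments are available immediately and `step`
    more are added every `interval_s` seconds. All three values are assumptions.
    """
    if target <= initial_burst:
        return 0
    extra_steps = math.ceil((target - initial_burst) / step)
    return extra_steps * interval_s


spike = required_concurrency(rps=10_000, avg_duration_ms=100)
print("concurrency needed:", spike)
print("seconds to scale:", seconds_to_reach(spike))
print("uses whole default account limit:", spike >= 1000)
```

Running the same functions with a larger target, or with the slower per-minute step of the older model, shows quickly whether a launch needs pre-staged provisioned concurrency or a quota increase.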

The following Terraform snippet expresses a production-grade Lambda configuration encapsulating all four dimensions discussed in this lesson:

# Terraform: production Lambda with right-sized memory, reserved + provisioned concurrency
# Assumes aws_iam_role.lambda_exec, data.archive_file.lambda_zip, and
# aws_dynamodb_table.main are defined elsewhere in the configuration.
resource "aws_lambda_function" "api_handler" {
  function_name = "api-handler-prod"
  role          = aws_iam_role.lambda_exec.arn
  runtime       = "python3.12"
  handler       = "main.handler"
  filename      = data.archive_file.lambda_zip.output_path
  publish       = true # provisioned concurrency needs a published version behind the alias

  memory_size   = 1024      # 1 GB: measured p99 duration halved vs 512 MB
  timeout       = 10        # 2-3x p99 measured duration (p99 = 3.8 s)
  architectures = ["arm64"] # Graviton2: ~20 % cheaper, same or better perf

  reserved_concurrent_executions = 200 # protects downstream DynamoDB table

  environment {
    variables = {
      LOG_LEVEL  = "WARNING"
      TABLE_NAME = aws_dynamodb_table.main.name
    }
  }

  snap_start {
    apply_on = "None" # Python; set "PublishedVersions" for Java 21
  }
}

# Alias referenced by the resources below; the name "live" is an assumption.
resource "aws_lambda_alias" "live" {
  name             = "live"
  function_name    = aws_lambda_function.api_handler.function_name
  function_version = aws_lambda_function.api_handler.version
}

resource "aws_lambda_provisioned_concurrency_config" "api_handler" {
  function_name                     = aws_lambda_function.api_handler.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 10 # keep 10 environments pre-warmed
}

resource "aws_appautoscaling_target" "lambda_pc" {
  max_capacity       = 100
  min_capacity       = 5
  resource_id        = "function:${aws_lambda_function.api_handler.function_name}:${aws_lambda_alias.live.name}"
  scalable_dimension = "lambda:function:ProvisionedConcurrency"
  service_namespace  = "lambda"
}

resource "aws_appautoscaling_policy" "lambda_pc_tracking" {
  name               = "pc-utilization-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.lambda_pc.resource_id
  scalable_dimension = aws_appautoscaling_target.lambda_pc.scalable_dimension
  service_namespace  = aws_appautoscaling_target.lambda_pc.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 0.7 # scale when provisioned utilization exceeds 70 %

    predefined_metric_specification {
      predefined_metric_type = "LambdaProvisionedConcurrencyUtilization"
    }
  }
}

Finally, prefer arm64 (Graviton2) over x86_64 for new functions whenever your runtime supports it. AWS charges approximately 20% less per GB-second for arm64 invocations, and throughput for CPU-bound Python, Node.js, and Java workloads is typically equal or higher, though you should confirm that with your own benchmarks. The main reason not to use arm64 is a native dependency compiled for x86 that lacks an arm64 build, which is increasingly rare in 2025.

Observability hook: The Lambda service emits a REPORT log line on every invocation with Duration, Billed Duration, Memory Size, Max Memory Used, Init Duration (cold starts only), and Restore Duration (SnapStart only). Querying these with CloudWatch Logs Insights, or shipping them to your OTEL collector, and plotting the p50/p99/p99.9 of Billed Duration split by cold/warm is the single most valuable Lambda dashboard you can build. We cover this fully in Lesson 7 (Serverless Observability).
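As a small preview, the fields in a REPORT line can be pulled out with a regular expression. This is a sketch only: the sample line below is illustrative rather than captured from a real function, and the helper name parse_report is an assumption. In practice CloudWatch Logs Insights extracts these fields for you.

```python
import re

# Longer labels are tried at their own start position, so "Billed Duration"
# and "Init Duration" are not mistaken for the plain "Duration" field.
_FIELD = re.compile(
    r"(Duration|Billed Duration|Memory Size|Max Memory Used|"
    r"Init Duration|Restore Duration): ([\d.]+) (?:ms|MB)"
)


def parse_report(line):
    """Return the numeric fields of a REPORT line plus a cold-start flag."""
    if not line.startswith("REPORT"):
        return None
    fields = {name: float(value) for name, value in _FIELD.findall(line)}
    # Init Duration is only present when the environment was just initialized.
    fields["cold_start"] = "Init Duration" in fields
    return fields


# Illustrative sample; real lines are tab-separated in the same way.
sample = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 102.25 ms\tBilled Duration: 103 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 87 MB\tInit Duration: 215.40 ms"
)
report = parse_report(sample)
print(report["Billed Duration"], report["Max Memory Used"], report["cold_start"])
```

Splitting on the cold_start flag before computing percentiles keeps a handful of slow cold starts from hiding in, or distorting, the warm-path latency distribution.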