Serverless & Event-Driven Operations

Cold Starts & Performance


In a traditional long-running service, the runtime overhead of starting your process — loading the JVM, initialising the Spring context, pulling secrets — is paid once at deploy time and then amortised over millions of requests. In a serverless function, that overhead can be paid on every invocation that hits a cold execution environment. Understanding exactly where that latency comes from, and knowing when and how to mitigate it, is one of the most practically consequential skills for operating Lambda at scale.

Cold starts at production scale: At Amazon's own retail scale, even a 1 % cold-start rate on a high-traffic function with 10 ms cold-start overhead is practically invisible, and at p50 it does not show up at all. The harm is at the tail: a Java function with a 3-second cold start will blow through a 1-second latency budget, and if the API Gateway integration timeout is configured that low, real users receive a 504. The right engineering question is never "do cold starts exist?" but "what is my p99 cold-start latency, and does it breach my SLO?"
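To make the tail argument concrete, the arithmetic can be sketched in a few lines. This is only an illustration with made-up numbers (1,000 requests, 50 ms warm latency, 2 % of requests paying a 3,000 ms cold-start penalty) and a nearest-rank percentile; none of the figures are measurements, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def simulate_latencies(total, cold_count, warm_ms, cold_penalty_ms):
    """Build a latency list where cold_count requests pay the cold-start penalty."""
    warm = [warm_ms] * (total - cold_count)
    cold = [warm_ms + cold_penalty_ms] * cold_count
    return warm + cold


# Assumed workload: 1,000 requests, 20 of them (2 %) cold.
latencies = simulate_latencies(total=1000, cold_count=20, warm_ms=50, cold_penalty_ms=3000)
p50 = percentile(latencies, 50)  # 50: the median never sees a cold start
p99 = percentile(latencies, 99)  # 3050: the tail is dominated by cold starts
print(f"p50={p50} ms, p99={p99} ms")
```

The median is unchanged by the cold starts, while the p99 is entirely determined by them, which is why the SLO question has to be asked at the tail.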

Cold Start Anatomy: What Actually Happens

When Lambda decides it needs a new execution environment for your function, it executes a fixed sequence of operations. Each has its own latency budget:

  1. Hypervisor slot allocation (~1–10 ms): Lambda runs on Firecracker micro-VMs. The control plane allocates a MicroVM slot. This is AWS-internal and you have zero influence over it.
  2. Runtime bootstrap (~50–500 ms for managed runtimes): The runtime process (python3.12, nodejs20.x, the JVM for java21) is initialised inside the MicroVM. JVM-based runtimes pay the highest cost here: classloading, JIT warmup, and bytecode verification are inherently expensive. Node.js and Python are much cheaper (tens of milliseconds).
  3. Function init code (your code, outside the handler): Every line of module-level code in your Lambda — SDK client construction, database connection setup, environment variable reads, configuration parsing — executes in sequence before the handler is callable. This is the part you control completely and where the biggest wins live.
  4. Handler invocation: The actual handler runs for the first time. For the purposes of cold-start measurement, this is the finish line. The sum of steps 1–3 is the observable cold-start latency from the perspective of an upstream caller.
[Figure: Lambda cold start vs. warm start anatomy. Cold start: MicroVM allocation (~5 ms), runtime bootstrap (50–500 ms, runtime-dependent), init code (your code, variable), then the handler; the caller observes the total. Warm start: an existing execution environment is reused, adding ~0.1–2 ms of overhead before the handler. Provisioned Concurrency: pre-initialised environments are kept warm, so steps 1–3 are already paid and every invocation starts at the handler.]
Cold start vs. warm start anatomy, and how Provisioned Concurrency eliminates the observable cold-start phase.
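The split between init code and handler can be observed from inside the function. The sketch below relies only on the fact that module-level code runs once per execution environment; the flag name and the response shape are illustrative choices, not a Lambda convention.

```python
# Module-level code runs once per execution environment, during the init phase.
_is_cold = True


def handler(event, context):
    """Report whether this invocation was the first one in its execution environment."""
    global _is_cold
    was_cold = _is_cold
    _is_cold = False  # every later invocation in this environment is warm
    return {"cold_start": was_cold}
```

Called twice in the same process, the handler reports a cold start only the first time, which mirrors what happens when Lambda reuses an environment.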

Measuring Cold Starts: The Right Metrics

AWS Lambda publishes an Init Duration field in CloudWatch Logs when a cold start occurs. This field is absent on warm invocations, which makes it straightforward to filter and measure. The metrics you should be tracking in production:

  • Init Duration p50/p95/p99: Extract via CloudWatch Logs Insights. The p99 is your worst-case caller experience and is the figure that matters for SLO compliance.
  • Cold-start rate: cold_starts / total_invocations over a sliding window. At sustained high traffic this approaches 0 %; at low-traffic or bursty patterns it can exceed 10 %.
  • Concurrency metrics: ConcurrentExecutions and UnreservedConcurrentExecutions from CloudWatch Metrics. A sudden spike in concurrent executions directly predicts a cold-start burst.
    # CloudWatch Logs Insights query: cold start analysis
    # Run over your function's log group: /aws/lambda/<function-name>
    fields @timestamp, @requestId, @initDuration, @duration, @billedDuration
    | filter @initDuration > 0
    | stats count() as cold_starts,
            avg(@initDuration) as avg_init_ms,
            pct(@initDuration, 95) as p95_init_ms,
            pct(@initDuration, 99) as p99_init_ms,
            max(@initDuration) as max_init_ms
            by bin(5m) as window
    | sort window desc
    | limit 60
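The same figures can be derived offline from exported log lines. The following is a minimal sketch that assumes the standard tab-separated REPORT line Lambda writes per invocation, where the Init Duration field appears only on cold starts; the request IDs and durations in the sample are invented for illustration.

```python
import re

# "Init Duration" is only present on the REPORT line of a cold start.
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def summarise_reports(lines):
    """Return (cold_start_rate, sorted init durations in ms) for Lambda REPORT log lines."""
    reports = [line for line in lines if line.startswith("REPORT")]
    inits = sorted(
        float(match.group(1))
        for line in reports
        if (match := INIT_RE.search(line))
    )
    rate = len(inits) / len(reports) if reports else 0.0
    return rate, inits


# Hypothetical sample: one cold start followed by three warm invocations.
sample = [
    "REPORT RequestId: a1\tDuration: 12.50 ms\tBilled Duration: 13 ms\t"
    "Memory Size: 512 MB\tMax Memory Used: 80 MB\tInit Duration: 412.30 ms",
    "REPORT RequestId: a2\tDuration: 9.10 ms\tBilled Duration: 10 ms\t"
    "Memory Size: 512 MB\tMax Memory Used: 80 MB",
    "REPORT RequestId: a3\tDuration: 8.70 ms\tBilled Duration: 9 ms\t"
    "Memory Size: 512 MB\tMax Memory Used: 81 MB",
    "REPORT RequestId: a4\tDuration: 9.40 ms\tBilled Duration: 10 ms\t"
    "Memory Size: 512 MB\tMax Memory Used: 81 MB",
]
rate, inits = summarise_reports(sample)
print(f"cold-start rate: {rate:.0%}, init durations: {inits}")
```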

Init Code Optimisation: The Highest-Leverage Work

Your init code (module-level initialisation outside the handler) runs once per cold start. Any latency you remove from init code is removed from every cold start permanently. This is where experienced engineers spend their time before reaching for Provisioned Concurrency. Common patterns:

  • Lazy-initialise optional clients: If a Secrets Manager client, a DynamoDB table reference, or an SQS queue URL is only needed in certain code paths, move it inside the handler or behind a module-level singleton that initialises on first call. Do not pay for it on every cold start if only 5 % of invocations use it.
  • Resolve secrets once and cache them at module level: A Secrets Manager GetSecretValue call takes roughly 50 ms. Called inside the handler body, it costs 50 ms on every invocation; called once and stored in a module-level variable, it costs 50 ms once per execution environment, because the environment persists between warm invocations (this is the intended pattern). The trade-off is that a cached value does not pick up a rotated secret until it is refreshed or the environment is recycled.
  • Import only what you need: In Python and Node.js, loading SDK code you never call adds to init time. In Node.js, import individual AWS SDK V3 clients, for example import { DynamoDBClient } from "@aws-sdk/client-dynamodb", instead of requiring the monolithic V2 aws-sdk package; V3 is modular specifically to reduce cold starts. In Python, note that from boto3 import client as boto_client still loads the whole boto3 package, so the import style alone saves nothing; the savings come from constructing only the clients you use and deferring imports of heavy, rarely used libraries into the code path that needs them.
  • Avoid expensive synchronous work at init: Reading large configuration files, parsing JSON schemas, or compiling many regex patterns at module level is common and expensive. Wrap each init block in a simple timer (a Date.now() diff in Node.js, time.perf_counter() in Python) and log the result to see where time is spent.
    # Python Lambda: init code optimisation pattern
    import json
    import os

    from boto3 import client as boto_client

    # ---- Module-level singletons (init once, reuse across warm invocations) ----
    _ssm = boto_client("ssm", region_name=os.environ["AWS_REGION"])
    _dyndb = boto_client("dynamodb", region_name=os.environ["AWS_REGION"])

    # Lazy secret resolution with module-level cache
    _DB_PASSWORD: str | None = None


    def _get_db_password() -> str:
        """Fetch the parameter on first use, then serve it from the module-level cache."""
        global _DB_PASSWORD
        if _DB_PASSWORD is None:
            resp = _ssm.get_parameter(
                Name=os.environ["DB_PASS_PARAM"],
                WithDecryption=True,
            )
            _DB_PASSWORD = resp["Parameter"]["Value"]
        return _DB_PASSWORD


    # ---- Handler: no SDK construction; the secret is fetched at most once per environment ----
    def handler(event, context):
        db_pass = _get_db_password()  # served from the cache on warm invocations
        item = _dyndb.get_item(
            TableName=os.environ["TABLE_NAME"],
            Key={"pk": {"S": event["id"]}},
        )
        return {"statusCode": 200, "body": json.dumps(item.get("Item", {}))}
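Before optimising, it helps to know which init block is actually slow. The sketch below is one way to time module-level work with the standard library only; timed_init, INIT_TIMINGS_MS, and the two example blocks are hypothetical names and placeholders, not part of any Lambda API.

```python
import re
import time
from contextlib import contextmanager

INIT_TIMINGS_MS = {}  # label -> elapsed milliseconds, filled in during init


@contextmanager
def timed_init(label):
    """Record how long the wrapped init block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        INIT_TIMINGS_MS[label] = (time.perf_counter() - start) * 1000


# Placeholder init work; replace with your real configuration parsing.
with timed_init("parse_config"):
    CONFIG = {"table": "example"}

with timed_init("compile_patterns"):
    ID_PATTERN = re.compile(r"^[a-z0-9-]{1,64}$")


def handler(event, context):
    # Returning the timings is for illustration; in practice you would log them once.
    return {"init_timings_ms": INIT_TIMINGS_MS}
```

Logging the dictionary on the first invocation gives a per-block breakdown that can be compared against the Init Duration reported by Lambda.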

Provisioned Concurrency: Eliminating Cold Starts on Critical Paths

Provisioned Concurrency (PC) pre-initialises a specified number of execution environments, runs all init code, and keeps those environments ready to accept requests. From a caller's perspective, a PC invocation has no init latency; it is indistinguishable from a warm invocation. You are paying for idle compute, so the decision is a comparison: (PC price × provisioned count × time) versus (invocations × cold-start rate × p99 init duration, weighted by what an SLO breach costs you).
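A rough way to put numbers on both sides of that comparison is sketched below. The price is a placeholder and not a quoted AWS rate, the traffic figures are invented, and the function names are hypothetical; substitute the current rate for your region and your own measurements.

```python
def pc_monthly_cost(count, memory_gb, price_per_gb_second, hours=730):
    """Idle cost of keeping `count` environments provisioned for `hours` hours."""
    return count * memory_gb * price_per_gb_second * hours * 3600


def cold_start_count(invocations, cold_start_rate):
    """Number of requests per period that would pay the init latency without PC."""
    return round(invocations * cold_start_rate)


# Placeholder price per GB-second: an assumption for illustration only.
cost = pc_monthly_cost(count=10, memory_gb=1.0, price_per_gb_second=0.000004)
slow_requests = cold_start_count(invocations=50_000_000, cold_start_rate=0.01)
print(f"PC cost per month: {cost:.2f}, requests hitting a cold start: {slow_requests}")
```

The first number is what PC costs; the second is how many user-facing requests it protects. Whether the trade is worth it depends on what a slow or failed request costs on that path.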

Where PC makes economic and operational sense:

  • APIs with strict p99 latency SLOs (payment flows, auth endpoints, real-time features)
  • JVM-based Lambdas (Java, Kotlin, Scala) where cold starts routinely exceed 1–3 seconds
  • Functions that are invoked at highly variable rates — after a period of zero traffic, the first burst of requests all cold-start simultaneously without PC
  • Scheduled jobs with a tight deadline: a Step Functions task with a 10-second timeout and a 5-second cold start leaves little margin for the actual work
    # Terraform: Provisioned Concurrency configuration
    resource "aws_lambda_function" "payment_api" {
      function_name = "payment-api-${var.env}"
      role          = var.lambda_role_arn # assumption: execution role ARN supplied as a variable
      runtime       = "java21"
      handler       = "com.acme.PaymentHandler::handleRequest"
      memory_size   = 1024
      timeout       = 30
      filename      = "payment-api.zip"
      publish       = true # publish a version; PC cannot target $LATEST

      environment {
        variables = {
          REGION     = var.aws_region
          TABLE_NAME = var.dynamodb_table
        }
      }
    }

    # PC must be attached to a published version or an alias, not $LATEST
    resource "aws_lambda_alias" "live" {
      name             = "live"
      function_name    = aws_lambda_function.payment_api.function_name
      function_version = aws_lambda_function.payment_api.version
    }

    resource "aws_lambda_provisioned_concurrency_config" "payment_api_pc" {
      function_name                     = aws_lambda_function.payment_api.function_name
      qualifier                         = aws_lambda_alias.live.name
      provisioned_concurrent_executions = 10
      # Optional: auto-scale PC based on utilisation
      # See aws_appautoscaling_target + aws_appautoscaling_policy below
    }

    # Application Auto Scaling: scale PC between 5 and 50 based on utilisation
    resource "aws_appautoscaling_target" "pc_target" {
      max_capacity       = 50
      min_capacity       = 5
      resource_id        = "function:${aws_lambda_function.payment_api.function_name}:${aws_lambda_alias.live.name}"
      scalable_dimension = "lambda:function:ProvisionedConcurrency"
      service_namespace  = "lambda"
    }

    resource "aws_appautoscaling_policy" "pc_policy" {
      name               = "payment-api-pc-tracking"
      policy_type        = "TargetTrackingScaling"
      resource_id        = aws_appautoscaling_target.pc_target.resource_id
      scalable_dimension = aws_appautoscaling_target.pc_target.scalable_dimension
      service_namespace  = aws_appautoscaling_target.pc_target.service_namespace

      target_tracking_scaling_policy_configuration {
        target_value = 0.7 # scale up when PC utilisation exceeds 70%

        predefined_metric_specification {
          predefined_metric_type = "LambdaProvisionedConcurrencyUtilization"
        }

        scale_in_cooldown  = 300
        scale_out_cooldown = 60
      }
    }
Target 70 % utilisation, not 100 %: If PC is at 100 % utilisation, any new concurrent invocation overflows to an on-demand cold start. A 70 % target gives you headroom to absorb sudden traffic spikes before auto-scaling responds, which can take a minute or more because it reacts to CloudWatch metrics. The gap between your 70 % set-point and 100 % is your burst buffer; size it according to your traffic volatility.
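The sizing rule above reduces to simple integer arithmetic. This is a back-of-the-envelope sketch, not an AWS formula: the function names are hypothetical, and the peak concurrency of 35 is an assumed input you would take from your own ConcurrentExecutions data.

```python
def required_pc(peak_concurrency, target_pct):
    """Smallest PC count that keeps utilisation at or below target_pct at the given peak."""
    return -(-peak_concurrency * 100 // target_pct)  # integer ceiling division


def burst_buffer(pc_count, target_pct):
    """Environments still free when utilisation sits exactly at the target."""
    return pc_count * (100 - target_pct) // 100


pc_count = required_pc(peak_concurrency=35, target_pct=70)  # 50
buffer = burst_buffer(pc_count, target_pct=70)              # 15
print(f"provision {pc_count} environments, leaving a burst buffer of {buffer}")
```

With these inputs, a steady peak of 35 concurrent requests calls for 50 provisioned environments, and the remaining 15 absorb a spike while auto-scaling catches up.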

SnapStart: Snapshot-Based Cold Start Mitigation for the JVM

AWS Lambda SnapStart (available for the Java 11 and later managed runtimes) takes a snapshot of the fully-initialised execution environment after the init phase and stores it as a Firecracker memory snapshot. On a cold start, Lambda restores from this snapshot rather than reinitialising from scratch. In practice, this can compress JVM cold starts from 3–8 seconds down to 200–600 ms, often with little or no code change and without the per-instance cost of Provisioned Concurrency.

Enabling SnapStart is a single Terraform or console setting:

    # Terraform: Lambda SnapStart for a Java 21 function
    resource "aws_lambda_function" "order_processor" {
      function_name = "order-processor-${var.env}"
      role          = var.lambda_role_arn # assumption: execution role ARN supplied as a variable
      runtime       = "java21"
      handler       = "com.acme.OrderHandler::handleRequest"
      memory_size   = 1024
      timeout       = 60
      filename      = "order-processor.zip"
      publish       = true # SnapStart applies to published versions only

      snap_start {
        apply_on = "PublishedVersions"
      }
    }

    # Lambda captures the snapshot when a version is published, so no warm-up
    # invocation is needed. Invoke the published version (or an alias pointing
    # at it) rather than $LATEST to get SnapStart behaviour.
    # The snapshot is tied to the published version; a new deployment produces
    # a new snapshot from the new version's init phase.

SnapStart has two correctness caveats that every Java engineer must understand before enabling it in production:

  • Uniqueness hooks: Any state that must be unique per execution environment (random seeds, UUIDs generated at init time, TLS session keys) will be identical across all restored instances if generated before the snapshot. Lambda supports runtime hooks through the CRaC (Coordinated Restore at Checkpoint) API: implement org.crac.Resource and register the instance with Core.getGlobalContext().register(...). In the beforeCheckpoint hook, close network connections and release any unique state. In the afterRestore hook, re-establish connections and regenerate unique state.
  • Network connections in init code: A TCP connection to RDS, ElastiCache, or an external API opened at init time will be stale after snapshot restore. Either open connections lazily in the handler body, or use the afterRestore CRaC hook to reconnect.
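The uniqueness problem is easy to reproduce in miniature. The sketch below is a plain-Python simulation, not the CRaC API: copy.deepcopy stands in for snapshot and restore, and ExecutionEnvironment and after_restore are hypothetical names chosen to mirror the hook described above.

```python
import copy
import uuid


class ExecutionEnvironment:
    def __init__(self):
        # Generated during init, so the value ends up inside the snapshot.
        self.instance_id = uuid.uuid4().hex

    def after_restore(self):
        # Stand-in for an afterRestore hook: regenerate per-environment state.
        self.instance_id = uuid.uuid4().hex


snapshot = ExecutionEnvironment()

# Every restored environment starts as an identical copy of the snapshot.
restored = [copy.deepcopy(snapshot) for _ in range(3)]
ids_before = {env.instance_id for env in restored}

for env in restored:
    env.after_restore()
ids_after = {env.instance_id for env in restored}

print(f"distinct IDs before hook: {len(ids_before)}, after hook: {len(ids_after)}")
```

Before the hook runs, all three copies share a single ID; afterwards each has its own. The same reasoning applies to connections, which need to be reopened rather than regenerated.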
SnapStart does not replace Provisioned Concurrency for burst tolerance: SnapStart reduces per-invocation cold-start latency from seconds to hundreds of milliseconds. But if 500 concurrent requests arrive simultaneously and Lambda must provision 500 new execution environments from snapshots, each of those 500 still incurs the (now-shorter) restore latency. The two features also cannot be combined: Lambda does not support Provisioned Concurrency on a function version that has SnapStart enabled. For critical APIs where zero cold starts is the requirement, use auto-scaled Provisioned Concurrency; use SnapStart where a restore latency of a few hundred milliseconds fits within the SLO.

Memory, Timeout, and Architecture: The Other Knobs

Cold starts correlate with function memory configuration. Lambda CPU allocation is proportional to memory: a 128 MB function gets a fraction of a vCPU; a 1769 MB function gets the equivalent of one vCPU; a 3008 MB function gets close to 2. For JVM functions in particular, more memory means faster classloading, faster JIT compilation, and therefore shorter cold starts. The sweet spot for Java cold-start reduction without burning budget is typically 1024–2048 MB.

For ARM64 (Graviton2) functions, cold starts are often comparable to or shorter than x86_64 at equivalent memory settings (measure this for your own workload), and the price per GB-second is roughly 20 % lower. Unless you have a specific reason to stay on x86_64 (native extensions, architecture-specific libraries), new functions should default to architectures = ["arm64"] in Terraform.

Finally, function package size directly affects download and extraction time during cold start: a 50 MB package is fetched and unpacked faster than a 250 MB one. Keep dependencies minimal. Lambda Layers are useful for sharing libraries across functions, but layer contents count toward the same unzipped size limit as the function code, so measure before assuming they shorten cold starts.