Serverless & Event-Driven Operations

Serverless Observability

Traditional observability assumes long-lived processes: you attach an agent, stream metrics over time, and correlate traces across services running in containers or VMs you control. Serverless breaks those assumptions. A Lambda invocation lasts from milliseconds to minutes, runs in a managed micro-VM you never see, and may execute as hundreds of concurrent copies. You cannot SSH into it, you cannot attach a profiler, and you cannot rely on local state surviving once the execution environment is reclaimed. Yet you still need to know exactly what happened, why it was slow, and what it cost. Serverless observability is the discipline of building that visibility from the outside in, through structured logs, distributed traces, and cost-aware metrics, without modifying the underlying platform.

The Three Pillars in a Serverless Context

Logs are your primary signal. CloudWatch Logs ingests every line written to stdout/stderr from a Lambda invocation. But raw print statements produce unstructured text that is expensive to query and nearly impossible to correlate across invocations at scale. The production baseline is structured JSON logging — one JSON object per log event, every event carrying a consistent set of fields so downstream tooling can index, filter, and aggregate without parsing.
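To make "one JSON object per log event" concrete, here is a minimal sketch using only the Python standard library. The field names and the `context` key passed through `extra=` are assumptions chosen for illustration, not a standard; a library such as Powertools supplies a richer envelope.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
        }
        # Domain fields arrive via extra={"context": {...}} (hypothetical convention).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, default=str)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler()  # Lambda forwards stdout/stderr to CloudWatch Logs
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger("order-processor")
logger.info("Processing order", extra={"context": {"order_id": "test-123", "amount": 49.99}})
```

Because every event carries the same top-level keys, a query can filter on `level` or `order_id` directly instead of pattern-matching free text.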

Traces give you the request path across function boundaries. A single user action may invoke five Lambda functions, touch two DynamoDB tables, put a message on SQS, and call a third-party API. Without distributed tracing you see the outcome but cannot identify which hop added 800 ms. AWS X-Ray and OpenTelemetry are the two dominant approaches; they are increasingly interoperable.

Metrics at the Lambda layer come largely for free from CloudWatch: invocations, errors, duration, throttles, concurrent executions, and iterator age (for stream sources). The gap is business metrics and cost metrics, which you must emit explicitly.

Structured Logging at Production Scale

The canonical approach for Python Lambda is AWS Lambda Powertools. Powertools' Logger injects a standard envelope (function name, version, cold start flag, request ID, correlation ID) on every log record automatically. You add domain fields; the library handles serialisation and log level filtering.

# requirements.txt
aws-lambda-powertools[tracer]==2.38.0  # the tracer extra installs aws-xray-sdk, which Tracer needs

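# handler.py
from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.logging import correlation_paths
from aws_lambda_powertools.metrics import MetricUnit

# process_order and InsufficientInventoryError are assumed to live in an
# application module (here called `orders`) that this lesson does not show.
from orders import InsufficientInventoryError, process_order

logger  = Logger(service="order-processor")
tracer  = Tracer(service="order-processor")
metrics = Metrics(namespace="ECommerce", service="order-processor")

# This handler receives EventBridge events, so the correlation ID is taken from
# the event's top-level "id" field. An unquoted path such as
# headers.X-Correlation-Id is not valid JMESPath because of the hyphens.
# log_event=True logs the full payload; turn it off if events carry sensitive data.
@logger.inject_lambda_context(log_event=True, correlation_id_path=correlation_paths.EVENT_BRIDGE)
@tracer.capture_lambda_handler
@metrics.log_metrics(capture_cold_start_metric=True)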
def handler(event, context):
    order_id = event["detail"]["orderId"]
    logger.info("Processing order", extra={"order_id": order_id, "amount": event["detail"]["amount"]})

    try:
        result = process_order(order_id)
        metrics.add_metric(name="OrdersProcessed", unit=MetricUnit.Count, value=1)
        logger.info("Order processed successfully", extra={"order_id": order_id, "duration_ms": result.elapsed_ms})
        return {"statusCode": 200}
    except InsufficientInventoryError as e:
        logger.warning("Inventory shortfall", extra={"order_id": order_id, "sku": e.sku})
        metrics.add_metric(name="InventoryShortfalls", unit=MetricUnit.Count, value=1)
        return {"statusCode": 409}
    except Exception:
        logger.exception("Unhandled error processing order", extra={"order_id": order_id})
        raise

Every record emitted by this handler is valid JSON with fields like level, message, service, function_name, cold_start, correlation_id, and any extras you append. CloudWatch Logs Insights can then query across millions of invocations in seconds using those field names directly.

Key practice: Always include a correlation_id (or trace_id) that propagates from the first entry point (API Gateway, EventBridge, or SQS) through every downstream function. Without it, reconstructing a multi-function flow in CloudWatch Logs is guesswork. Powertools extracts it from the incoming event using a configurable JMESPath expression (correlation_id_path) and adds it to every log record; each function still has to pass the ID on to the next hop.
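Propagation is the part no library can do for you end to end. The sketch below shows one way to read a correlation ID from three common event shapes and attach it to an outgoing SQS message. The header name `x-correlation-id` and the `correlationId` attribute and field names are assumptions for illustration; use whatever your producers agree on.

```python
import uuid
from typing import Any, Dict

HEADER = "x-correlation-id"  # hypothetical header name


def extract_correlation_id(event: Dict[str, Any]) -> str:
    """Return the caller's correlation ID, or mint one at the first entry point."""
    # API Gateway proxy events: header names may arrive in any case.
    headers = event.get("headers") or {}
    for name, value in headers.items():
        if name.lower() == HEADER:
            return value
    # EventBridge events: assumes the producer copied the ID into `detail`.
    detail = event.get("detail") or {}
    if isinstance(detail, dict) and "correlationId" in detail:
        return detail["correlationId"]
    # SQS batches: read it from the first record that carries the attribute.
    for record in event.get("Records", []):
        attr = (record.get("messageAttributes") or {}).get("correlationId")
        if attr and "stringValue" in attr:
            return attr["stringValue"]
    return str(uuid.uuid4())


def outgoing_sqs_attributes(correlation_id: str) -> Dict[str, Dict[str, str]]:
    """MessageAttributes to pass to SQS send_message for the next hop."""
    return {"correlationId": {"DataType": "String", "StringValue": correlation_id}}
```

Minting a fresh ID only when none is present keeps the first entry point as the single source of the identifier.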

Distributed Tracing with X-Ray and OpenTelemetry

AWS X-Ray is the native choice. Once patched (Powertools Tracer patches supported libraries by default), the X-Ray SDK instruments the AWS SDK for Python, Node.js, or Java so every DynamoDB call, S3 put, and SQS send appears as a traced subsegment with duration and fault flags. Lambda itself creates the function's segment when X-Ray active tracing is enabled on the function, so no SDK initialisation code is required for that part.

Figure: a single user request traced across two Lambda functions and their downstream subsegments. The third-party API at 64 ms is the latency hotspot.

For teams standardising on OpenTelemetry (OTEL), AWS provides the AWS Distro for OpenTelemetry (ADOT) Lambda layer. It ships the OTEL Collector as a Lambda extension in a sidecar process, collects OTEL spans from your function, and exports to X-Ray, Jaeger, Grafana Tempo, or any OTLP-compatible backend. The ADOT approach avoids vendor lock-in and is increasingly the standard recommendation for greenfield services.

# Enabling X-Ray active tracing via Terraform
resource "aws_lambda_function" "order_processor" {
  function_name = "order-processor"
  handler       = "handler.handler"
  runtime       = "python3.12"
  filename      = data.archive_file.lambda_zip.output_path
  role          = aws_iam_role.lambda_exec.arn   # assumed: execution role defined elsewhere

  tracing_config {
    # Active: Lambda samples incoming requests and traces the sampled ones.
    # PassThrough: Lambda traces only when the upstream caller already sampled the request.
    mode = "Active"
  }

  environment {
    variables = {
      POWERTOOLS_SERVICE_NAME            = "order-processor"
      POWERTOOLS_TRACER_CAPTURE_RESPONSE = "true"
      LOG_LEVEL                          = "INFO"
    }
  }
}

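# ADOT layer (X-Ray + OTLP export). The ARN is region- and architecture-specific;
# take the current value from the ADOT Lambda documentation. `aws lambda list-layers`
# only lists layers in your own account, so it will not show this AWS-published layer.
locals {
  adot_layer_arn = "arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-21-0:1"
}
# Attach it with layers = [local.adot_layer_arn] on the function resource and set
# AWS_LAMBDA_EXEC_WRAPPER = "/opt/otel-instrument" in its environment variables.
# aws_lambda_layer_version_permission is not needed here: it grants other accounts
# access to a layer you own.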
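Whichever exporter you choose, X-Ray trace context travels between hops as a tracing header, and Lambda exposes the header for the current invocation in the `_X_AMZN_TRACE_ID` environment variable. The following sketch parses that header so the trace ID can be logged next to the correlation ID; the function names are illustrative, and the sample header value is a made-up example in the documented format.

```python
import os
from typing import Dict, Mapping


def parse_trace_header(header: str) -> Dict[str, str]:
    """Split an X-Ray tracing header into its key/value fields."""
    fields: Dict[str, str] = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    return fields


def current_trace(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    # Empty dict when the variable is absent (for example, outside Lambda).
    return parse_trace_header(env.get("_X_AMZN_TRACE_ID", ""))


example = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
trace = parse_trace_header(example)
print(trace["Root"], "sampled" if trace.get("Sampled") == "1" else "not sampled")
```

Logging the `Root` value on every record gives you a direct path from a log line to the corresponding trace.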

CloudWatch Logs Insights for Operational Investigation

When an incident fires, CloudWatch Logs Insights lets you query structured logs across all invocations in a time window without exporting data. Learning a handful of query patterns means the difference between a 10-minute investigation and a 2-hour grep session.

# Find all ERROR events in the last 30 min with their order_id and correlation_id
fields @timestamp, level, message, order_id, correlation_id, @requestId
| filter level = "ERROR"
| sort @timestamp desc
| limit 50

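# P99 duration breakdown per function version (Lambda emits REPORT lines automatically).
# The version is not a discovered field; it is parsed from the log stream name,
# which has the form YYYY/MM/DD/[version]instance-id.
filter @type = "REPORT"
| parse @logStream /\[(?<function_version>[^\]]+)\]/
| stats
    count()               as invocations,
    avg(@duration)        as avg_ms,
    pct(@duration, 99)    as p99_ms,
    max(@duration)        as max_ms,
    sum(@billedDuration)  as total_billed_ms
  by function_version
| sort p99_ms desc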

# Detect cold starts and their duration overhead
filter @type = "REPORT" and @initDuration > 0
| stats
    count()                as cold_starts,
    avg(@initDuration)     as avg_init_ms,
    pct(@initDuration, 99) as p99_init_ms
  by bin(5m)

# Correlate all log lines for a single trace
fields @timestamp, level, message, order_id
| filter correlation_id = "req-abc123"
| sort @timestamp asc

Production pattern: Save a CloudWatch Logs Insights query for each of your top three operational investigations (error rate, cold start overhead, p99 by version) and add them to your team's CloudWatch dashboard. During an incident you want those queries running within 30 seconds, not written from memory under pressure.
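The same queries can be run from a script or runbook. This is a minimal sketch that takes a boto3 CloudWatch Logs client as a parameter and polls until the query finishes; the helper name, the timeout values, and the log group in the usage comment are assumptions.

```python
import time
from typing import Any, Dict, List


def run_insights_query(logs_client: Any, log_group: str, query: str,
                       minutes: int = 30, poll_seconds: float = 1.0,
                       timeout_seconds: float = 60.0) -> List[Dict[str, str]]:
    """Start a Logs Insights query over the last `minutes` and return its rows."""
    end = int(time.time())
    start = end - minutes * 60
    query_id = logs_client.start_query(
        logGroupName=log_group, startTime=start, endTime=end, queryString=query
    )["queryId"]

    deadline = time.monotonic() + timeout_seconds
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [{c["field"]: c["value"] for c in row} for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        if time.monotonic() > deadline:
            raise TimeoutError(f"query {query_id} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


ERRORS_QUERY = """
fields @timestamp, level, message, order_id, correlation_id
| filter level = "ERROR"
| sort @timestamp desc
| limit 50
"""

# Usage (requires boto3 and AWS credentials; the log group name is an assumption):
#   import boto3
#   rows = run_insights_query(boto3.client("logs"),
#                             "/aws/lambda/order-processor", ERRORS_QUERY)
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub.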

Cost-Aware Monitoring

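Serverless billing is direct: you pay for GB-seconds (memory allocated × duration in seconds) plus a per-request charge. Unlike EC2, where cost follows the capacity you provision, Lambda cost is a real-time function of your code's efficiency. A 10% regression in average duration is roughly a 10% increase in compute charges. At scale this matters. A function invoked 50 million times per day at 128 MB and 200 ms average consumes 0.025 GB-seconds per invocation, or 37.5 million GB-seconds over a 30-day month. At a list price of about $0.0000167 per GB-second (x86, us-east-1; check current pricing) that is roughly $625 per month in compute charges. A regression to 256 MB at 250 ms consumes 0.0625 GB-seconds per invocation, roughly $1,560 per month: 2.5 times the compute cost from a single deployment, before the per-request charge, which does not change.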
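The arithmetic behind a comparison like this is easy to check in a few lines. The sketch below assumes x86 list prices for us-east-1 ($0.0000166667 per GB-second and $0.20 per million requests), ignores the free tier, and uses a 30-day month; treat the prices as assumptions and confirm them against current AWS pricing.

```python
# Assumed list prices (x86, us-east-1, no free tier); check current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def monthly_cost(invocations_per_day: int, memory_mb: int, avg_duration_ms: float,
                 days: int = 30) -> dict:
    """Estimate monthly Lambda cost from invocation volume, memory, and duration."""
    invocations = invocations_per_day * days
    gb_seconds = invocations * (memory_mb / 1024) * (avg_duration_ms / 1000)
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return {
        "compute": round(compute, 2),
        "requests": round(requests, 2),
        "total": round(compute + requests, 2),
    }


before = monthly_cost(50_000_000, 128, 200)
after = monthly_cost(50_000_000, 256, 250)
print(before)
print(after)
print(f"compute ratio: {after['compute'] / before['compute']:.2f}x")
```

Doubling memory and adding 25% to duration multiplies GB-seconds by 2 × 1.25 = 2.5, while the request charge stays the same because the invocation count is unchanged.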

The practical levers are:

Memory sizing with AWS Lambda Power Tuning: an open-source Step Functions state machine that runs your function at each memory configuration you list (Lambda supports 128 MB to 10 GB) and plots cost vs. duration. The optimal configuration is rarely the minimum memory: Lambda allocates CPU in proportion to memory, so a larger setting often cuts duration enough to offset the higher per-millisecond price.
Duration alarms: create a CloudWatch alarm on the p99 of the Duration metric per function. CloudWatch does not publish billed duration as a metric (it appears only in the REPORT log lines), but Duration tracks it closely. A spike is your first signal of a performance regression, often before users notice latency.
AWS Cost Anomaly Detection: configure a monitor scoped to the Lambda service. It uses ML to detect cost spikes that deviate from your historical pattern and sends an SNS alert. At large organisations this catches runaway retry loops (a misconfigured SQS trigger retrying thousands of poison-pill messages indefinitely) before the monthly bill lands.

# Lambda Power Tuning — deploy the SAR app once per account, then invoke:
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:order-processor",
    "powerValues": [128, 256, 512, 1024, 1769, 3008],
    "num": 10,
    "payload": "{\"detail\":{\"orderId\":\"test-123\",\"amount\":49.99}}",
    "parallelInvocation": true,
    "strategy": "cost"
  }'

# CloudWatch metric alarm on Duration p99 as a proxy for billed duration (Terraform)
resource "aws_cloudwatch_metric_alarm" "billed_duration_p99" {
  alarm_name          = "order-processor-billed-duration-p99"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "Duration"
  namespace           = "AWS/Lambda"
  period              = 60
  extended_statistic  = "p99"
  threshold           = 3000    # 3 s — tune to your SLO
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    FunctionName = aws_lambda_function.order_processor.function_name
  }
}

Custom Metrics and Dashboards

Lambda Powertools' Metrics module uses the Embedded Metric Format (EMF), a CloudWatch Logs JSON schema that CloudWatch automatically extracts into real CloudWatch Metrics without any PutMetricData API calls. You emit business metrics (orders processed, inventory shortfalls, payment failures) as structured log lines, and CloudWatch creates the time-series metrics asynchronously. At high throughput this avoids a synchronous put_metric_data call, with its per-request charge and latency, on every invocation; the usual custom-metric and log-ingestion charges still apply.
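To show what such a log line looks like, here is a hand-rolled sketch of an EMF record using only the standard library. It illustrates the schema and is not how Powertools is implemented; the helper name and the single `service` dimension are choices made for this example.

```python
import json
import time
from typing import Dict


def emf_record(namespace: str, service: str, metrics: Dict[str, float],
               unit: str = "Count") -> str:
    """Build one Embedded Metric Format log line for the given metrics."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["service"]],
                "Metrics": [{"Name": name, "Unit": unit} for name in metrics],
            }],
        },
        "service": service,  # dimension value, referenced by name above
    }
    record.update(metrics)   # metric values live at the top level of the record
    return json.dumps(record)


# Inside Lambda, writing this line to stdout is enough for CloudWatch to extract it.
print(emf_record("ECommerce", "order-processor", {"OrdersProcessed": 1}))
```

The `_aws` block is metadata telling CloudWatch which top-level keys are metrics and which are dimensions; everything else in the record remains queryable as an ordinary log field.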

Production pitfall (iterator age silent bleed): For Lambda functions consuming Kinesis or DynamoDB Streams, the IteratorAge metric (the age of the last record in a batch when Lambda receives it, which shows how far behind the stream the function is running) is the most important health signal, and no alarm exists on it unless you create one. An iterator age trending from 500 ms to 45 minutes means your function cannot keep up with the stream. Set a CloudWatch alarm on the IteratorAge maximum at a threshold of 60 seconds (60,000, because the metric is reported in milliseconds). Without this alarm, a slow processing regression silently builds up a multi-hour backlog before anyone notices.
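An IteratorAge alarm of this kind could be created with boto3 along the following lines. The sketch only builds the arguments for CloudWatch `put_metric_alarm`; the alarm name, evaluation settings, and SNS topic ARN are assumptions to adapt.

```python
from typing import Any, Dict


def iterator_age_alarm(function_name: str, sns_topic_arn: str,
                       threshold_seconds: int = 60) -> Dict[str, Any]:
    """Keyword arguments for CloudWatch put_metric_alarm on IteratorAge."""
    return {
        "AlarmName": f"{function_name}-iterator-age",
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold_seconds * 1000,  # IteratorAge is reported in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Usage (requires boto3 and AWS credentials; the topic ARN is a placeholder):
#   import boto3
#   params = iterator_age_alarm("order-processor",
#                               "arn:aws:sns:us-east-1:123456789012:alerts")
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Converting seconds to milliseconds in one place avoids the common mistake of setting a threshold of 60 and alarming on 60 ms of lag.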

Observability as a Feedback Loop into Cost and Architecture

At senior level, observability data is not just for debugging; it drives architectural decisions. A trace showing that 70% of your function's wall time is spent waiting on a synchronous DynamoDB call suggests caching or batching. A cost analysis showing that 40% of invocations are sub-100-ms calls, where the per-request charge and invocation overhead make up a large share of the bill, suggests consolidating into fewer, longer-running functions or switching the trigger to SQS batching. The cold-start P99 from your Logs Insights queries tells you whether provisioned concurrency is worth its cost for a given function. In serverless, your observability data is your capacity planning data; the two are inseparable.