Capacity Planning & Autoscaling

Queue-Based & Event-Driven Scaling

18 min Lesson 6 of 27

Out of the box, Kubernetes HPA scales on CPU and memory, signals that measure work currently being processed. For event-driven workloads the signal that actually matters is pending work not yet processed: the consumer lag on a Kafka topic, the length of a RabbitMQ queue, the depth of an SQS queue. CPU and memory are lagging indicators here; queue depth is the leading one. KEDA (Kubernetes Event-driven Autoscaling) closes this gap by wiring external metrics directly into the Kubernetes HorizontalPodAutoscaler machinery. It was donated to the CNCF in 2020 and is now a CNCF Graduated project. It runs on managed and on-prem Kubernetes alike and ships scalers for the major cloud and self-hosted messaging systems.

How KEDA Works: Architecture Under the Hood

KEDA installs three workloads into your cluster, two of which do the scaling work. The KEDA Operator watches ScaledObject and ScaledJob custom resources and translates them into Kubernetes-native HorizontalPodAutoscaler objects: it does not bypass HPA, it drives it. The Metrics Adapter (the keda-operator-metrics-apiserver pod) is an implementation of the Kubernetes external.metrics.k8s.io API: it serves the values KEDA reads from your external source (Kafka, Redis, SQS, Prometheus, Azure Service Bus, …), polled on a configurable interval, so HPA can act on them like any other metric. The third workload is an admission webhook that validates ScaledObject and ScaledJob resources when they are created or updated.

This design matters: kubectl get hpa still shows the KEDA-managed scaler with EXTERNAL metric type. Standard Kubernetes RBAC, PodDisruptionBudgets, and minReplicas/maxReplicas guards all apply. KEDA adds scale-to-zero (HPA cannot go below 1 natively) and scale-from-zero capabilities — when no messages exist, the Deployment drops to 0 replicas and KEDA's own polling loop wakes it when the first message arrives.

KEDA polls external sources via the Metrics Adapter and drives the native Kubernetes HPA — including scale-to-zero.
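
To make the "drives HPA" point concrete, here is roughly what the generated object looks like. This is a sketch under assumptions, not output captured from a cluster: it assumes a ScaledObject named order-processor-scaler that targets a Deployment named order-processor with a Kafka trigger and a per-replica target of 1,000 (illustrative names and values), and the external metric name is a placeholder because KEDA generates it.

```yaml
# Approximate shape of the HPA that KEDA creates and owns (do not edit it by hand).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keda-hpa-order-processor-scaler   # KEDA prefixes the ScaledObject name with keda-hpa-
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 2                          # taken from minReplicaCount
  maxReplicas: 80                         # taken from maxReplicaCount
  metrics:
  - type: External
    external:
      metric:
        name: s0-kafka-orders.created     # placeholder; the real name is generated by KEDA
        selector:
          matchLabels:
            scaledobject.keda.sh/name: order-processor-scaler
      target:
        type: AverageValue                # per-replica target, so replicas grow with the total
        averageValue: "1000"
```

Because this is an ordinary HPA, the usual tooling (kubectl describe hpa, HPA events, behavior policies) works unchanged.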

Installing KEDA

Helm is the standard installation path. KEDA runs in its own keda namespace and registers the metrics server extension API. Pin the version in production — chart version 2.x maps to KEDA operator 2.x (they track together).

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.14.0 \
  --set prometheus.metricServer.enabled=true \
  --set prometheus.operator.enabled=true

# Verify all three KEDA pods are Running
kubectl get pods -n keda
# NAME                                      READY   STATUS    RESTARTS
# keda-operator-6c9b7d8f9c-x4pvk           1/1     Running   0
# keda-operator-metrics-apiserver-xxx       1/1     Running   0
# keda-admission-webhooks-xxx               1/1     Running   0

Scaling on Kafka Consumer Lag

The canonical KEDA use case at scale: a Kafka consumer group is falling behind on its partitions. You want replicas proportional to the total lag across partitions, with a target lag per replica of, say, 1,000 messages, so at 50,000 unprocessed messages you expect ~50 replicas.

# ScaledObject for a Kafka consumer deployment
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: payments
spec:
  scaleTargetRef:
    name: order-processor          # the Deployment to scale
  pollingInterval: 15              # check every 15 s (default 30)
  cooldownPeriod: 60               # wait 60 s after the last active trigger before scaling to 0 (default 300; no effect while minReplicaCount > 0)
  minReplicaCount: 2               # never below 2 (keep warm; 0 = scale-to-zero)
  maxReplicaCount: 80              # hard ceiling matches your node budget
  advanced:
    restoreToOriginalReplicaCount: false
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30   # fast scale-up
          policies:
          - type: Percent
            value: 100
            periodSeconds: 30
        scaleDown:
          stabilizationWindowSeconds: 300  # slow scale-down (5 min)
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-brokers.kafka.svc:9092
      consumerGroup: order-processor-group
      topic: orders.created
      lagThreshold: "1000"         # messages per replica target
      offsetResetPolicy: latest
      # TLS + SASL for production:
      sasl: plaintext
      tls: enable
    authenticationRef:
      name: kafka-trigger-auth     # TriggerAuthentication referencing a Secret
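
The authenticationRef above points at a TriggerAuthentication that is not shown. A minimal sketch of what it could look like, assuming a Secret named kafka-credentials (a hypothetical name) in the same namespace that holds username, password, and ca keys:

```yaml
# Hypothetical TriggerAuthentication for the Kafka trigger.
# Assumes a Secret "kafka-credentials" with the keys referenced below.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-trigger-auth
  namespace: payments            # must live in the same namespace as the ScaledObject
spec:
  secretTargetRef:
  - parameter: username          # SASL username passed to the Kafka scaler
    name: kafka-credentials
    key: username
  - parameter: password          # SASL password
    name: kafka-credentials
    key: password
  - parameter: ca                # CA certificate used to verify the brokers over TLS
    name: kafka-credentials
    key: ca
```

Keeping credentials in a TriggerAuthentication rather than in trigger metadata means the ScaledObject itself can be committed to Git without secrets.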

lagThreshold math: KEDA hands lagThreshold to the HPA as a per-replica average target, so the calculation works out to desiredReplicas = ceil(totalLag / lagThreshold). With 47,300 messages and a threshold of 1,000 you get ceil(47.3) = 48. This is simple arithmetic and easy to reason about during incident reviews.
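
In practice the raw result is clamped before it takes effect. As a sketch, write L for the total lag, T for lagThreshold, and P for the topic's partition count. The partition cap is the Kafka scaler's default behavior (it can be lifted with allowIdleConsumers: "true"), and the figure of 32 partitions below is an assumption for illustration only:

```latex
\text{desiredReplicas}
  = \min\Bigl(\max\bigl(\lceil L / T \rceil,\ \text{minReplicaCount}\bigr),\ \text{maxReplicaCount},\ P\Bigr)

% Worked example: L = 47{,}300,\ T = 1{,}000,\ \text{minReplicaCount} = 2,\ \text{maxReplicaCount} = 80,\ P = 32
\min\bigl(\max(48,\ 2),\ 80,\ 32\bigr) = 32
```

The HPA behavior policies then limit how quickly the replica count moves toward that number; they do not change the target itself.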

Scaling on SQS Queue Depth

SQS is the most common trigger in AWS-native shops. KEDA uses the aws-sqs-queue scaler, which calls GetQueueAttributes to read ApproximateNumberOfMessages. You need a TriggerAuthentication or an IRSA annotation — prefer IRSA in EKS to avoid long-lived credentials in Secrets.

# TriggerAuthentication using IRSA (IAM Roles for Service Accounts on EKS)
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: sqs-trigger-auth
  namespace: processing
spec:
  podIdentity:
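    # IRSA-based auth; identityOwner on the trigger selects the workload's or the operator's role.
    # Note: aws-eks is deprecated since KEDA 2.13 in favor of provider: aws.
    provider: aws-eks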

---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: image-resize-scaler
  namespace: processing
spec:
  scaleTargetRef:
    name: image-resizer
  minReplicaCount: 0               # true scale-to-zero: no idle cost
  maxReplicaCount: 200
  pollingInterval: 20
  cooldownPeriod: 120
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/image-resize
      queueLength: "50"            # messages per replica
      awsRegion: us-east-1
      identityOwner: operator      # scaler uses the KEDA operator's IAM role; pod (the default) uses the workload's role
    authenticationRef:
      name: sqs-trigger-auth
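
IRSA binds an IAM role to a Kubernetes service account through an annotation. The sketch below shows what that could look like for the image-resizer pods; the service account name and the role ARN are hypothetical, and the Deployment would need serviceAccountName: image-resizer for the binding to apply.

```yaml
# Hypothetical service account for the image-resizer pods, bound to an IAM role via IRSA.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: image-resizer
  namespace: processing
  annotations:
    # Assumed role; it needs at least sqs:ReceiveMessage and sqs:DeleteMessage for the consumer.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/image-resizer-sqs
```

Whichever identity the scaler ends up using also needs sqs:GetQueueAttributes on the queue, since that is the call it makes to read the queue depth.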

ScaledJob for Batch Workloads

For tasks that are run-to-completion (video transcoding, batch ML inference, report generation) use ScaledJob instead of ScaledObject. KEDA creates fresh Kubernetes Jobs for the pending work, up to maxReplicaCount, and prunes finished Jobs according to the configured history limits. This avoids head-of-line blocking, where one long-running message holds up the messages queued behind it on the same consumer, and a scale-down never kills a task mid-run because each Job runs to completion.

ScaledJob vs ScaledObject: Use ScaledObject for long-running consumer processes (your normal Kafka worker deployment). Use ScaledJob when each message maps to a discrete, bounded task — especially if processing time is highly variable. Jobs give you better fan-out parallelism and automatic cleanup.
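
A minimal ScaledJob sketch for the transcoding case follows. The job name, container image, and queue URL are hypothetical, and it reuses the sqs-trigger-auth TriggerAuthentication from the SQS section. It assumes each container takes one message, processes it, deletes it, and exits.

```yaml
# Hypothetical ScaledJob: one Job per pending SQS message, up to 50 at a time.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: video-transcode-job
  namespace: processing
spec:
  jobTargetRef:
    backoffLimit: 2                  # retry a failed task at most twice
    activeDeadlineSeconds: 3600      # kill tasks that run longer than 1 h
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: transcoder
          image: registry.example.com/video-transcoder:1.0.0   # assumed image
  pollingInterval: 20
  maxReplicaCount: 50                # upper bound on concurrently running Jobs
  successfulJobsHistoryLimit: 5      # keep the last 5 completed Jobs for inspection
  failedJobsHistoryLimit: 10         # keep more failures around for debugging
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/video-transcode
      queueLength: "1"               # one message per Job
      awsRegion: us-east-1
    authenticationRef:
      name: sqs-trigger-auth
```

Note that ScaledJob uses jobTargetRef (a Job spec) where ScaledObject uses scaleTargetRef, and it has no minReplicaCount or cooldownPeriod to tune for scale-down, since Jobs simply finish.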

Production Failure Modes & Mitigations

KEDA at scale surfaces several non-obvious failure modes that you must account for before going to production:

Metrics Adapter unavailable: If the KEDA metrics server crash-loops or is otherwise unavailable, HPA cannot fetch the external metric and stops adjusting the replica count, so the workload stays at its current size while lag keeps growing or idle pods keep running. Design your consumer to be idempotent and your topic retention to cover the KEDA restart window (typically seconds when liveness probes are configured properly).
Scale-to-zero cold start latency: Going from 0 to 1 replica takes time: scheduling, image pull (if not cached), and your app startup. During that gap messages accumulate. For latency-sensitive paths keep minReplicaCount: 1 and accept the idle cost. For cost-sensitive batch jobs, set pollingInterval: 5 and pre-pull images with a DaemonSet or bake them into the node image.
Partition count ceiling: Within a consumer group, Kafka assigns each partition to at most one consumer, so consumers beyond the partition count sit idle. 10 partitions = 10 maximum parallel consumers regardless of lag. Scale partitions before you need them: a partition increase cannot be reversed, triggers a rebalance, and changes the key-to-partition mapping on keyed topics. One rule of thumb is partitions = 3 × maxReplicas to leave headroom for growth and rebalancing.
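ScaledObject deletion deletes the HPA: KEDA owns the HPA object. If you delete the ScaledObject in an incident (to stop autoscaling), the HPA is deleted with it. The Deployment is then left at whatever replica count it had, or snaps back to its original count if restoreToOriginalReplicaCount is true, and nothing reacts to the queue any more. A manual kubectl scale deployment does not stick while the HPA exists, because the HPA reconciles the count back. Pause the ScaledObject instead with the autoscaling.keda.sh/paused-replicas annotation, which pins the workload at a fixed replica count until you remove it.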
Trigger authentication Secret rotation: If the Secret referenced by TriggerAuthentication is rotated and KEDA keeps using the old credentials, the scaler can no longer read its source and scaling stalls with no obvious symptom on the workload itself. Watch the KEDA operator logs and alert on any non-zero rate of keda_scaler_errors_total.
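
Several of these failure modes come down to "the scaler cannot read its source". One mitigation is the ScaledObject fallback block, which tells KEDA what replica count to hold when a trigger keeps failing. The sketch below applies it to the order-processor example; the threshold and the fallback replica count are assumed values that you would size from your own traffic.

```yaml
# Sketch: hold a safe replica count when the Kafka trigger cannot be read.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: payments
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 80
  fallback:
    failureThreshold: 3            # consecutive failed reads before the fallback applies
    replicas: 20                   # assumed safe capacity while the metric is unavailable
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-brokers.kafka.svc:9092
      consumerGroup: order-processor-group
      topic: orders.created
      lagThreshold: "1000"
    authenticationRef:
      name: kafka-trigger-auth
```

Choosing the fallback count is a trade-off: too low and a credentials failure during peak traffic starves the consumers, too high and every scaler hiccup costs money.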

Do not set maxReplicaCount without node budget math. KEDA will happily request 500 pods. If your Cluster Autoscaler or Karpenter cannot provision nodes fast enough, pods stay Pending and messages keep accumulating. Always align maxReplicaCount with the node budget you covered in Lesson 5, and set a Kubernetes ResourceQuota on the namespace as a hard ceiling.
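
A minimal sketch of such a quota for the payments namespace follows. The quota name and the limit of 100 pods are assumptions: the number leaves some room above a maxReplicaCount of 80 for other pods in the namespace and for rolling-update surge.

```yaml
# Hypothetical hard ceiling on pod count for the namespace KEDA scales into.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-pod-ceiling
  namespace: payments
spec:
  hard:
    pods: "100"        # counts every non-terminal pod in the namespace, not only the scaled Deployment
```

When the quota is hit, new pods are rejected at admission rather than left Pending, which shows up as events on the ReplicaSet and makes the ceiling visible during an incident.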

Observability for KEDA Scalers

KEDA exposes Prometheus metrics on port 8080. Since KEDA 2.9 the scaler metrics below are served by the KEDA operator; the keda-operator-metrics-apiserver service exposes only the adapter's own API server metrics. Key metrics to alert on:

keda_scaler_metrics_value — the current external metric value (queue depth, lag). Graph this alongside replica count to verify the scaler is responsive.
keda_scaler_errors_total — any non-zero rate means the trigger cannot read its source. Alert immediately.
keda_scaled_object_paused — set to 1 when a ScaledObject is paused (useful during maintenance windows; you can pause via annotation).
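
The "alert immediately" guidance for scaler errors can be expressed as a Prometheus alerting rule. This sketch assumes the Prometheus Operator is installed (it provides the PrometheusRule resource); the rule name, the 5-minute windows, and the severity label are illustrative choices.

```yaml
# Hypothetical alert: fire when any KEDA scaler has been erroring for 5 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keda-scaler-alerts
  namespace: keda
spec:
  groups:
  - name: keda.scalers
    rules:
    - alert: KedaScalerErrors
      expr: sum(rate(keda_scaler_errors_total[5m])) > 0
      for: 5m                      # avoid paging on a single transient failed poll
      labels:
        severity: page
      annotations:
        summary: "KEDA scalers are failing to read their event source"
        description: "Autoscaling may be stalled; check the keda-operator logs and trigger credentials."
```

If your Prometheus selects rules by label, add the matching label to metadata.labels or the rule will be ignored.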

At large scale, a common practice is to build a Grafana dashboard per ScaledObject showing: queue depth trend, replica count over time, scale event annotations, and consumer throughput (messages/s processed). This dashboard is your primary tool during capacity reviews (Lesson 9) and incident response.