Capacity Planning & Autoscaling

Cloud Autoscaling Beyond Kubernetes

Kubernetes HPA and Cluster Autoscaler are powerful tools, but a significant portion of production workloads — serverless functions, managed databases, legacy EC2 fleets, App Engine services, and Azure App Service deployments — scale entirely outside the Kubernetes plane. Understanding how cloud-native autoscaling mechanisms work at the infrastructure level is essential for senior engineers designing resilient, cost-efficient systems that span managed services, bare VMs, and container orchestration simultaneously.

This lesson covers the three pillars of cloud autoscaling that operate independently of Kubernetes: dynamic scaling policies (reactive, metric-driven), scheduled scaling (time-based, predictable load), and predictive scaling (ML-driven, anticipatory). We use AWS Auto Scaling Groups (ASG) as the canonical reference — GCP Managed Instance Groups and Azure VM Scale Sets follow the same conceptual model with provider-specific syntax.

Auto Scaling Groups: The Anatomy

An ASG wraps a fleet of EC2 instances with a desired/minimum/maximum capacity envelope and a set of scaling policies. Every scaling action — whether triggered by a CloudWatch alarm, a schedule expression, or a predictive model — ultimately writes to the same DesiredCapacity field. The ASG controller then reconciles actual running instances toward that target, launching or terminating instances across configured Availability Zones.
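
The reconciliation idea above can be sketched in a few lines. This is a simplified model, not AWS's implementation: the `GroupState` class, its `running` field, and the `reconcile` function are hypothetical names introduced for illustration, and the sketch assumes a policy's requested capacity is simply clamped into the min/max envelope.

```python
from dataclasses import dataclass


@dataclass
class GroupState:
    min_size: int
    max_size: int
    desired: int   # what a policy, schedule, or forecast last asked for
    running: int   # instances currently launched (hypothetical field)


def reconcile(state: GroupState) -> int:
    """Return how many instances to launch (>0) or terminate (<0)."""
    # Assumption: the requested capacity is bounded by the envelope.
    target = max(state.min_size, min(state.max_size, state.desired))
    return target - state.running


if __name__ == "__main__":
    # Desired 6, running 4: launch two more.
    print(reconcile(GroupState(min_size=2, max_size=10, desired=6, running=4)))
    # Desired 14 exceeds max 10, and 10 are already running: nothing to do.
    print(reconcile(GroupState(min_size=2, max_size=10, desired=14, running=10)))
```

Whatever the trigger, the only input the controller needs is the target number, which is why all three scaling mechanisms can coexist on one group.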

At big-tech scale, ASGs typically sit behind a Network Load Balancer (NLB) or Application Load Balancer (ALB) target group. When a new instance passes its health checks, the ASG registers it with the target group. Instance warm-up time — the window between "instance started" and "instance accepting traffic" — is a critical parameter that almost all teams get wrong the first time.

Figure: An ASG with five instances (three warm and in service, one warming up, one still launching), all managed by CloudWatch-driven scaling policies.

Dynamic Scaling Policies

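AWS offers three dynamic (reactive) policy types: target tracking, step scaling, and simple scaling. Simple scaling is the legacy form, and AWS recommends step scaling or target tracking in its place, so this lesson covers those two. Production systems often run both on the same group for defense in depth.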

Target Tracking Scaling

The simplest and most recommended starting policy. You declare a target value for a metric (average CPU at 60%, ALB request count per target at 1000 requests per minute, or SQS queue depth per instance) and the ASG continuously adjusts capacity to hold that target. Under the hood, AWS creates and manages the CloudWatch alarms for you and changes capacity roughly in proportion to how far the metric sits from the target.

# Attach a target-tracking policy for average CPU at 60%.
# EC2 Auto Scaling target tracking has no cooldown fields; instance
# warm-up (240 s here is an example value) controls when new instances count.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-api-asg \
  --policy-name cpu-target-60 \
  --policy-type TargetTrackingScaling \
  --estimated-instance-warmup 240 \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 60.0,
    "DisableScaleIn": false
  }'

# Customized metric: ALB RequestCountPerTarget, keyed by the target group's ARN suffix.
# With Statistic=Sum over 1-minute periods, TargetValue is requests per target per minute.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-api-asg \
  --policy-name rps-target-1000 \
  --policy-type TargetTrackingScaling \
  --estimated-instance-warmup 240 \
  --target-tracking-configuration '{
    "CustomizedMetricSpecification": {
      "MetricName": "RequestCountPerTarget",
      "Namespace": "AWS/ApplicationELB",
      "Dimensions": [
        {"Name":"TargetGroup","Value":"targetgroup/my-tg/abc123"}
      ],
      "Statistic": "Sum"
    },
    "TargetValue": 1000.0
  }'
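
To build intuition for what a target-tracking policy asks for, the sketch below applies a plain proportional rule. AWS does not publish its exact algorithm, so treat this as an approximation under the assumption that total load stays constant while capacity changes; `proportional_capacity` is a hypothetical helper, not an AWS API.

```python
import math


def proportional_capacity(current_capacity: int,
                          metric_value: float,
                          target_value: float) -> int:
    """Estimate the capacity that would bring the per-instance metric to target.

    Assumes the total load is fixed, so the per-instance metric scales
    inversely with the number of instances.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    needed = math.ceil(current_capacity * metric_value / target_value)
    return max(1, needed)  # never propose an empty fleet


if __name__ == "__main__":
    # 10 instances averaging 90% CPU against a 60% target.
    print(proportional_capacity(10, 90.0, 60.0))
    # 10 instances averaging 30% CPU against a 60% target.
    print(proportional_capacity(10, 30.0, 60.0))
```

The same arithmetic explains why a low target value buys headroom: at a 60% target, a fleet can absorb roughly a 1.6x load increase before instances saturate.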

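Scale-out speed vs. scale-in speed: React quickly to spikes and release capacity slowly. On an EC2 Auto Scaling group, target tracking and step scaling do not use cooldowns; cooldowns apply to simple scaling and to Application Auto Scaling targets such as ECS services or DynamoDB tables, where ScaleOutCooldown and ScaleInCooldown are the knobs. On an ASG, instance warm-up controls how soon a new instance counts toward the aggregated metric, and target tracking already scales in more conservatively than it scales out. Where you do set cooldowns, keep scale-out short (30–60 s) and scale-in long (300–600 s) so you do not terminate instances that have only just become healthy and started warming caches. The asymmetry is intentional: a few extra instances cost far less than a latency cliff from aggressive scale-in.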

Step Scaling

Step scaling lets you define a piecewise function: if CPU breaches 70%, add 2 instances; if it breaches 85%, add 5; if it breaches 95%, add 10. This is the policy to reach for when your load profile is bursty and you know from experience that target-tracking's gradual adjustments are too slow. You trigger it from a CloudWatch alarm, not from the metric directly.

# 1. Create the step scaling policy first and capture its ARN
#    (the alarm needs the ARN, so the policy has to exist before the alarm).
#    Step bounds are offsets from the alarm threshold (70): 70-85, 85-95, 95+.
POLICY_ARN=$(aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-api-asg \
  --policy-name step-scale-out \
  --policy-type StepScaling \
  --adjustment-type ChangeInCapacity \
  --metric-aggregation-type Average \
  --step-adjustments \
    '[{"MetricIntervalLowerBound":0,"MetricIntervalUpperBound":15,"ScalingAdjustment":2},
      {"MetricIntervalLowerBound":15,"MetricIntervalUpperBound":25,"ScalingAdjustment":5},
      {"MetricIntervalLowerBound":25,"ScalingAdjustment":10}]' \
  --estimated-instance-warmup 90 \
  --query PolicyARN --output text)

# 2. Create the CloudWatch alarm that invokes the policy
aws cloudwatch put-metric-alarm \
  --alarm-name asg-cpu-high \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=AutoScalingGroupName,Value=my-api-asg \
  --statistic Average \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions "$POLICY_ARN"
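
The piecewise mapping is easier to check in code than in JSON. The sketch below mirrors the three step adjustments above, with bounds expressed as offsets from the alarm threshold of 70, a lower bound treated as inclusive and an upper bound as exclusive. The `Step` type and `step_adjustment` function are illustrative names, not part of any AWS SDK.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    lower: Optional[float]  # offset from the alarm threshold, inclusive
    upper: Optional[float]  # offset from the alarm threshold, exclusive; None = unbounded
    adjustment: int         # instances to add (ChangeInCapacity)


THRESHOLD = 70.0
STEPS: List[Step] = [
    Step(0, 15, 2),      # 70% <= CPU < 85%
    Step(15, 25, 5),     # 85% <= CPU < 95%
    Step(25, None, 10),  # CPU >= 95%
]


def step_adjustment(metric_value: float,
                    threshold: float = THRESHOLD,
                    steps: List[Step] = STEPS) -> int:
    """Return the capacity change for a metric value, or 0 if no step matches."""
    offset = metric_value - threshold
    for step in steps:
        above_lower = step.lower is None or offset >= step.lower
        below_upper = step.upper is None or offset < step.upper
        if above_lower and below_upper:
            return step.adjustment
    return 0


if __name__ == "__main__":
    for cpu in (60, 72, 85, 96):
        print(cpu, step_adjustment(cpu))
```

Writing the steps out this way makes gaps and overlaps between intervals obvious before the policy is deployed.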

Alarm evaluation-periods trap: A single-period alarm on a 60-second metric can trigger scaling on one noisy data point. Use at least 2 evaluation periods (2 minutes of sustained breach) for CPU alarms. For latency-sensitive workloads that need a faster reaction, publish a high-resolution custom metric and reduce the alarm period to 10 seconds, but keep 3–5 evaluation periods to separate spikes from sustained load. The built-in EC2 CPUUtilization metric is no finer than 1 minute, and only with detailed monitoring enabled.
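
The effect of the evaluation-periods setting can be seen in a minimal model of alarm evaluation. It assumes the simplest rule, where every one of the last N data points must breach the threshold; `alarm_fires` is a hypothetical helper, and real CloudWatch alarms also support "M out of N" evaluation and missing-data handling that this sketch ignores.

```python
from typing import Sequence


def alarm_fires(datapoints: Sequence[float],
                threshold: float,
                evaluation_periods: int) -> bool:
    """True if the most recent `evaluation_periods` points all breach the threshold."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return False  # not enough data yet
    recent = datapoints[-evaluation_periods:]
    return all(value >= threshold for value in recent)


if __name__ == "__main__":
    noisy = [40.0, 42.0, 95.0]       # one spike at the end
    sustained = [40.0, 72.0, 75.0]   # two consecutive breaches
    print(alarm_fires(noisy, 70.0, 1), alarm_fires(noisy, 70.0, 2))
    print(alarm_fires(sustained, 70.0, 2))
```

With one evaluation period the lone spike triggers a scale-out; with two it does not, while the sustained breach still does.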

Scheduled Scaling

For workloads with predictable load cycles — business-hours API traffic, end-of-day batch jobs, weekly marketing email blasts, daily market-open surges — scheduled scaling is more reliable and cheaper than reactive scaling. You set the desired/min/max capacity at a specific time using cron expressions; no metric, no alarm, no lag.

# Pre-scale before morning traffic. Cron is evaluated in UTC unless --time-zone
# is given, so this fires at 07:45 UTC (02:45 EST), not 07:45 local time.
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-api-asg \
  --scheduled-action-name morning-scale-up \
  --recurrence "45 7 * * MON-FRI" \
  --min-size 10 \
  --max-size 50 \
  --desired-capacity 15

# Scale back down after evening traffic drop
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-api-asg \
  --scheduled-action-name evening-scale-down \
  --recurrence "0 23 * * MON-FRI" \
  --min-size 2 \
  --max-size 50 \
  --desired-capacity 5

# One-time pre-scale for a product launch (ISO 8601 timestamp)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-api-asg \
  --scheduled-action-name launch-day-boost \
  --start-time "2025-09-15T12:00:00Z" \
  --min-size 30 \
  --max-size 100 \
  --desired-capacity 50

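A critical subtlety: a scheduled action is a one-shot write. At the scheduled moment it sets whichever of min, max, and desired capacity you specified, and after that your dynamic policies are free to move DesiredCapacity again within the new bounds. If your reactive policies have already scaled you above the new desired value, the scheduled action will scale you down at that time. Always reason about the interaction between scheduled actions and live policy state, especially for the scale-down half. One common approach is to lower only the minimum in the evening action and let the dynamic policies scale in when load actually drops.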
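
The interaction can be modelled in a few lines. This is a sketch under a stated assumption: when a scheduled action omits the desired capacity, the current desired value is kept and only pulled into the new min/max range. `apply_scheduled_action` is a hypothetical function used for illustration.

```python
from typing import Optional


def apply_scheduled_action(current_desired: int,
                           min_size: int,
                           max_size: int,
                           desired: Optional[int] = None) -> int:
    """Return the desired capacity right after a scheduled action runs."""
    if desired is None:
        # No explicit desired value: keep what the dynamic policies chose.
        desired = current_desired
    # The result always lands inside the new envelope.
    return max(min_size, min(max_size, desired))


if __name__ == "__main__":
    # Reactive policies scaled to 20; the evening action forces desired=5.
    print(apply_scheduled_action(20, min_size=2, max_size=50, desired=5))
    # Same action without a desired value: the fleet stays at 20 until
    # the dynamic policies decide to scale in.
    print(apply_scheduled_action(20, min_size=2, max_size=50))
    # Morning action raising the floor above the current size.
    print(apply_scheduled_action(3, min_size=10, max_size=50))
```

The first case is the one that bites in production: a fixed desired value discards whatever the reactive policies had learned about current load.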

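Timezone awareness: AWS scheduled actions interpret cron expressions in UTC unless you specify a time zone. Document this explicitly in your runbooks: an 08:00 pre-scale for a London-primary service is 08:00 UTC in winter but 07:00 UTC during BST, and an 08:00 New York pre-scale is 13:00 UTC in winter but 12:00 UTC during daylight saving time. Prefer setting the zone explicitly with an IANA name (--time-zone in the CLI, time_zone on Terraform's aws_autoscaling_schedule) so daylight-saving shifts are handled for you, and review any remaining UTC-only schedules around each clock change.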
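
If a schedule has to stay in UTC, the conversion is worth checking with real time-zone data instead of mental arithmetic. The sketch below uses Python's standard `zoneinfo` module; `local_to_utc_cron` is a hypothetical helper, and it assumes the conversion does not cross midnight (if it did, the day-of-week field would need shifting as well).

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo


def local_to_utc_cron(local_time: time, tz_name: str, on: date) -> str:
    """Build a weekday cron expression in UTC for a local wall-clock time.

    The result is only valid for dates that share the UTC offset of `on`.
    """
    local_dt = datetime.combine(on, local_time, tzinfo=ZoneInfo(tz_name))
    utc_dt = local_dt.astimezone(timezone.utc)
    return f"{utc_dt.minute} {utc_dt.hour} * * MON-FRI"


if __name__ == "__main__":
    eight_am = time(8, 0)
    # London: GMT in January, BST in July.
    print(local_to_utc_cron(eight_am, "Europe/London", date(2025, 1, 15)))
    print(local_to_utc_cron(eight_am, "Europe/London", date(2025, 7, 15)))
    # New York: EST in January, EDT in July.
    print(local_to_utc_cron(eight_am, "America/New_York", date(2025, 1, 15)))
    print(local_to_utc_cron(eight_am, "America/New_York", date(2025, 7, 15)))
```

The two outputs per city differ by an hour, which is exactly the drift a UTC-only schedule picks up twice a year.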

Predictive Scaling

AWS Predictive Scaling, generally available as an EC2 Auto Scaling policy since 2021, uses machine learning trained on up to 14 days of your ASG's historical CloudWatch metrics to forecast load and adjust capacity before that load arrives. The critical difference from scheduled scaling is that it learns and adapts: you do not need to rewrite a schedule when your pattern shifts.

Predictive Scaling works in two modes. In Forecast Only mode, it generates forecasts and displays them in the console without taking action, which is useful for building confidence before enabling automation. In Forecast and Scale mode, it acts on the forecast: capacity is raised ahead of each forecast hour, earlier by the scheduling buffer if you set one. Predictive scaling only scales out; releasing capacity afterwards is left to your dynamic policies.

# Enable predictive scaling via Terraform (recommended for IaC)
resource "aws_autoscaling_policy" "predictive" {
  name                   = "predictive-cpu"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    mode                          = "ForecastAndScale"
    scheduling_buffer_time        = 300  # launch 5 minutes ahead of the forecast
    max_capacity_breach_behavior  = "IncreaseMaxCapacity"
    max_capacity_buffer           = 10   # 10% headroom above forecast when it nears or exceeds max

    metric_specification {
      target_value = 60

      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        resource_label         = ""
      }

      predefined_scaling_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
        resource_label         = ""
      }
    }
  }
}

The scheduling_buffer_time of 300 seconds tells the system to launch instances 5 minutes before the forecast load is due, which matters when instance launch plus warm-up takes 3–4 minutes. Setting max_capacity_breach_behavior to IncreaseMaxCapacity lets predictive scaling raise the group above your configured max when the forecast itself approaches or exceeds that max, by up to max_capacity_buffer percent above the forecast capacity. This avoids a hard ceiling that would block a forecast scale-out at exactly the wrong moment, but it does nothing for spikes the model never predicted.
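
Both settings reduce to simple arithmetic, sketched below. The function names are hypothetical, and the model is deliberately narrow: it only raises the ceiling when the forecast is at or above the configured max, and it rounds the buffer up to a whole instance, which is an assumption and not documented AWS behavior.

```python
from datetime import datetime, timedelta


def launch_time(forecast_start: datetime, buffer_seconds: int) -> datetime:
    """When instances should start launching for a forecast window."""
    return forecast_start - timedelta(seconds=buffer_seconds)


def buffer_is_sufficient(buffer_seconds: int,
                         launch_seconds: int,
                         warmup_seconds: int) -> bool:
    """True if the buffer covers instance launch plus warm-up."""
    return buffer_seconds >= launch_seconds + warmup_seconds


def effective_max(forecast_capacity: int, max_size: int, buffer_percent: int) -> int:
    """Ceiling used when the forecast reaches or exceeds the configured max."""
    if forecast_capacity < max_size:
        return max_size
    # Integer ceiling of forecast * buffer_percent / 100 (avoids float rounding).
    extra = (forecast_capacity * buffer_percent + 99) // 100
    return forecast_capacity + extra


if __name__ == "__main__":
    print(launch_time(datetime(2025, 9, 15, 9, 0), 300))
    print(buffer_is_sufficient(300, launch_seconds=90, warmup_seconds=150))
    print(effective_max(forecast_capacity=50, max_size=40, buffer_percent=10))
```

If `buffer_is_sufficient` returns False for your measured launch and warm-up times, the forecast is arriving too late to help and the buffer should be raised.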

Combining All Three: The Production Pattern

At companies running significant traffic, all three mechanisms operate simultaneously in a hierarchy. Predictive scaling handles the baseline forecast, keeping your fleet pre-warmed. Scheduled scaling covers known discrete events — product launches, marketing blasts, market-open windows — that are too sharp and specific for the ML model. Dynamic target-tracking absorbs the residual variance that neither scheduled nor predictive anticipated. Step scaling acts as a circuit-breaker for sudden, extreme spikes.
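
When several policies are active at once, a useful mental model is that the group follows whichever one asks for the most capacity, bounded by min and max. The sketch below encodes that rule; `resolve_desired` and the policy names in the example are illustrative, and scheduled actions are left out because they write capacity at a point in time and do not propose it continuously.

```python
from typing import Dict, Tuple


def resolve_desired(proposals: Dict[str, int],
                    min_size: int,
                    max_size: int) -> Tuple[str, int]:
    """Pick the policy proposing the largest capacity and clamp its value."""
    if not proposals:
        raise ValueError("at least one proposal is required")
    winner = max(proposals, key=lambda name: proposals[name])
    capacity = max(min_size, min(max_size, proposals[winner]))
    return winner, capacity


if __name__ == "__main__":
    proposals = {
        "predictive": 12,       # baseline forecast
        "target_tracking": 15,  # residual variance
        "step": 20,             # sudden spike
    }
    print(resolve_desired(proposals, min_size=2, max_size=50))
    # With a tighter ceiling, the spike response is capped at max_size.
    print(resolve_desired(proposals, min_size=2, max_size=18))
```

Because the largest proposal wins, adding a policy can only make the fleet larger at any given moment, so scale-in speed is set by the slowest policy to let go.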

Warm-up is your hidden bottleneck: At Google, Amazon, and Meta scale, the instance warm-up period (the time between "instance started" and "instance fully caching, JIT-warmed, and accepting production traffic without elevated error rates") is often 3–8 minutes for JVM services and 1–2 minutes for Go/Node services. Set DefaultInstanceWarmup to the value you actually measured. If you leave it unset, scaling policies fall back to the group's default cooldown, which is 300 seconds unless you changed it. Get this wrong and your aggregated CloudWatch metrics include not-yet-warm instances, making your scaling signals noisy and causing premature scale-in.
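
A small example shows how much a single cold instance can distort the signal. The `Instance` type and `aggregate_cpu` function are hypothetical; the sketch assumes instances are excluded from the average until they have been in service for the full warm-up period.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    cpu: float               # current CPU utilization, percent
    seconds_in_service: int  # time since the instance entered service


def aggregate_cpu(instances: List[Instance], warmup_seconds: int) -> Optional[float]:
    """Average CPU over instances that have finished warming up."""
    warm = [i.cpu for i in instances if i.seconds_in_service >= warmup_seconds]
    if not warm:
        return None  # nothing is warm yet, so there is no usable signal
    return sum(warm) / len(warm)


if __name__ == "__main__":
    fleet = [
        Instance(cpu=80.0, seconds_in_service=600),
        Instance(cpu=80.0, seconds_in_service=900),
        Instance(cpu=5.0, seconds_in_service=30),  # just launched, still cold
    ]
    print(aggregate_cpu(fleet, warmup_seconds=240))  # cold instance excluded
    print(aggregate_cpu(fleet, warmup_seconds=0))    # cold instance included
```

With the cold instance included, the average drops below a 60% target even though the warm instances are running hot, which is the premature scale-in signal described above.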

Equivalent Services on Other Clouds

GCP Managed Instance Groups use Autoscaler resources with autoscalingPolicy blocks that cover the same ground: a cpuUtilization target, loadBalancingUtilization, and scalingSchedules, plus an optional CPU-based predictive method. Azure VM Scale Sets use Autoscale settings with metric-based rules and fixedDate or recurrence profiles for scheduled scaling. Azure's counterpart to AWS Predictive Scaling is predictive autoscale in Azure Monitor, which forecasts from historical CPU usage on VM Scale Sets; check its current availability and limits before relying on it. For multi-cloud fleets, Terraform lets you keep one IaC workflow across providers, though the semantics of cooldowns and evaluation windows differ enough to warrant per-cloud tuning.