Resilience, Messaging & Observability

Rate Limiting

18 min Lesson 4 of 12

A circuit breaker protects your service when a downstream dependency fails. Rate limiting solves the complementary problem: protecting your own service from being overwhelmed by too many incoming requests, whether from legitimate traffic spikes, misbehaving clients, or deliberate denial-of-service attempts. Without rate limiting, a single runaway client can saturate your thread pool, exhaust your database connection pool, and take down a service that would otherwise have been perfectly healthy.

Why Rate Limiting Belongs in the Resilience Toolkit

In a distributed system, individual microservices are exposed — directly or through an API gateway — to many callers at once. Each service has finite resources: CPU, memory, database connections, and outbound network capacity. Rate limiting enforces a contract: no single client (or class of clients) may consume more than a defined share of those resources per unit of time. The benefits go beyond raw availability:

Fairness: one slow or greedy client cannot starve others.
Cost control: cloud resources scale with traffic; unbounded load means unbounded cost.
Security: brute-force, credential stuffing, and scraping attacks all rely on high request volume — a tight rate limit makes them impractical.
Predictable SLOs: when you know the maximum arrival rate, you can size your infrastructure confidently.

Common Rate-Limiting Algorithms

Before writing code it helps to understand the algorithms, because they have meaningfully different behaviour under bursty traffic:

Fixed Window: count requests in a calendar-aligned window (e.g. 100 requests per minute, reset at :00). Simple, but allows a 2× burst at window boundaries — 100 at 0:59 and 100 more at 1:00.
Sliding Window (log or counter): considers the rolling past N seconds. Smoother, eliminates the boundary burst, but costs more memory.
Token Bucket: a bucket of capacity N fills at a fixed rate; each request consumes one token. Allows natural short bursts up to bucket capacity, then enforces the average rate. One of the most widely used algorithms in practice.
Leaky Bucket: requests enter a queue and drain at a fixed rate. Smooths bursts entirely — useful for protecting downstream services that cannot absorb spikes.

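Token bucket is the default in many API gateways, including the Redis-backed limiter in Spring Cloud Gateway shown later in this lesson. It permits short bursts (good user experience on retries) while enforcing a long-run average. Resilience4j's default RateLimiter works differently: it divides time into fixed refresh cycles and resets the number of available permissions at the start of each cycle, which is closer to a fixed window than to a token bucket. Use leaky bucket only when you need perfectly metered outbound calls to a third-party API with strict per-second limits.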
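The token bucket described above can be sketched in a few lines. This is a minimal, single-process illustration rather than any library's implementation; the class name TokenBucket and the injectable nanoClock are assumptions made so the behaviour can be exercised without sleeping.

```java
import java.util.function.LongSupplier;

/** Minimal token bucket: holds up to `capacity` tokens, refilled continuously at `refillPerSecond`. */
final class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private final LongSupplier nanoClock; // hypothetical injectable clock, e.g. System::nanoTime
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Takes one token if available; never blocks. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Credit tokens for the time elapsed since the last call, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With a capacity of 3 and a refill rate of 1 token per second, a client can send three requests back to back, after which it is held to roughly one request per second until the bucket has had time to fill again.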

Rate Limiting with Resilience4j

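Resilience4j ships a RateLimiter module that integrates naturally with the rest of its resilience primitives. Add the Spring Boot starter to your pom.xml. The @RateLimiter annotation is applied through Spring AOP, so spring-boot-starter-aop must also be on the classpath, and spring-boot-starter-actuator is needed if you want the metrics discussed later in this lesson: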

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>

Configure the rate limiter in application.yml:

resilience4j:
  ratelimiter:
    instances:
      orderService:
        limit-for-period: 50          # permissions available in each refresh period
        limit-refresh-period: 1s      # length of one period; permissions reset when it ends
        timeout-duration: 0ms         # how long a thread waits for a permission (0 = fail-fast)
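To make the three settings concrete, here is a simplified model of how they interact, based on the Resilience4j documentation's description of time being split into refresh cycles. It is a sketch for building intuition, not the library's code; CycleLimiter and its injectable clock are hypothetical names.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Simplified model: each refresh cycle starts with a fresh allowance of `limitForPeriod` permissions. */
final class CycleLimiter {
    private final int limitForPeriod;
    private final long cycleNanos;
    private final LongSupplier nanoClock; // hypothetical injectable clock, e.g. System::nanoTime
    private long currentCycle;
    private int remaining;

    CycleLimiter(int limitForPeriod, Duration limitRefreshPeriod, LongSupplier nanoClock) {
        this.limitForPeriod = limitForPeriod;
        this.cycleNanos = limitRefreshPeriod.toNanos();
        this.nanoClock = nanoClock;
        this.currentCycle = nanoClock.getAsLong() / cycleNanos;
        this.remaining = limitForPeriod;
    }

    /** Models timeout-duration: 0ms: take a permission or fail immediately, never wait. */
    synchronized boolean tryAcquire() {
        long cycle = nanoClock.getAsLong() / cycleNanos;
        if (cycle != currentCycle) { // a new cycle has begun: reset the allowance
            currentCycle = cycle;
            remaining = limitForPeriod;
        }
        if (remaining > 0) {
            remaining--;
            return true;
        }
        return false;
    }
}
```

Note that unused permissions do not carry over in this model: an idle service still gets exactly limit-for-period permissions in the next cycle, not more.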

Apply it to your service method with the @RateLimiter annotation:

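import io.github.resilience4j.ratelimiter.RequestNotPermitted;
import io.github.resilience4j.ratelimiter.annotation.RateLimiter;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Service;
import org.springframework.web.server.ResponseStatusException;

@Service
public class OrderService {

    // OrderRepository, OrderRequest and OrderResponse are application types assumed to exist
    private final OrderRepository orderRepository;

    public OrderService(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    @RateLimiter(name = "orderService", fallbackMethod = "rateLimitFallback")
    public OrderResponse placeOrder(OrderRequest request) {
        // actual business logic: DB write, downstream call, etc.
        return orderRepository.save(request.toEntity()).toResponse();
    }

    // Called when the limiter rejects the call; takes the original
    // parameters plus the exception and must return the same type
    private OrderResponse rateLimitFallback(OrderRequest request, RequestNotPermitted ex) {
        throw new ResponseStatusException(
            HttpStatus.TOO_MANY_REQUESTS,
            "Order rate limit exceeded. Please retry in a moment."
        );
    }
}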

Set timeout-duration: 0ms explicitly in synchronous REST services; the default is 5 seconds. A non-zero timeout causes the calling thread to block while it waits for a permission. Under sustained overload every request thread ends up blocked, the thread pool fills, and the service is effectively down, which is exactly what you were trying to prevent. Fail fast, return 429 Too Many Requests, and let the client back off.

Exposing a Proper 429 Response

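RFC 6585 defines status code 429 Too Many Requests for rate-limit responses. Well-behaved clients and API gateways understand this code. Include a Retry-After header so clients know when to try again. The handler below applies to rate-limited methods that do not declare a fallbackMethod, where RequestNotPermitted propagates to the web layer; the placeOrder fallback above already converts the exception itself: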

import io.github.resilience4j.ratelimiter.RequestNotPermitted;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

@RestControllerAdvice
public class RateLimitExceptionHandler {

    // Minimal response payload; replace with your API's standard error type
    public record ErrorBody(String code, String message) {}

    @ExceptionHandler(RequestNotPermitted.class)
    public ResponseEntity<ErrorBody> handleRateLimit(RequestNotPermitted ex) {
        HttpHeaders headers = new HttpHeaders();
        // Delay in seconds; matches limit-refresh-period: 1s, keep the two in sync
        headers.set(HttpHeaders.RETRY_AFTER, "1");

        return ResponseEntity
            .status(HttpStatus.TOO_MANY_REQUESTS)
            .headers(headers)
            .body(new ErrorBody("rate_limit_exceeded",
                                "You have exceeded the allowed request rate."));
    }
}
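The hard-coded "1" above only holds for a one-second refresh period. If the period is longer, the header value can be derived instead. The helper below is a hypothetical sketch: it assumes the caller knows how far into the current refresh period the request arrived, rounds up to whole seconds (Retry-After takes an integer delay), and never returns less than 1.

```java
import java.time.Duration;

final class RetryAfter {
    private RetryAfter() {}

    /**
     * Whole seconds until the next refresh period starts, rounded up and never below 1,
     * formatted for use as a Retry-After delay-seconds value.
     */
    static String headerValue(Duration elapsedInPeriod, Duration refreshPeriod) {
        long remainingNanos = refreshPeriod.minus(elapsedInPeriod).toNanos();
        long seconds = (remainingNanos + 999_999_999L) / 1_000_000_000L; // ceiling division
        return Long.toString(Math.max(1L, seconds));
    }
}
```

Rounding up errs on the side of telling the client to wait slightly too long, which is preferable to inviting a retry that is rejected again.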

Per-User vs. Global Rate Limiting

A single global rate limiter protects the service as a whole, but it does not prevent one user from monopolising the shared budget. Production systems almost always combine both layers:

Global limiter — guards total service throughput (e.g. 10 000 req/s across all callers). Sits at the API gateway or load balancer level.
Per-user / per-key limiter — enforces fairness (e.g. 100 req/s per API key). Implemented in the service or at the gateway with a distributed store.

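Resilience4j's in-process RateLimiter is not partitioned by caller by default, and each instance of a horizontally scaled service keeps its own count. Per-user limiting needs one limiter per key, and enforcing that limit across the whole fleet needs a shared counter store, typically Redis. Spring Cloud Gateway has built-in Redis rate limiting via the RequestRateLimiter filter, which requires the reactive Redis starter (spring-boot-starter-data-redis-reactive) on the classpath: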

# application.yml for Spring Cloud Gateway
spring:
  cloud:
    gateway:
      routes:
        - id: order-service
          uri: lb://order-service
          predicates:
            - Path=/api/orders/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 10   # tokens added per second
                redis-rate-limiter.burstCapacity: 20   # max burst
                redis-rate-limiter.requestedTokens: 1
                key-resolver: "#{@userKeyResolver}"    # Spring bean that extracts the key

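import java.net.InetSocketAddress;

import org.springframework.cloud.gateway.filter.ratelimit.KeyResolver;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import reactor.core.publisher.Mono;

@Configuration
public class RateLimitConfig {

    // Partition by the "X-User-Id" header; fall back to IP address.
    // Assumes the header is set by a trusted upstream after authentication,
    // not copied from the client; otherwise callers can pick their own bucket.
    @Bean
    public KeyResolver userKeyResolver() {
        return exchange -> {
            String userId = exchange.getRequest().getHeaders()
                                    .getFirst("X-User-Id");
            if (userId != null) return Mono.just(userId);
            // getRemoteAddress() may return null, so guard before dereferencing
            InetSocketAddress remote = exchange.getRequest().getRemoteAddress();
            return Mono.just(remote != null && remote.getAddress() != null
                    ? remote.getAddress().getHostAddress()
                    : "unknown");
        };
    }
}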

Do not rely solely on client IP for rate limiting in production. IP addresses are shared (NAT, corporate proxies), and forwarded-address headers such as X-Forwarded-For can be forged unless a trusted proxy sets them. Use an authenticated identifier (API key, JWT subject, user ID) as the primary partition key. IP-based limiting is a useful secondary layer for unauthenticated endpoints (login, registration) where no authenticated identity exists yet.
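The partitioning idea itself is independent of Redis. The sketch below keeps one fixed-window counter per key in memory and prefers an authenticated identity over the client IP when choosing the key. It is a single-instance illustration only: PerKeyLimiter, partitionKey, and the "user:" / "ip:" prefixes are assumptions, the counts are not shared between instances, and a real version would also need to evict idle keys.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Per-key fixed-window limiter: every partition key gets its own counter. In-process only. */
final class PerKeyLimiter {
    private static final class Window {
        long index; // which window the count belongs to
        int count;  // requests seen in that window
    }

    private final int limitPerWindow;
    private final long windowMillis;
    private final LongSupplier millisClock; // hypothetical injectable clock, e.g. System::currentTimeMillis
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    PerKeyLimiter(int limitPerWindow, long windowMillis, LongSupplier millisClock) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
        this.millisClock = millisClock;
    }

    /** Prefer the authenticated identity; fall back to the client IP for anonymous calls. */
    static String partitionKey(String authenticatedUserId, String clientIp) {
        return (authenticatedUserId != null && !authenticatedUserId.isBlank())
                ? "user:" + authenticatedUserId
                : "ip:" + clientIp;
    }

    boolean tryAcquire(String key) {
        long nowIndex = millisClock.getAsLong() / windowMillis;
        boolean[] allowed = {false};
        // compute() runs atomically per key, so concurrent callers cannot double-count.
        windows.compute(key, (k, w) -> {
            if (w == null) w = new Window();
            if (w.index != nowIndex) { // first request of a new window: reset the count
                w.index = nowIndex;
                w.count = 0;
            }
            if (w.count < limitPerWindow) {
                w.count++;
                allowed[0] = true;
            }
            return w;
        });
        return allowed[0];
    }
}
```

Because each key has its own counter, one caller exhausting its allowance has no effect on the others, which is the fairness property a single global limiter cannot provide.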

Rate Limiting as a Security Control

Rate limiting is not only an availability mechanism — it is a first-line security control. Consider these attack patterns and how rate limiting mitigates them:

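Credential stuffing: attackers replay millions of stolen username/password pairs. A limit of 5 failed logins per IP per minute caps a single source at 7 200 attempts per day, which makes working through a large list impractical without a large botnet. Add a per-account limit as well to cover attempts spread across many IPs.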
SMS / email OTP abuse: sending verification codes costs money. Rate-limit code-sending endpoints aggressively (e.g. 3 per hour per phone number).
Data scraping: competitors or bots hit your search or catalogue endpoints at machine speed. Per-user rate limiting combined with a captcha on threshold breach raises the cost of automated scraping considerably, although it does not stop a determined scraper outright.
Resource-intensive endpoints: endpoints that trigger heavy computation (report generation, bulk export) should carry a much tighter limit than simple reads.
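Limits such as "5 failed logins per minute" or "3 codes per hour per phone number" are usually expressed over a rolling window, so that the boundary burst of a fixed window cannot be exploited. A minimal sliding-window log could look like the sketch below; SlidingWindowLimiter and tryRecord are hypothetical names, the state is in memory on one instance, and the caller passes the current time explicitly.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Sliding-window log: allows at most `maxEvents` per key within any rolling window. */
final class SlidingWindowLimiter {
    private final int maxEvents;
    private final long windowMillis;
    private final Map<String, Deque<Long>> log = new HashMap<>();

    SlidingWindowLimiter(int maxEvents, long windowMillis) {
        this.maxEvents = maxEvents;
        this.windowMillis = windowMillis;
    }

    /** Records the event at `nowMillis` and returns true if it is within the limit. */
    synchronized boolean tryRecord(String key, long nowMillis) {
        Deque<Long> timestamps = log.computeIfAbsent(key, k -> new ArrayDeque<>());
        // Drop entries that have fallen out of the rolling window.
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() >= windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= maxEvents) {
            return false; // limit reached; the rejected attempt is not logged
        }
        timestamps.addLast(nowMillis);
        return true;
    }
}
```

For the login example, new SlidingWindowLimiter(5, 60_000) keyed by IP address (or by account name) would be consulted on each failed attempt. The memory cost is one timestamp per permitted event per key, which is the trade-off the algorithm comparison earlier in the lesson refers to.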

Observability: Monitoring Your Rate Limiter

A rate limiter you cannot observe is a rate limiter you cannot tune. With Spring Boot Actuator on the classpath, Resilience4j publishes metrics through Micrometer. Key metrics to watch:

resilience4j.ratelimiter.available.permissions: permissions remaining in the current refresh period; drops to zero under load.
resilience4j.ratelimiter.waiting_threads: threads blocked waiting for a permission (should be zero with timeout-duration: 0ms).

Set up alerts when available.permissions is consistently zero. That means callers are being rejected continuously: either the limit is too tight for legitimate traffic, in which case raise it or optimise the hot path, or a client is misbehaving and the limiter is doing its job. Break down 429 responses by caller to tell the two apart.
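"Consistently zero" needs a concrete definition to be alertable. In practice this rule would normally live in your monitoring system rather than in application code, but the logic is simple enough to sketch: fire only after the sampled value has been zero for a number of consecutive samples, so a momentary dip does not page anyone. ExhaustionDetector and its threshold are hypothetical.

```java
/** Flags sustained exhaustion: fires once the sampled value has been zero for N consecutive samples. */
final class ExhaustionDetector {
    private final int threshold;
    private int consecutiveZeros;

    ExhaustionDetector(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Feed one sample of available permissions; returns true while the alert condition holds. */
    boolean sample(int availablePermissions) {
        // Any non-zero sample resets the streak, so only sustained exhaustion triggers.
        consecutiveZeros = (availablePermissions <= 0) ? consecutiveZeros + 1 : 0;
        return consecutiveZeros >= threshold;
    }
}
```

Sampling every 10 seconds with a threshold of 6, for example, would raise the alert after a full minute of exhaustion.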

Summary

Rate limiting is the contract between your service and its callers: a defined ceiling on request rate that protects resources, ensures fairness, and hardens the service against abuse. In Spring Boot services, Resilience4j's @RateLimiter with timeout-duration: 0ms provides fast, observable in-process limiting. For distributed, per-user limiting across a horizontally scaled fleet, Spring Cloud Gateway's Redis-backed RequestRateLimiter is a common choice. Always return 429 Too Many Requests with a Retry-After header, and monitor permission availability in your dashboards. In the next lesson we move from protecting inbound traffic to designing for asynchrony with messaging systems.