Resilience, Messaging & Observability

The Circuit Breaker Pattern

In a distributed system, remote calls can fail. A downstream service might be slow, unreachable, or returning errors. Without a protection mechanism, a single misbehaving dependency can cascade: threads pile up waiting for a timeout, the thread pool is exhausted, and your service dies too — even though your code was perfectly fine. The circuit breaker is the standard solution to this problem.

How a Circuit Breaker Works

A circuit breaker wraps a remote call and tracks its outcomes. It operates in three states:

CLOSED — The normal operating state. Calls pass through. The breaker counts failures within a sliding window.
OPEN — The failure threshold has been breached. The breaker immediately rejects all calls with a CallNotPermittedException, without even trying the remote service. This gives the downstream system time to recover.
HALF_OPEN — After a configurable wait duration the breaker allows a limited number of probe calls through. If they succeed, it transitions back to CLOSED. If they fail, it returns to OPEN.
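
The three states above can be sketched as a small state machine in plain Java. This is an illustrative simplification, not how Resilience4j is implemented: the class name SimpleBreaker, its RejectedException, and all constructor parameters are hypothetical, it evaluates calls in fixed batches rather than a true sliding window, and it does not cap concurrent probe calls in HALF_OPEN.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified breaker for illustration; not Resilience4j's implementation. */
class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Stand-in for the rejection a real breaker throws while OPEN. */
    static class RejectedException extends RuntimeException {
        RejectedException() { super("call not permitted"); }
    }

    private final int windowSize;      // calls evaluated per batch while CLOSED
    private final int thresholdPct;    // trip when failure rate >= this percentage
    private final long waitMillis;     // how long to stay OPEN before probing
    private final int probeCalls;      // calls evaluated while HALF_OPEN
    private final LongSupplier clock;  // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleBreaker(int windowSize, int thresholdPct, long waitMillis,
                  int probeCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.thresholdPct = thresholdPct;
        this.waitMillis = waitMillis;
        this.probeCalls = probeCalls;
        this.clock = clock;
    }

    /** Current state; moves OPEN to HALF_OPEN once the wait has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitMillis) {
            transitionTo(State.HALF_OPEN);
        }
        return state;
    }

    /** Runs the call unless the breaker is OPEN, then records the outcome. */
    <T> T call(Supplier<T> remote) {
        if (state() == State.OPEN) {
            throw new RejectedException(); // fail fast: the remote is never invoked
        }
        try {
            T result = remote.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            throw e;
        }
    }

    private synchronized void record(boolean failed) {
        if (state == State.OPEN) {
            return; // late result from a call that started before the trip
        }
        calls++;
        if (failed) {
            failures++;
        }
        int limit = (state == State.HALF_OPEN) ? probeCalls : windowSize;
        if (calls < limit) {
            return; // not enough outcomes yet to judge
        }
        boolean unhealthy = failures * 100 >= thresholdPct * calls;
        transitionTo(unhealthy ? State.OPEN : State.CLOSED);
    }

    private void transitionTo(State next) {
        state = next;
        calls = 0;
        failures = 0;
        if (next == State.OPEN) {
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        SimpleBreaker breaker = new SimpleBreaker(4, 50, 1_000, 2, System::currentTimeMillis);
        for (int i = 0; i < 6; i++) {
            try {
                breaker.call(() -> { throw new IllegalStateException("downstream error"); });
            } catch (RuntimeException e) {
                System.out.println(e.getClass().getSimpleName() + " -> " + breaker.state());
            }
        }
    }
}
```

The clock is injected so the OPEN to HALF_OPEN transition can be exercised without sleeping. A production breaker also needs a real sliding window and thread-safe probe accounting, which is what the library provides.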

The key insight: When a downstream service is broken, failing fast is better for the caller than waiting for a timeout. With a 30-second timeout and 50 concurrent requests, a single round of calls to a dead service ties up 50 threads for 30 seconds each, or 1,500 thread-seconds in total. A circuit breaker converts that to an instant rejection, freeing resources for requests that can succeed.

Resilience4j: The Modern Java Choice

Netflix Hystrix, once the standard, is in maintenance mode. Resilience4j is a lightweight, functional library inspired by Hystrix, with a minimal dependency footprint. The 2.x line used in this lesson requires Java 17, which is also the baseline for Spring Boot 3. Spring Cloud CircuitBreaker provides an abstraction layer, but it is common and acceptable to use the Resilience4j annotations and configuration directly in a Spring Boot 3 application, and that is what we will do here.

Add the starter to your pom.xml:

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>
<!-- AOP is required for annotations to work -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

Configuring the Circuit Breaker

Breakers are configured by name in application.yml. Each name corresponds to one or more annotated methods in your code.

resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        # Use COUNT_BASED or TIME_BASED sliding window
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10          # evaluate last 10 calls
        failureRateThreshold: 50       # open if >=50% fail
        slowCallDurationThreshold: 2s  # calls > 2s count as slow
        slowCallRateThreshold: 80      # open if >=80% are slow
        waitDurationInOpenState: 10s   # stay OPEN for 10s before probing
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        # Which exceptions count as failures
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - feign.FeignException
        # Which exceptions are IGNORED (not counted)
        ignoreExceptions:
          - com.example.shop.exception.BusinessValidationException

Tune slidingWindowSize with care. With a 50% threshold, a window of 5 opens the breaker after just 3 failures, which is too trigger-happy for a service that has bursts of legitimate errors. A window of 100 needs 50 failures before it trips, which makes the breaker slow to react. For most services, 10–20 with a 50% threshold is a sensible starting point.
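
The arithmetic behind those numbers can be checked with a short helper. This is a sketch under two assumptions: the count-based window is already full, and the threshold is a whole-number percentage compared with "greater than or equal", as in the YAML comments above. The class and method names are hypothetical.

```java
/** Hypothetical helper for reasoning about window sizes; not a Resilience4j API. */
class WindowMath {

    /**
     * Smallest number of failures in a full count-based window that
     * reaches the given failure-rate threshold (in percent).
     */
    static int failuresToTrip(int windowSize, int failureRateThresholdPct) {
        // Integer form of ceil(windowSize * pct / 100)
        return (windowSize * failureRateThresholdPct + 99) / 100;
    }

    public static void main(String[] args) {
        for (int window : new int[] {5, 10, 20, 100}) {
            System.out.printf("window=%d -> %d failures to trip at 50%%%n",
                window, failuresToTrip(window, 50));
        }
    }
}
```

A small window reacts quickly but treats a short burst as an outage, while a large window absorbs bursts at the cost of a slower reaction.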

Applying the Annotation

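Annotate the method that makes the remote call. The name must match an instance key in your YAML. The fallbackMethod is called whenever the breaker is OPEN (with a CallNotPermittedException as the cause) or the call itself throws an exception that matches the fallback's exception parameter. That includes exceptions listed under ignoreExceptions: they are not counted as failures, but they still reach the fallback.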

package com.example.shop.order;

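import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestClient;

@Service
public class PaymentClient {

    // Logger used by the fallback below
    private static final Logger log = LoggerFactory.getLogger(PaymentClient.class);

    private final RestClient restClient;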

    public PaymentClient(RestClient.Builder builder) {
        this.restClient = builder
            .baseUrl("http://payment-service")
            .build();
    }

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResponse charge(PaymentRequest request) {
        return restClient.post()
            .uri("/api/v1/charge")
            .body(request)
            .retrieve()
            .body(PaymentResponse.class);
    }

    // The fallback must have the SAME return type and the same parameters
    // PLUS a Throwable as the last parameter.
    private PaymentResponse paymentFallback(PaymentRequest request, Throwable cause) {
        // Log the cause for observability
        log.warn("Payment service unavailable, returning queued response. Cause: {}", cause.getMessage());
        // Return a safe degraded response
        return PaymentResponse.queued(request.getOrderId());
    }
}

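The fallback method signature must be compatible with the protected method. It needs the same return type, the same parameters, and an exception parameter (Throwable or a more specific type) appended as the final argument. If no method with the given name and a compatible signature exists, Resilience4j cannot resolve the fallback and the call fails with a NoSuchMethodException instead of returning the degraded response. If the exception parameter is narrower than the exception actually thrown, the fallback is skipped and the original exception propagates, which can be baffling to debug.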

State Transitions in Practice

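Resilience4j publishes circuit breaker events you can subscribe to for logging or metrics. In Spring Boot, when spring-boot-starter-actuator is present, the current state of each breaker is available at /actuator/circuitbreakers and recent events at /actuator/circuitbreakerevents, provided those endpoints are exposed through management.endpoints.web.exposure.include.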

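import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import jakarta.annotation.PostConstruct;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class CircuitBreakerObserver {

    // Logger used by the event listeners below
    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerObserver.class);

    private final CircuitBreakerRegistry registry;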

    public CircuitBreakerObserver(CircuitBreakerRegistry registry) {
        this.registry = registry;
    }

    @PostConstruct
    public void registerListeners() {
        CircuitBreaker cb = registry.circuitBreaker("paymentService");

        cb.getEventPublisher()
            .onStateTransition(event ->
                log.info("Circuit breaker [{}] transitioned from {} to {}",
                    event.getCircuitBreakerName(),
                    event.getStateTransition().getFromState(),
                    event.getStateTransition().getToState()));

        cb.getEventPublisher()
            .onCallNotPermitted(event ->
                log.warn("Call rejected — circuit is OPEN for [{}]",
                    event.getCircuitBreakerName()));
    }
}

Security and Distributed-Systems Trade-offs

Circuit breakers have important security implications that are easy to overlook:

Authentication bypass risk: A fallback that returns a permissive or empty response must never bypass authorization checks. If your fallback silently returns "payment accepted" when the real service is down, an attacker can exploit the outage. Fallbacks should return safe, explicit failure states — never silent success.
Information leakage: The CallNotPermittedException and its message should not be propagated to the API client. Translate it at the controller layer into a generic 503 response.
Consistency: An OPEN breaker that rejects writes while a downstream service recovers can leave your database in a partially-committed state. Design your fallback with idempotency and compensating transactions in mind.
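
One way to make a fallback return a safe, explicit failure state is to model the outcome as a type whose degraded value can never be mistaken for success. The sketch below is plain Java with hypothetical names (FallbackSafety, ChargeOutcome, Status, mayShip); it is not the PaymentResponse type from this lesson, and the idempotency key format is an assumption chosen for illustration.

```java
import java.io.IOException;

/** Hypothetical types for illustration; not part of the lesson's PaymentClient. */
class FallbackSafety {

    enum Status { CHARGED, QUEUED, FAILED }

    /** The idempotency key lets a later retry of the same charge be deduplicated. */
    record ChargeOutcome(Status status, String idempotencyKey) {}

    /** Fallback path: never reports CHARGED, because no money has moved. */
    static ChargeOutcome fallback(String orderId, Throwable cause) {
        return new ChargeOutcome(Status.QUEUED, "charge-" + orderId);
    }

    /** Downstream steps gate on a confirmed payment only. */
    static boolean mayShip(ChargeOutcome outcome) {
        return outcome.status() == Status.CHARGED;
    }

    public static void main(String[] args) {
        ChargeOutcome degraded =
            fallback("order-42", new IOException("payment-service unreachable"));
        System.out.println(degraded.status() + " mayShip=" + mayShip(degraded));
    }
}
```

Because every consumer has to check the status, an outage cannot silently turn into "payment accepted", and the stable key gives a compensating or retrying process something to deduplicate on.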

A circuit breaker is not a substitute for timeouts. Configure both: a timeout on the RestClient or Feign client to bound how long a single call waits, and a circuit breaker to stop making calls once a threshold of failures accumulates. They operate at different time scales and complement each other.
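
To show what bounding a single call looks like, here is a minimal sketch using the JDK's built-in java.net.http.HttpClient as a stand-in for RestClient or Feign, whose timeout settings live on their own request factories or configuration. The class name TimeoutSketch and the 1-second and 3-second values are assumptions for illustration, not recommendations.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical sketch: per-call timeouts with the JDK HTTP client. */
class TimeoutSketch {

    /** Bounds how long establishing a connection may take. */
    static HttpClient client() {
        return HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))
            .build();
    }

    /** Bounds how long this one request may wait for a response. */
    static HttpRequest chargeRequest(String jsonBody) {
        return HttpRequest.newBuilder(URI.create("http://payment-service/api/v1/charge"))
            .timeout(Duration.ofSeconds(3))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }

    public static void main(String[] args) {
        // Only builds the objects; no network call is made here.
        System.out.println("connect timeout: " + client().connectTimeout().orElseThrow());
        System.out.println("request timeout: " + chargeRequest("{}").timeout().orElseThrow());
    }
}
```

With this client, an expired timeout surfaces as an HttpTimeoutException, which is a subclass of IOException, so under the recordExceptions list shown earlier it would count toward the breaker's failure rate.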

Summary

The circuit breaker pattern protects your service from cascading failures by failing fast when a downstream dependency is unhealthy. Resilience4j implements it via a simple annotation (@CircuitBreaker), a named YAML configuration, and an optional fallback method. Key decisions are: sliding window size, failure rate threshold, wait duration, and what constitutes a recordable failure. Pair a circuit breaker with a meaningful fallback, wire up the event publisher for observability, and expose the state via Actuator so your operations team can see real-time health. The next lesson adds retries, timeouts, and bulkheads to complete the resilience toolkit.