PCSalt
Architecture · 3 min read

Designing for Failure — Retry, Timeout, Bulkhead & Fallback Patterns

Build resilient systems with failure patterns — retries with backoff, timeouts, circuit breakers, bulkheads, and fallbacks. Practical examples with Resilience4j.


In distributed systems, failure is not an edge case — it’s a normal operating condition. Networks partition. Services crash. Databases run slow. The question isn’t whether these things happen, but how your system behaves when they do.

The patterns in this post turn uncontrolled failures into controlled degradation. Instead of cascading crashes, your system retries, times out, falls back, and recovers.

The five resilience patterns

| Pattern | What it does | When to use |
|---|---|---|
| Retry | Try the operation again | Transient failures (network blip) |
| Timeout | Fail fast after a deadline | Prevent indefinite waiting |
| Circuit Breaker | Stop calling a failing service | Prevent cascading failure |
| Bulkhead | Isolate components | Prevent one failure from exhausting all resources |
| Fallback | Return a degraded response | Maintain partial functionality |

Retry with exponential backoff

Not all failures are permanent. A network glitch, a brief database overload, or a momentary service restart — retrying fixes these.

// Without retry
String result = callExternalService(); // fails once, game over

// With retry (Resilience4j)
RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    .retryExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class) // don't retry business errors
    .build();

Retry retry = Retry.of("externalService", config);

String result = Retry.decorateSupplier(retry, () -> callExternalService()).get();

Exponential backoff

Fixed retry intervals can overwhelm a recovering service. Exponential backoff increases the delay between retries:

RetryConfig config = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialBackoff(
        Duration.ofMillis(500),  // initial wait
        2.0                      // multiplier
    ))
    .build();
// Retry 1: 500ms, Retry 2: 1000ms, Retry 3: 2000ms

With jitter

If 1000 clients retry at the same intervals, they all hit the service simultaneously (thundering herd). Add jitter:

RetryConfig config = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(500),
        2.0,
        0.5  // randomization factor
    ))
    .build();
// Retry delays are randomized ±50%
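For intuition, the jittered schedule can be computed by hand. This sketch mirrors what `ofExponentialRandomBackoff` does conceptually — it is an illustration, not the Resilience4j internals, and the names are made up:

```java
import java.util.Random;

public class Jitter {
    // delay_n = base * multiplier^(n-1), then randomized by ±factor
    public static long jitteredDelayMs(long baseMs, double multiplier,
                                       int attempt, double factor, Random rng) {
        double exponential = baseMs * Math.pow(multiplier, attempt - 1);
        // spread is uniform in [1 - factor, 1 + factor]
        double spread = 1.0 + factor * (2 * rng.nextDouble() - 1.0);
        return (long) (exponential * spread);
    }
}
```

With base 500ms, multiplier 2.0, and factor 0.5, the second attempt lands anywhere in 500–1500ms, so a thousand clients spread out instead of hammering the service in lockstep.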

When NOT to retry

  • Non-transient errors: 400 Bad Request, 404 Not Found, validation errors
  • Non-idempotent operations: If the first call might have succeeded (payment charge), retrying could duplicate it
  • Already timed out: If the original request timed out for the caller, retrying is pointless
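These rules can be sketched as a small hand-rolled helper, independent of Resilience4j (the `NonRetryableException` marker class is made up for illustration):

```java
import java.util.function.Supplier;

public class RetryHelper {
    // Marker for errors that must never be retried (validation, business rules)
    public static class NonRetryableException extends RuntimeException {
        public NonRetryableException(String msg) { super(msg); }
    }

    public static <T> T retryTransient(Supplier<T> call, int maxAttempts, long baseDelayMs)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (NonRetryableException e) {
                throw e; // permanent error: retrying cannot help
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) throw e; // retry budget exhausted
                Thread.sleep(baseDelayMs << (attempt - 1)); // exponential backoff
            }
        }
    }
}
```

A transient failure that succeeds on the third try returns normally; a non-retryable error propagates after exactly one call.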

Timeout

A service call without a timeout can block forever. If the downstream service is stuck, your thread is stuck, your thread pool fills up, and your service stops responding.

TimeLimiterConfig config = TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofSeconds(3))
    .build();

TimeLimiter timeLimiter = TimeLimiter.of("externalService", config);

CompletableFuture<String> future = CompletableFuture.supplyAsync(() ->
    callSlowService()
);

String result = timeLimiter.executeFutureSupplier(() -> future);
// Throws TimeoutException after 3 seconds

Setting timeout values

| Dependency | Suggested timeout |
|---|---|
| Database query | 3–5 seconds |
| HTTP call to internal service | 2–5 seconds |
| HTTP call to external API | 5–10 seconds |
| Complex computation | Depends on SLA |

Set timeouts based on the dependency’s expected response time + buffer. If a service normally responds in 100ms, a 3-second timeout catches degradation without being too aggressive.
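For plain HTTP dependencies, the JDK's built-in HttpClient can enforce such deadlines directly, without extra libraries. A sketch using illustrative values from the table above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class Timeouts {
    // Connection-establishment deadline, shared by every request on this client
    public static final HttpClient INTERNAL = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    // Total response deadline for one internal-service request;
    // the client throws HttpTimeoutException if it is exceeded
    public static HttpRequest internalRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(3))
                .build();
    }
}
```

Separating the connect timeout (is the host reachable?) from the request timeout (is the response arriving?) lets you fail fast on dead hosts while tolerating slower responses.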

Circuit breaker

A circuit breaker prevents calling a service that’s known to be failing. It has three states:

CLOSED (normal)
   │ failures exceed threshold
   ▼
OPEN (reject all calls)
   │ after wait duration
   ▼
HALF_OPEN (allow a few test calls)
   ├─ success → CLOSED
   └─ failure → OPEN

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)           // open after 50% failures
    .slidingWindowSize(10)              // in the last 10 calls
    .waitDurationInOpenState(Duration.ofSeconds(30))  // stay open for 30s
    .permittedNumberOfCallsInHalfOpenState(3)  // allow 3 test calls
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("externalService", config);

String result = CircuitBreaker.decorateSupplier(circuitBreaker, () ->
    callExternalService()
).get();
// Throws CallNotPermittedException when circuit is OPEN

Why it matters

Without a circuit breaker, if Service B is down:

  1. Every request to Service A calls Service B
  2. Each call waits for timeout (3 seconds)
  3. Service A’s thread pool fills up
  4. Service A becomes unresponsive
  5. Service C, which calls Service A, also fills up
  6. Cascade failure

With a circuit breaker:

  1. After 5 failures in 10 calls, circuit opens
  2. Subsequent calls fail immediately (no timeout wait)
  3. Service A remains responsive
  4. After 30 seconds, a few test calls (3 in this config) check whether Service B has recovered
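The state machine can be sketched in a few lines of plain Java. This is a minimal illustration, not the Resilience4j implementation — it counts consecutive failures rather than tracking a sliding-window failure rate:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Opens after `threshold` consecutive failures; after `cooldown`,
// lets one trial call through to probe whether the dependency recovered.
public class SimpleCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private Instant openedAt;
    private final int threshold;
    private final Duration cooldown;

    public SimpleCircuitBreaker(int threshold, Duration cooldown) {
        this.threshold = threshold;
        this.cooldown = cooldown;
    }

    public synchronized <T> T call(Supplier<T> supplier) {
        if (state == State.OPEN) {
            if (Instant.now().isBefore(openedAt.plus(cooldown))) {
                throw new IllegalStateException("circuit open: call rejected");
            }
            state = State.HALF_OPEN; // cooldown elapsed: allow a trial call
        }
        try {
            T result = supplier.get();
            state = State.CLOSED;    // trial (or normal) call succeeded
            failures = 0;
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= threshold) {
                state = State.OPEN;
                openedAt = Instant.now();
            }
            throw e;
        }
    }

    public synchronized State state() { return state; }
}
```

Note the key property: once OPEN, calls are rejected immediately, so no thread ever waits out a timeout against a dependency that is known to be down.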

Bulkhead

A bulkhead limits the number of concurrent calls to a dependency. Named after ship bulkheads — compartments that prevent a hull breach from sinking the entire ship.

// Thread pool bulkhead
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
    .maxThreadPoolSize(10)
    .coreThreadPoolSize(5)
    .queueCapacity(20)
    .build();

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("externalService", config);

CompletionStage<String> result = ThreadPoolBulkhead.decorateSupplier(
    bulkhead, () -> callExternalService()
).get();
// Throws BulkheadFullException when all 10 threads + 20 queue slots are in use

Semaphore bulkhead

A semaphore bulkhead caps concurrent calls without managing a separate thread pool, which makes it the better fit for virtual threads:

BulkheadConfig config = BulkheadConfig.custom()
    .maxConcurrentCalls(25)
    .maxWaitDuration(Duration.ofMillis(500))
    .build();

Bulkhead bulkhead = Bulkhead.of("externalService", config);

String result = Bulkhead.decorateSupplier(bulkhead, () -> callExternalService()).get();
// Throws BulkheadFullException if 25 calls are in flight after waiting 500ms

Why bulkheads matter

Without a bulkhead, a slow dependency can consume all your threads:

Thread Pool (200 threads):
- Payment Service: slow (using 180 threads, all waiting)
- User Service: fast (only 20 threads available)
- Catalog Service: no threads left → timeout

With bulkheads:

Payment Bulkhead: max 50 threads
User Bulkhead: max 50 threads
Catalog Bulkhead: max 50 threads
General: remaining 50 threads

Even if Payment is slow, it can only use 50 threads.
User and Catalog continue operating normally.
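The same isolation can be sketched with a plain java.util.concurrent.Semaphore — an illustration of the mechanism, not the Resilience4j implementation:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// At most `maxConcurrent` calls in flight per dependency;
// excess callers fail immediately instead of piling up on threads.
public class SimpleBulkhead {
    private final Semaphore permits;

    public SimpleBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public <T> T call(Supplier<T> supplier) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full: call rejected");
        }
        try {
            return supplier.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }
}
```

One such bulkhead per dependency gives exactly the partitioning shown above: a slow Payment call can exhaust only its own permits, never the whole thread pool.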

Fallback

When everything else fails, return a degraded response instead of an error:

String result;
try {
    result = callExternalService();
} catch (Exception e) {
    result = getFallbackResponse(); // cached data, default value, etc.
}

With Resilience4j

Supplier<String> decoratedSupplier = Decorators.ofSupplier(() -> callExternalService())
    .withRetry(retry)
    .withCircuitBreaker(circuitBreaker)
    .withFallback(List.of(
        CallNotPermittedException.class,
        TimeoutException.class,
        IOException.class
    ), e -> getFallbackResponse())
    .decorate();

String result = decoratedSupplier.get();

Fallback strategies

| Strategy | Example |
|---|---|
| Cached data | Return the last known good response |
| Default value | Return a static default |
| Degraded response | Return partial data (skip the failing dependency) |
| Alternative service | Call a backup API |
| Queue for later | Accept the request, process when dependency recovers |

// Fallback: cached data
String getFallbackPrice(String productId) {
    return priceCache.get(productId); // last known price
}

// Fallback: degraded response
UserProfile getFallbackProfile(String userId) {
    User user = userService.getUser(userId); // this works
    // Skip recommendations (that service is down)
    return new UserProfile(user, List.of(), List.of());
}

Combining patterns

Patterns work best together, and order matters: in Resilience4j's Decorators builder, each wrapper encloses the previous one, so the first decorator applied runs innermost.

Retry → Circuit Breaker → Bulkhead → Timeout → Fallback

Supplier<String> resilientCall = Decorators
    .ofSupplier(() -> callExternalService())
    .withRetry(retry)                  // retry transient failures
    .withCircuitBreaker(circuitBreaker) // stop calling if failing
    .withBulkhead(bulkhead)            // limit concurrency
    .withFallback(List.of(Exception.class), e -> fallbackResponse())
    .decorate();

Spring Boot integration (Kotlin)

@Configuration
class ResilienceConfig {

    @Bean
    fun paymentCircuitBreaker(): CircuitBreaker {
        val config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50f)
            .slidingWindowSize(10)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build()
        return CircuitBreaker.of("payment", config)
    }

    @Bean
    fun paymentRetry(): Retry {
        val config = RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofMillis(500))
            .retryExceptions(IOException::class.java)
            .build()
        return Retry.of("payment", config)
    }
}

@Service
class PaymentService(
    private val circuitBreaker: CircuitBreaker,
    private val retry: Retry,
    private val paymentClient: PaymentClient
) {
    fun processPayment(orderId: String, amount: Double): PaymentResult {
        val decoratedCall = Decorators.ofSupplier {
            paymentClient.charge(orderId, amount)
        }
            .withRetry(retry)
            .withCircuitBreaker(circuitBreaker)
            .withFallback(listOf(Exception::class.java)) {
                PaymentResult.Pending(orderId) // try again later
            }
            .decorate()

        return decoratedCall.get()
    }
}

Monitoring resilience

Track these metrics:

| Metric | Why |
|---|---|
| Retry count | High retries = unstable dependency |
| Circuit state changes | CLOSED→OPEN = dependency degraded |
| Timeout rate | > 1% = dependency too slow |
| Bulkhead rejections | Resources exhausted |
| Fallback invocations | System operating in degraded mode |

Resilience4j integrates with Micrometer for metrics:

CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(config);
CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("payment");

TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(circuitBreakerRegistry)
    .bindTo(meterRegistry);

Summary

| Pattern | Prevents | Tradeoff |
|---|---|---|
| Retry | Transient failures | Increased latency, possible duplicates |
| Timeout | Indefinite blocking | Might time out during a normal slow response |
| Circuit Breaker | Cascading failures | Temporary unavailability of a feature |
| Bulkhead | Resource exhaustion | Limited concurrency per dependency |
| Fallback | Total failure | Degraded user experience |

Design for failure means accepting that dependencies will fail and planning what happens when they do. The goal isn’t zero failures — it’s graceful degradation: the system keeps working, perhaps with reduced functionality, while the failing component recovers.