Designing for Failure — Retry, Timeout, Bulkhead & Fallback Patterns
Build resilient systems with failure patterns — retries with backoff, timeouts, circuit breakers, bulkheads, and fallbacks. Practical examples with Resilience4j.
In distributed systems, failure is not an edge case — it’s a normal operating condition. Networks partition. Services crash. Databases run slow. The question isn’t whether these things happen, but how your system behaves when they do.
The patterns in this post turn uncontrolled failures into controlled degradation. Instead of cascading crashes, your system retries, times out, falls back, and recovers.
The five resilience patterns
| Pattern | What it does | When to use |
|---|---|---|
| Retry | Try the operation again | Transient failures (network blip) |
| Timeout | Fail fast after a deadline | Prevent indefinite waiting |
| Circuit Breaker | Stop calling a failing service | Prevent cascading failure |
| Bulkhead | Isolate components | Prevent one failure from exhausting all resources |
| Fallback | Return a degraded response | Maintain partial functionality |
Retry with exponential backoff
Not all failures are permanent. A network glitch, a brief database overload, or a momentary service restart — retrying fixes these.
// Without retry
String result = callExternalService(); // fails once, game over
// With retry (Resilience4j)
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(500))
.retryExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(BusinessException.class) // don't retry business errors
.build();
Retry retry = Retry.of("externalService", config);
String result = Retry.decorateSupplier(retry, () -> callExternalService()).get();
Exponential backoff
Fixed retry intervals can overwhelm a recovering service. Exponential backoff increases the delay between retries:
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialBackoff(
Duration.ofMillis(500), // initial wait
2.0 // multiplier
))
.build();
// Retry 1: 500ms, Retry 2: 1000ms, Retry 3: 2000ms
With jitter
If 1000 clients retry at the same intervals, they all hit the service simultaneously (thundering herd). Add jitter:
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
Duration.ofMillis(500),
2.0,
0.5 // randomization factor
))
.build();
// Retry delays are randomized ±50%
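The two interval functions above can be approximated in a few lines of plain Java. Here is a minimal sketch, with illustrative names, of how the exponential delay and the randomization factor combine:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of exponential backoff with a randomization factor, mirroring the
// behavior of IntervalFunction.ofExponentialRandomBackoff. Names are mine.
public class Backoff {
    // Deterministic exponential delay: initial * multiplier^(attempt - 1)
    static long exponentialDelayMillis(long initialMillis, double multiplier, int attempt) {
        return (long) (initialMillis * Math.pow(multiplier, attempt - 1));
    }

    // Jittered delay: picked uniformly from [base * (1 - factor), base * (1 + factor))
    static long jitteredDelayMillis(long initialMillis, double multiplier,
                                    double randomizationFactor, int attempt) {
        long base = exponentialDelayMillis(initialMillis, multiplier, attempt);
        double low = base * (1 - randomizationFactor);
        double high = base * (1 + randomizationFactor);
        if (high <= low) return base; // factor of 0: no jitter
        return (long) ThreadLocalRandom.current().nextDouble(low, high);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.printf("retry %d: base=%dms jittered=%dms%n",
                attempt,
                exponentialDelayMillis(500, 2.0, attempt),
                jitteredDelayMillis(500, 2.0, 0.5, attempt));
        }
    }
}
```

For retry 2 with initial 500ms, multiplier 2.0 and factor 0.5, the jittered delay lands somewhere in [500ms, 1500ms), so a fleet of clients spreads out instead of retrying in lockstep.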
When NOT to retry
- Non-transient errors: 400 Bad Request, 404 Not Found, validation errors
- Non-idempotent operations: If the first call might have succeeded (payment charge), retrying could duplicate it
- Already timed out: If the original request timed out for the caller, retrying is pointless
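For the non-idempotent case, the usual escape hatch is an idempotency key: the client attaches the same key to every attempt of one logical operation, and the server deduplicates by key. A toy sketch of the server side (all names are mine, not a real payment API):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: making a charge safe to retry by deduplicating on an idempotency key.
// The first call with a given key executes; repeats replay the stored result
// instead of charging twice. All names are illustrative.
public class IdempotentCharge {
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    String charge(String idempotencyKey, long amountCents) {
        return processed.computeIfAbsent(idempotencyKey,
            k -> "charged:" + amountCents + ":" + UUID.randomUUID());
    }

    public static void main(String[] args) {
        IdempotentCharge server = new IdempotentCharge();
        String key = "order-42"; // client picks one key per logical operation
        String first = server.charge(key, 999);
        String retry = server.charge(key, 999); // network blip, client retries
        System.out.println(first.equals(retry)); // prints true: no double charge
    }
}
```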
Timeout
A service call without a timeout can block forever. If the downstream service is stuck, your thread is stuck, your thread pool fills up, and your service stops responding.
TimeLimiterConfig config = TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(3))
.build();
TimeLimiter timeLimiter = TimeLimiter.of("externalService", config);
String result = timeLimiter.executeFutureSupplier(() ->
    CompletableFuture.supplyAsync(() -> callSlowService())
);
// Throws TimeoutException after 3 seconds; the supplier creates a fresh
// future per call, so the TimeLimiter can cancel it on timeout
Setting timeout values
| Dependency | Suggested timeout |
|---|---|
| Database query | 3–5 seconds |
| HTTP call to internal service | 2–5 seconds |
| HTTP call to external API | 5–10 seconds |
| Complex computation | Depends on SLA |
Set timeouts based on the dependency’s expected response time + buffer. If a service normally responds in 100ms, a 3-second timeout catches degradation without being too aggressive.
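The "expected response time + buffer" rule can be made concrete by deriving the timeout from observed latencies, for example a p99 scaled by a safety factor. A small sketch with illustrative names, using a nearest-rank percentile:

```java
import java.util.Arrays;

// Sketch: derive a timeout budget from a sample of observed latencies,
// as "p99 plus buffer" rather than a number picked from the air.
public class TimeoutBudget {
    // Nearest-rank percentile; p in (0, 1], latencies must be non-empty
    static long percentileMillis(long[] latencies, double p) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    // Timeout = p99 latency scaled by a safety factor
    static long timeoutMillis(long[] latencies, double safetyFactor) {
        return (long) (percentileMillis(latencies, 0.99) * safetyFactor);
    }

    public static void main(String[] args) {
        // A dependency that usually answers in ~100ms but tails at 900ms:
        long[] observed = {80, 90, 95, 100, 110, 120, 150, 200, 400, 900};
        System.out.println(timeoutMillis(observed, 1.5) + "ms");
    }
}
```

In practice the sample would come from your metrics system, but the point stands: the timeout is a property of the dependency's measured tail latency, not a guess.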
Circuit breaker
A circuit breaker prevents calling a service that’s known to be failing. It has three states:
CLOSED (normal) ──failure rate exceeds threshold──▶ OPEN (reject all calls)
OPEN ──after wait duration──▶ HALF_OPEN (allow a few test calls)
HALF_OPEN ──test calls succeed──▶ CLOSED
HALF_OPEN ──test calls fail──▶ OPEN
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // open after 50% failures
.slidingWindowSize(10) // in the last 10 calls
.waitDurationInOpenState(Duration.ofSeconds(30)) // stay open for 30s
.permittedNumberOfCallsInHalfOpenState(3) // allow 3 test calls
.build();
CircuitBreaker circuitBreaker = CircuitBreaker.of("externalService", config);
String result = CircuitBreaker.decorateSupplier(circuitBreaker, () ->
callExternalService()
).get();
// Throws CallNotPermittedException when circuit is OPEN
Why it matters
Without a circuit breaker, if Service B is down:
- Every request to Service A calls Service B
- Each call waits for timeout (3 seconds)
- Service A’s thread pool fills up
- Service A becomes unresponsive
- Service C, which calls Service A, also fills up
- Cascade failure
With a circuit breaker:
- After 5 failures in 10 calls, circuit opens
- Subsequent calls fail immediately (no timeout wait)
- Service A remains responsive
- After 30 seconds, one test call checks if Service B recovered
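The state machine behind this behavior is small enough to sketch in plain Java. This is not the Resilience4j implementation, just the CLOSED/OPEN/HALF_OPEN transitions made concrete:

```java
// Minimal sketch of a count-based circuit breaker: trips OPEN when the
// failure rate over a full window exceeds the threshold, moves to HALF_OPEN
// after the wait duration, and lets a test call decide the next state.
public class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0, calls = 0;
    private long openedAt = 0;
    private final int windowSize, failureThresholdPercent;
    private final long waitMillis;

    MiniCircuitBreaker(int windowSize, int failureThresholdPercent, long waitMillis) {
        this.windowSize = windowSize;
        this.failureThresholdPercent = failureThresholdPercent;
        this.waitMillis = waitMillis;
    }

    boolean allowCall(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= waitMillis) {
            state = State.HALF_OPEN;          // wait elapsed: let a test call through
        }
        return state != State.OPEN;           // OPEN rejects immediately, no timeout wait
    }

    void record(boolean success, long nowMillis) {
        if (state == State.HALF_OPEN) {       // test call decides the next state
            state = success ? State.CLOSED : State.OPEN;
            if (!success) openedAt = nowMillis;
            failures = calls = 0;
            return;
        }
        calls++;
        if (!success) failures++;
        if (calls >= windowSize && failures * 100 / calls >= failureThresholdPercent) {
            state = State.OPEN;               // threshold exceeded: trip the breaker
            openedAt = nowMillis;
            failures = calls = 0;
        }
    }

    State state() { return state; }
}
```

Production implementations add sliding windows, slow-call detection, and thread safety, but the core loop is exactly this handful of transitions.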
Bulkhead
A bulkhead limits the number of concurrent calls to a dependency. Named after ship bulkheads — compartments that prevent a hull breach from sinking the entire ship.
// Thread pool bulkhead
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
.maxThreadPoolSize(10)
.coreThreadPoolSize(5)
.queueCapacity(20)
.build();
ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("externalService", config);
CompletionStage<String> result = ThreadPoolBulkhead.decorateSupplier(
    bulkhead, () -> callExternalService()
).get();
// Throws BulkheadFullException when all 10 threads + 20 queue slots are used
Semaphore bulkhead (for virtual threads)
BulkheadConfig config = BulkheadConfig.custom()
.maxConcurrentCalls(25)
.maxWaitDuration(Duration.ofMillis(500))
.build();
Bulkhead bulkhead = Bulkhead.of("externalService", config);
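Under the hood, a semaphore bulkhead is little more than java.util.concurrent.Semaphore with a bounded wait. A sketch, with an IllegalStateException standing in for Resilience4j's BulkheadFullException:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of a semaphore bulkhead: at most maxConcurrentCalls callers proceed;
// the rest wait briefly for a permit and are then rejected.
public class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead full"); // stand-in for BulkheadFullException
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() { return permits.availablePermits(); }
}
```

Because it only counts callers rather than owning a thread pool, this style works naturally with virtual threads: blocked callers are cheap, and the semaphore still caps how many are inside the dependency at once.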
Why bulkheads matter
Without a bulkhead, a slow dependency can consume all your threads:
Thread Pool (200 threads):
- Payment Service: slow (180 threads stuck waiting on it)
- User Service and Catalog Service: healthy, but left to share the remaining 20 threads
- Under load, their requests queue up and time out anyway
With bulkheads:
Payment Bulkhead: max 50 threads
User Bulkhead: max 50 threads
Catalog Bulkhead: max 50 threads
General: remaining 50 threads
Even if Payment is slow, it can only use 50 threads.
User and Catalog continue operating normally.
Fallback
When everything else fails, return a degraded response instead of an error:
String result;
try {
result = callExternalService();
} catch (Exception e) {
result = getFallbackResponse(); // cached data, default value, etc.
}
With Resilience4j
Supplier<String> decoratedSupplier = Decorators.ofSupplier(() -> callExternalService())
.withRetry(retry)
.withCircuitBreaker(circuitBreaker)
.withFallback(List.of(
CallNotPermittedException.class,
TimeoutException.class,
IOException.class
), e -> getFallbackResponse())
.decorate();
String result = decoratedSupplier.get();
Fallback strategies
| Strategy | Example |
|---|---|
| Cached data | Return the last known good response |
| Default value | Return a static default |
| Degraded response | Return partial data (skip the failing dependency) |
| Alternative service | Call a backup API |
| Queue for later | Accept the request, process when dependency recovers |
// Fallback: cached data
String getFallbackPrice(String productId) {
return priceCache.get(productId); // last known price
}
// Fallback: degraded response
UserProfile getFallbackProfile(String userId) {
User user = userService.getUser(userId); // this works
// Skip recommendations (that service is down)
return new UserProfile(user, List.of(), List.of());
}
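The cached-data strategy generalizes to a small "last known good" wrapper: serve and remember fresh data while the dependency is healthy, and fall back to the most recent value when it throws. A generic sketch (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: last-known-good fallback cache. Fresh responses are served and
// remembered; on failure, the most recent cached value is returned instead.
public class LastKnownGood<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    V fetch(K key, Supplier<V> live) {
        try {
            V fresh = live.get();
            cache.put(key, fresh);      // remember the last good response
            return fresh;
        } catch (RuntimeException e) {
            V stale = cache.get(key);
            if (stale == null) throw e; // nothing cached: nothing to degrade to
            return stale;               // degraded but useful
        }
    }
}
```

Note the stale value may be arbitrarily old; real implementations usually attach a timestamp and refuse to serve data past a staleness bound.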
Combining patterns
Patterns work best together, and order matters. With Resilience4j's Decorators, each with* call wraps the one before it, so the first decorator sits closest to the call and the last one is the outermost layer:
Retry → Circuit Breaker → Bulkhead → Timeout → Fallback (innermost to outermost)
Supplier<String> resilientCall = Decorators
.ofSupplier(() -> callExternalService())
.withRetry(retry) // retry transient failures
.withCircuitBreaker(circuitBreaker) // stop calling if failing
.withBulkhead(bulkhead) // limit concurrency
.withFallback(List.of(Exception.class), e -> fallbackResponse())
.decorate();
Spring Boot integration
@Configuration
class ResilienceConfig {
@Bean
fun paymentCircuitBreaker(): CircuitBreaker {
val config = CircuitBreakerConfig.custom()
.failureRateThreshold(50f)
.slidingWindowSize(10)
.waitDurationInOpenState(Duration.ofSeconds(30))
.build()
return CircuitBreaker.of("payment", config)
}
@Bean
fun paymentRetry(): Retry {
val config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(500))
.retryExceptions(IOException::class.java)
.build()
return Retry.of("payment", config)
}
}
@Service
class PaymentService(
private val circuitBreaker: CircuitBreaker,
private val retry: Retry,
private val paymentClient: PaymentClient
) {
fun processPayment(orderId: String, amount: Double): PaymentResult {
val decoratedCall = Decorators.ofSupplier {
paymentClient.charge(orderId, amount)
}
.withRetry(retry)
.withCircuitBreaker(circuitBreaker)
.withFallback(listOf(Exception::class.java)) {
PaymentResult.Pending(orderId) // try again later
}
.decorate()
return decoratedCall.get()
}
}
Monitoring resilience
Track these metrics:
| Metric | Why |
|---|---|
| Retry count | High retries = unstable dependency |
| Circuit state changes | CLOSED→OPEN = dependency degraded |
| Timeout rate | > 1% = dependency too slow |
| Bulkhead rejections | Resources exhausted |
| Fallback invocations | System operating in degraded mode |
Resilience4j integrates with Micrometer for metrics:
CircuitBreaker circuitBreaker = CircuitBreaker.of("payment", config);
TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(circuitBreakerRegistry)
.bindTo(meterRegistry);
Summary
| Pattern | Prevents | Tradeoff |
|---|---|---|
| Retry | Transient failures | Increased latency, possible duplicates |
| Timeout | Indefinite blocking | Might timeout during normal slow response |
| Circuit Breaker | Cascading failures | Temporary unavailability of a feature |
| Bulkhead | Resource exhaustion | Limited concurrency per dependency |
| Fallback | Total failure | Degraded user experience |
Designing for failure means accepting that dependencies will fail and planning what happens when they do. The goal isn’t zero failures — it’s graceful degradation: the system keeps working, perhaps with reduced functionality, while the failing component recovers.