Designing for Failure — Retry, Timeout, Bulkhead & Fallback Patterns
Build resilient systems with failure patterns — retries with backoff, timeouts, circuit breakers, bulkheads, and fallbacks. Practical examples with Resilience4j.
In distributed systems, failure is not an edge case — it’s a normal operating condition. Networks partition. Services crash. Databases run slow. The question isn’t whether these things happen, but how your system behaves when they do.
The patterns in this post turn uncontrolled failures into controlled degradation. Instead of cascading crashes, your system retries, times out, falls back, and recovers.
The five resilience patterns
| Pattern | What it does | When to use |
|---|---|---|
| Retry | Try the operation again | Transient failures (network blip) |
| Timeout | Fail fast after a deadline | Prevent indefinite waiting |
| Circuit Breaker | Stop calling a failing service | Prevent cascading failure |
| Bulkhead | Isolate components | Prevent one failure from exhausting all resources |
| Fallback | Return a degraded response | Maintain partial functionality |
Retry with exponential backoff
Not all failures are permanent. A network glitch, a brief database overload, or a momentary service restart — retrying fixes these.
// Without retry
String result = callExternalService(); // fails once, game over
// With retry (Resilience4j)
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(500))
.retryExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(BusinessException.class) // don't retry business errors
.build();
Retry retry = Retry.of("externalService", config);
String result = Retry.decorateSupplier(retry, () -> callExternalService()).get();
Exponential backoff
Fixed retry intervals can overwhelm a recovering service. Exponential backoff increases the delay between retries:
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialBackoff(
Duration.ofMillis(500), // initial wait
2.0 // multiplier
))
.build();
// Retry 1: 500ms, Retry 2: 1000ms, Retry 3: 2000ms
With jitter
If 1000 clients retry at the same intervals, they all hit the service simultaneously (thundering herd). Add jitter:
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
Duration.ofMillis(500),
2.0,
0.5 // randomization factor
))
.build();
// Retry delays are randomized ±50%
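The two interval functions above can be approximated in a few lines of plain Java. Here is a minimal sketch, with illustrative names, of how the exponential delay and the randomization factor combine:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of exponential backoff with a randomization factor, mirroring the
// behavior of IntervalFunction.ofExponentialRandomBackoff. Names are mine.
public class Backoff {
    // Deterministic exponential delay: initial * multiplier^(attempt - 1)
    static long exponentialDelayMillis(long initialMillis, double multiplier, int attempt) {
        return (long) (initialMillis * Math.pow(multiplier, attempt - 1));
    }

    // Jittered delay: picked uniformly from [base * (1 - factor), base * (1 + factor))
    static long jitteredDelayMillis(long initialMillis, double multiplier,
                                    double randomizationFactor, int attempt) {
        long base = exponentialDelayMillis(initialMillis, multiplier, attempt);
        double low = base * (1 - randomizationFactor);
        double high = base * (1 + randomizationFactor);
        if (high <= low) return base; // factor of 0: no jitter
        return (long) ThreadLocalRandom.current().nextDouble(low, high);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.printf("retry %d: base=%dms jittered=%dms%n",
                attempt,
                exponentialDelayMillis(500, 2.0, attempt),
                jitteredDelayMillis(500, 2.0, 0.5, attempt));
        }
    }
}
```

For retry 2 with initial 500ms, multiplier 2.0 and factor 0.5, the jittered delay lands somewhere in [500ms, 1500ms), so a fleet of clients spreads out instead of retrying in lockstep.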
When NOT to retry
- Non-transient errors: 400 Bad Request, 404 Not Found, validation errors
- Non-idempotent operations: If the first call might have succeeded (payment charge), retrying could duplicate it
- Already timed out: If the original request timed out for the caller, retrying is pointless
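For the non-idempotent case, the usual escape hatch is an idempotency key: the client attaches the same key to every attempt of one logical operation, and the server deduplicates by key. A toy sketch of the server side (all names are mine, not a real payment API):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: making a charge safe to retry by deduplicating on an idempotency key.
// The first call with a given key executes; repeats replay the stored result
// instead of charging twice. All names are illustrative.
public class IdempotentCharge {
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    String charge(String idempotencyKey, long amountCents) {
        return processed.computeIfAbsent(idempotencyKey,
            k -> "charged:" + amountCents + ":" + UUID.randomUUID());
    }

    public static void main(String[] args) {
        IdempotentCharge server = new IdempotentCharge();
        String key = "order-42"; // client picks one key per logical operation
        String first = server.charge(key, 999);
        String retry = server.charge(key, 999); // network blip, client retries
        System.out.println(first.equals(retry)); // prints true: no double charge
    }
}
```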
Timeout
A service call without a timeout can block forever. If the downstream service is stuck, your thread is stuck, your thread pool fills up, and your service stops responding.
TimeLimiterConfig config = TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(3))
.build();
TimeLimiter timeLimiter = TimeLimiter.of("externalService", config);
String result = timeLimiter.executeFutureSupplier(() ->
    CompletableFuture.supplyAsync(() -> callSlowService())
);
// Throws TimeoutException after 3 seconds; the supplier creates a fresh
// future per call, so the TimeLimiter can cancel it on timeout
Setting timeout values
| Dependency | Suggested timeout |
|---|---|
| Database query | 3–5 seconds |
| HTTP call to internal service | 2–5 seconds |
| HTTP call to external API | 5–10 seconds |
| Complex computation | Depends on SLA |
Set timeouts based on the dependency’s expected response time + buffer. If a service normally responds in 100ms, a 3-second timeout catches degradation without being too aggressive.
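The "expected response time + buffer" rule can be made concrete by deriving the timeout from observed latencies, for example a p99 scaled by a safety factor. A small sketch with illustrative names, using a nearest-rank percentile:

```java
import java.util.Arrays;

// Sketch: derive a timeout budget from a sample of observed latencies,
// as "p99 plus buffer" rather than a number picked from the air.
public class TimeoutBudget {
    // Nearest-rank percentile; p in (0, 1], latencies must be non-empty
    static long percentileMillis(long[] latencies, double p) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    // Timeout = p99 latency scaled by a safety factor
    static long timeoutMillis(long[] latencies, double safetyFactor) {
        return (long) (percentileMillis(latencies, 0.99) * safetyFactor);
    }

    public static void main(String[] args) {
        // A dependency that usually answers in ~100ms but tails at 900ms:
        long[] observed = {80, 90, 95, 100, 110, 120, 150, 200, 400, 900};
        System.out.println(timeoutMillis(observed, 1.5) + "ms");
    }
}
```

In practice the sample would come from your metrics system, but the point stands: the timeout is a property of the dependency's measured tail latency, not a guess.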
Circuit breaker
A circuit breaker prevents calling a service that’s known to be failing. It has three states:
CLOSED (normal) ──failure rate exceeds threshold──▶ OPEN (reject all calls)
OPEN ──after wait duration──▶ HALF_OPEN (allow a few test calls)
HALF_OPEN ──test calls succeed──▶ CLOSED
HALF_OPEN ──test calls fail──▶ OPEN
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // open after 50% failures
.slidingWindowSize(10) // in the last 10 calls
.waitDurationInOpenState(Duration.ofSeconds(30)) // stay open for 30s
.permittedNumberOfCallsInHalfOpenState(3) // allow 3 test calls
.build();
CircuitBreaker circuitBreaker = CircuitBreaker.of("externalService", config);
String result = CircuitBreaker.decorateSupplier(circuitBreaker, () ->
callExternalService()
).get();
// Throws CallNotPermittedException when circuit is OPEN
Why it matters
Without a circuit breaker, if Service B is down:
- Every request to Service A calls Service B
- Each call waits for timeout (3 seconds)
- Service A’s thread pool fills up
- Service A becomes unresponsive
- Service C, which calls Service A, also fills up
- Cascade failure
With a circuit breaker:
- After 5 failures in 10 calls, circuit opens
- Subsequent calls fail immediately (no timeout wait)
- Service A remains responsive
- After 30 seconds, one test call checks if Service B recovered
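The state machine behind this behavior is small enough to sketch in plain Java. This is not the Resilience4j implementation, just the CLOSED/OPEN/HALF_OPEN transitions made concrete:

```java
// Minimal sketch of a count-based circuit breaker: trips OPEN when the
// failure rate over a full window exceeds the threshold, moves to HALF_OPEN
// after the wait duration, and lets a test call decide the next state.
public class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0, calls = 0;
    private long openedAt = 0;
    private final int windowSize, failureThresholdPercent;
    private final long waitMillis;

    MiniCircuitBreaker(int windowSize, int failureThresholdPercent, long waitMillis) {
        this.windowSize = windowSize;
        this.failureThresholdPercent = failureThresholdPercent;
        this.waitMillis = waitMillis;
    }

    boolean allowCall(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= waitMillis) {
            state = State.HALF_OPEN;          // wait elapsed: let a test call through
        }
        return state != State.OPEN;           // OPEN rejects immediately, no timeout wait
    }

    void record(boolean success, long nowMillis) {
        if (state == State.HALF_OPEN) {       // test call decides the next state
            state = success ? State.CLOSED : State.OPEN;
            if (!success) openedAt = nowMillis;
            failures = calls = 0;
            return;
        }
        calls++;
        if (!success) failures++;
        if (calls >= windowSize && failures * 100 / calls >= failureThresholdPercent) {
            state = State.OPEN;               // threshold exceeded: trip the breaker
            openedAt = nowMillis;
            failures = calls = 0;
        }
    }

    State state() { return state; }
}
```

Production implementations add sliding windows, slow-call detection, and thread safety, but the core loop is exactly this handful of transitions.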
Bulkhead
A bulkhead limits the number of concurrent calls to a dependency. Named after ship bulkheads — compartments that prevent a hull breach from sinking the entire ship.
// Thread pool bulkhead
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
.maxThreadPoolSize(10)
.coreThreadPoolSize(5)
.queueCapacity(20)
.build();
ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("externalService", config);
CompletionStage<String> result = ThreadPoolBulkhead.decorateSupplier(
    bulkhead, () -> callExternalService()
).get();
// Throws BulkheadFullException when all 10 threads + 20 queue slots are used
Semaphore bulkhead (for virtual threads)
BulkheadConfig config = BulkheadConfig.custom()
.maxConcurrentCalls(25)
.maxWaitDuration(Duration.ofMillis(500))
.build();
Bulkhead bulkhead = Bulkhead.of("externalService", config);
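Under the hood, a semaphore bulkhead is little more than java.util.concurrent.Semaphore with a bounded wait. A sketch, with an IllegalStateException standing in for Resilience4j's BulkheadFullException:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of a semaphore bulkhead: at most maxConcurrentCalls callers proceed;
// the rest wait briefly for a permit and are then rejected.
public class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead full"); // stand-in for BulkheadFullException
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() { return permits.availablePermits(); }
}
```

Because it only counts callers rather than owning a thread pool, this style works naturally with virtual threads: blocked callers are cheap, and the semaphore still caps how many are inside the dependency at once.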
Why bulkheads matter
Without a bulkhead, a slow dependency can consume all your threads:
Thread Pool (200 threads):
- Payment Service: slow (180 threads stuck waiting on it)
- User Service and Catalog Service: healthy, but left to share the remaining 20 threads
- Under load, their requests queue up and time out anyway
With bulkheads:
Payment Bulkhead: max 50 threads
User Bulkhead: max 50 threads
Catalog Bulkhead: max 50 threads
General: remaining 50 threads
Even if Payment is slow, it can only use 50 threads.
User and Catalog continue operating normally.
Fallback
When everything else fails, return a degraded response instead of an error:
String result;
try {
result = callExternalService();
} catch (Exception e) {
result = getFallbackResponse(); // cached data, default value, etc.
}
With Resilience4j
Supplier<String> decoratedSupplier = Decorators.ofSupplier(() -> callExternalService())
.withRetry(retry)
.withCircuitBreaker(circuitBreaker)
.withFallback(List.of(
CallNotPermittedException.class,
TimeoutException.class,
IOException.class
), e -> getFallbackResponse())
.decorate();
String result = decoratedSupplier.get();
Fallback strategies
| Strategy | Example |
|---|---|
| Cached data | Return the last known good response |
| Default value | Return a static default |
| Degraded response | Return partial data (skip the failing dependency) |
| Alternative service | Call a backup API |
| Queue for later | Accept the request, process when dependency recovers |
// Fallback: cached data
String getFallbackPrice(String productId) {
return priceCache.get(productId); // last known price
}
// Fallback: degraded response
UserProfile getFallbackProfile(String userId) {
User user = userService.getUser(userId); // this works
// Skip recommendations (that service is down)
return new UserProfile(user, List.of(), List.of());
}
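The cached-data strategy generalizes to a small "last known good" wrapper: serve and remember fresh data while the dependency is healthy, and fall back to the most recent value when it throws. A generic sketch (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: last-known-good fallback cache. Fresh responses are served and
// remembered; on failure, the most recent cached value is returned instead.
public class LastKnownGood<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    V fetch(K key, Supplier<V> live) {
        try {
            V fresh = live.get();
            cache.put(key, fresh);      // remember the last good response
            return fresh;
        } catch (RuntimeException e) {
            V stale = cache.get(key);
            if (stale == null) throw e; // nothing cached: nothing to degrade to
            return stale;               // degraded but useful
        }
    }
}
```

Note the stale value may be arbitrarily old; real implementations usually attach a timestamp and refuse to serve data past a staleness bound.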
Combining patterns
Patterns work best together, and order matters. With Resilience4j's Decorators, each with* call wraps the one before it, so the first decorator sits closest to the call and the last one is the outermost layer:
Retry → Circuit Breaker → Bulkhead → Timeout → Fallback (innermost to outermost)
Supplier<String> resilientCall = Decorators
.ofSupplier(() -> callExternalService())
.withRetry(retry) // retry transient failures
.withCircuitBreaker(circuitBreaker) // stop calling if failing
.withBulkhead(bulkhead) // limit concurrency
.withFallback(List.of(Exception.class), e -> fallbackResponse())
.decorate();
Spring Boot integration
@Configuration
class ResilienceConfig {
@Bean
fun paymentCircuitBreaker(): CircuitBreaker {
val config = CircuitBreakerConfig.custom()
.failureRateThreshold(50f)
.slidingWindowSize(10)
.waitDurationInOpenState(Duration.ofSeconds(30))
.build()
return CircuitBreaker.of("payment", config)
}
@Bean
fun paymentRetry(): Retry {
val config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(500))
.retryExceptions(IOException::class.java)
.build()
return Retry.of("payment", config)
}
}
@Service
class PaymentService(
private val circuitBreaker: CircuitBreaker,
private val retry: Retry,
private val paymentClient: PaymentClient
) {
fun processPayment(orderId: String, amount: Double): PaymentResult {
val decoratedCall = Decorators.ofSupplier {
paymentClient.charge(orderId, amount)
}
.withRetry(retry)
.withCircuitBreaker(circuitBreaker)
.withFallback(listOf(Exception::class.java)) {
PaymentResult.Pending(orderId) // try again later
}
.decorate()
return decoratedCall.get()
}
}
Monitoring resilience
Track these metrics:
| Metric | Why |
|---|---|
| Retry count | High retries = unstable dependency |
| Circuit state changes | CLOSED→OPEN = dependency degraded |
| Timeout rate | > 1% = dependency too slow |
| Bulkhead rejections | Resources exhausted |
| Fallback invocations | System operating in degraded mode |
Resilience4j integrates with Micrometer for metrics:
CircuitBreaker circuitBreaker = CircuitBreaker.of("payment", config);
TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(circuitBreakerRegistry)
.bindTo(meterRegistry);
Summary
| Pattern | Prevents | Tradeoff |
|---|---|---|
| Retry | Transient failures | Increased latency, possible duplicates |
| Timeout | Indefinite blocking | Might timeout during normal slow response |
| Circuit Breaker | Cascading failures | Temporary unavailability of a feature |
| Bulkhead | Resource exhaustion | Limited concurrency per dependency |
| Fallback | Total failure | Degraded user experience |
Designing for failure means accepting that dependencies will fail and planning what happens when they do. The goal isn’t zero failures — it’s graceful degradation: the system keeps working, perhaps with reduced functionality, while the failing component recovers.