Saga Pattern — Managing Distributed Transactions
How to implement the Saga pattern for distributed transactions in microservices — choreography vs orchestration, compensation, and practical implementation with Kafka.
In a monolith, you wrap everything in a database transaction:
@Transactional
fun placeOrder(order: Order) {
    orderRepository.save(order)
    paymentService.charge(order.total)
    inventoryService.reserve(order.items)
    emailService.sendConfirmation(order)
}
If any step fails, the transaction rolls back everything. Simple, safe, consistent.
In microservices, there’s no shared transaction. Each service has its own database. You can’t atomically update Order DB + Payment DB + Inventory DB. The Saga pattern is the solution — a sequence of local transactions coordinated through events, with compensation logic to handle failures.
What is a Saga?
A Saga is a sequence of local transactions. Each transaction updates a single service’s database and publishes an event. The next service reacts to the event and performs its own local transaction. If any step fails, the Saga executes compensating transactions to undo the previous steps.
Happy path:
Order Service: Create order (PENDING)
→ Payment Service: Charge customer
→ Inventory Service: Reserve stock
→ Order Service: Confirm order (CONFIRMED)
Failure path (payment fails):
Order Service: Create order (PENDING)
→ Payment Service: Charge fails
→ Order Service: Cancel order (CANCELLED)
Failure path (inventory fails):
Order Service: Create order (PENDING)
→ Payment Service: Charge customer
→ Inventory Service: Reserve fails
→ Payment Service: Refund customer (compensation)
→ Order Service: Cancel order (CANCELLED)
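The flows above can be sketched as a list of steps, each paired with its compensation. This is a minimal in-process illustration rather than a production implementation: `SagaStep`, `runSaga`, and `inventoryFailureTrace` are hypothetical names, and in a real system each action would be a call to a remote service.

```kotlin
// Hypothetical types for illustration; not part of any library.
class SagaStep(
    val name: String,
    val action: () -> Unit,        // the local transaction
    val compensation: () -> Unit   // a new transaction that undoes the action
)

// Runs the steps in order. On the first failure, compensates the steps that
// already completed, most recent first, and returns false.
// A compensation that throws is not handled here and would propagate.
fun runSaga(steps: List<SagaStep>): Boolean {
    val completed = ArrayDeque<SagaStep>()
    for (step in steps) {
        try {
            step.action()
            completed.addFirst(step)
        } catch (e: Exception) {
            completed.forEach { it.compensation() }
            return false
        }
    }
    return true
}

// Example: the inventory reservation fails after the payment succeeded.
fun inventoryFailureTrace(): List<String> {
    val log = mutableListOf<String>()
    val steps = listOf(
        SagaStep("order", { log += "create order" }, { log += "cancel order" }),
        SagaStep("payment", { log += "charge customer" }, { log += "refund customer" }),
        SagaStep("inventory", { throw IllegalStateException("out of stock") }, { log += "release stock" })
    )
    val succeeded = runSaga(steps)
    log += if (succeeded) "confirm order" else "saga failed"
    return log
}
```

Calling `inventoryFailureTrace()` records the refund before the cancellation, which mirrors the inventory failure path listed above.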
Two approaches: Choreography vs Orchestration
Choreography — Event-driven
Each service listens for events and decides what to do. No central coordinator.
Order Service → publishes OrderCreated
↓
Payment Service → processes payment → publishes PaymentProcessed
↓
Inventory Service → reserves stock → publishes StockReserved
↓
Order Service → confirms order → publishes OrderConfirmed
Each service reacts independently:
// Payment Service
@KafkaListener(topics = ["order-events"])
fun onOrderCreated(event: OrderCreated) {
    try {
        val result = paymentGateway.charge(event.customerId, event.total)
        kafkaTemplate.send("payment-events", PaymentProcessed(
            orderId = event.orderId,
            transactionId = result.transactionId,
            // Forwarded so the Inventory Service knows what to reserve
            // (assumes OrderCreated carries the order items)
            items = event.items
        ))
    } catch (e: PaymentDeclinedException) {
        kafkaTemplate.send("payment-events", PaymentFailed(
            orderId = event.orderId,
            reason = e.message
        ))
    }
}

// Inventory Service
@KafkaListener(topics = ["payment-events"])
fun onPaymentProcessed(event: PaymentProcessed) {
    try {
        inventoryRepository.reserve(event.orderId, event.items)
        kafkaTemplate.send("inventory-events", StockReserved(
            orderId = event.orderId
        ))
    } catch (e: InsufficientStockException) {
        kafkaTemplate.send("inventory-events", StockReservationFailed(
            orderId = event.orderId,
            reason = e.message
        ))
    }
}

// Payment Service (compensation)
@KafkaListener(topics = ["inventory-events"])
fun onStockReservationFailed(event: StockReservationFailed) {
    paymentGateway.refund(event.orderId)
    kafkaTemplate.send("payment-events", PaymentRefunded(
        orderId = event.orderId
    ))
}
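The listeners above do not show how the Order Service closes the loop. One possible sketch, assuming the events are modelled as a sealed interface so that a single handler can branch on the event type: `SagaEvent`, `OrderStatus`, and `nextOrderStatus` are hypothetical names, and the event classes are reduced to the fields used here.

```kotlin
// Hypothetical event hierarchy for illustration, reduced to the fields used here.
sealed interface SagaEvent { val orderId: String }
data class PaymentFailed(override val orderId: String, val reason: String?) : SagaEvent
data class PaymentRefunded(override val orderId: String) : SagaEvent
data class StockReserved(override val orderId: String) : SagaEvent
data class StockReservationFailed(override val orderId: String, val reason: String?) : SagaEvent

enum class OrderStatus { PENDING, CONFIRMED, CANCELLED }

// Decides how the Order Service reacts to an event.
// Orders that already left PENDING are returned unchanged, so a duplicate
// or late event cannot flip a finished order.
fun nextOrderStatus(current: OrderStatus, event: SagaEvent): OrderStatus {
    if (current != OrderStatus.PENDING) return current
    return when (event) {
        is StockReserved -> OrderStatus.CONFIRMED
        is PaymentFailed -> OrderStatus.CANCELLED       // nothing was charged
        is PaymentRefunded -> OrderStatus.CANCELLED     // refund done, now cancel
        is StockReservationFailed -> current            // wait for PaymentRefunded
    }
}
```

Keeping the decision in a pure function like this makes the scattered choreography logic easier to unit test, although the overall flow is still spread across services.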
Pros: Simple, loosely coupled, no single point of failure. Cons: Hard to track the overall flow, scattered logic, difficult to debug.
Orchestration — Central coordinator
A central orchestrator (Saga manager) tells each service what to do:
class OrderSagaOrchestrator(
    private val orderService: OrderServiceClient,
    private val paymentService: PaymentServiceClient,
    private val inventoryService: InventoryServiceClient
) {
    // Note: catch (e: Exception) in a suspend function also catches
    // CancellationException; real code should handle cancellation explicitly.
    suspend fun execute(command: PlaceOrderCommand): SagaResult {
        // Step 1: Create order (PENDING)
        val order = orderService.create(command)

        // Step 2: Process payment
        val paymentResult = try {
            paymentService.charge(order.id, command.total)
        } catch (e: Exception) {
            // Compensate step 1
            orderService.cancel(order.id)
            return SagaResult.Failed("Payment failed: ${e.message}")
        }

        // Step 3: Reserve inventory
        try {
            inventoryService.reserve(order.id, command.items)
        } catch (e: Exception) {
            // Compensate step 2, then step 1
            paymentService.refund(paymentResult.transactionId)
            orderService.cancel(order.id)
            return SagaResult.Failed("Inventory failed: ${e.message}")
        }

        // All steps succeeded
        orderService.confirm(order.id)
        return SagaResult.Success(order.id)
    }
}
Pros: Clear flow, easy to understand, centralized error handling. Cons: Single point of failure (orchestrator), tighter coupling, orchestrator becomes complex.
Which to choose?
| Factor | Choreography | Orchestration |
|---|---|---|
| Complexity | Distributed (many listeners) | Centralized (one coordinator) |
| Coupling | Loose | Tighter |
| Visibility | Hard to trace | Easy to trace |
| Failure handling | Distributed compensation | Centralized compensation |
| Best for | Simple sagas (2-3 steps) | Complex sagas (4+ steps) |
For 2-3 services, choreography is fine. For complex flows with many steps and conditional logic, orchestration is clearer.
Compensation transactions
Compensation is how you undo a step when a later step fails. It’s not a rollback — it’s a new transaction that reverses the effect.
Action → Compensation
Create order (PENDING) → Cancel order (CANCELLED)
Charge payment → Refund payment
Reserve inventory → Release inventory
Send confirmation → Send cancellation email
Compensation rules
- Compensations must be idempotent: running them twice should be safe, because the refund might be triggered multiple times due to retries.
- Compensations may fail: what if the refund fails? You need alerting, manual intervention, or retry logic.
- Not all actions are compensable: you can’t unsend an email. Design around this (send confirmation only after the saga completes, not during).
- Execute in reverse order: compensate the most recent action first, working backwards.
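The rule about non-compensable actions can be checked mechanically when a saga is defined. The following is only a sketch under assumed names (`StepSpec` and `misplacedSteps` are hypothetical): it flags any step that cannot be undone but is followed by a step that can still fail.

```kotlin
// Hypothetical description of a saga step, for illustration only.
data class StepSpec(
    val name: String,
    val compensable: Boolean,  // can its effect be undone later?
    val canFail: Boolean       // can it still fail and abort the saga?
)

// Returns the names of non-compensable steps that are followed by a step
// that can still fail. An empty result means the ordering is safe.
fun misplacedSteps(steps: List<StepSpec>): List<String> =
    steps.withIndex()
        .filter { (index, step) ->
            !step.compensable && steps.drop(index + 1).any { it.canFail }
        }
        .map { it.value.name }
```

For example, a saga that sends the confirmation email before reserving inventory would be reported, while the same steps with the email moved to the end would pass.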
Saga state machine
Track the saga’s progress to handle failures correctly:
import java.time.Instant

enum class SagaState {
    STARTED,
    PAYMENT_PENDING,
    PAYMENT_COMPLETED,
    INVENTORY_PENDING,
    INVENTORY_RESERVED,
    CONFIRMED,
    // Compensation states
    PAYMENT_REFUNDING,
    PAYMENT_REFUNDED,
    CANCELLING,
    CANCELLED,
    FAILED
}

data class OrderSaga(
    val sagaId: String,
    val orderId: String,
    var state: SagaState,
    var paymentTransactionId: String? = null,
    var failureReason: String? = null,
    val createdAt: Instant,
    var updatedAt: Instant
)
The saga state machine transitions:
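Happy path:
STARTED → PAYMENT_PENDING → PAYMENT_COMPLETED → INVENTORY_PENDING → INVENTORY_RESERVED → CONFIRMED
Payment fails (from PAYMENT_PENDING):
PAYMENT_PENDING → CANCELLING → CANCELLED
Inventory reservation fails (from INVENTORY_PENDING):
INVENTORY_PENDING → PAYMENT_REFUNDING → PAYMENT_REFUNDED → CANCELLED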
Store the saga state in a database. On restart, resume from the last known state.
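When resuming from stored state, it helps to reject transitions the state machine does not allow. A minimal sketch, assuming the transitions listed above: `allowedTransitions`, `canTransition`, and `transition` are hypothetical names, and the edges into `FAILED` are an assumption of this sketch, since the original transitions do not show when `FAILED` is reached.

```kotlin
// Repeated from the enum above so that this sketch is self-contained.
enum class SagaState {
    STARTED, PAYMENT_PENDING, PAYMENT_COMPLETED, INVENTORY_PENDING,
    INVENTORY_RESERVED, CONFIRMED, PAYMENT_REFUNDING, PAYMENT_REFUNDED,
    CANCELLING, CANCELLED, FAILED
}

// Allowed next states per state. Terminal states have no entry.
// Assumption: a compensation that cannot complete moves the saga to FAILED.
val allowedTransitions: Map<SagaState, Set<SagaState>> = mapOf(
    SagaState.STARTED to setOf(SagaState.PAYMENT_PENDING),
    SagaState.PAYMENT_PENDING to setOf(SagaState.PAYMENT_COMPLETED, SagaState.CANCELLING),
    SagaState.PAYMENT_COMPLETED to setOf(SagaState.INVENTORY_PENDING),
    SagaState.INVENTORY_PENDING to setOf(SagaState.INVENTORY_RESERVED, SagaState.PAYMENT_REFUNDING),
    SagaState.INVENTORY_RESERVED to setOf(SagaState.CONFIRMED),
    SagaState.PAYMENT_REFUNDING to setOf(SagaState.PAYMENT_REFUNDED, SagaState.FAILED),
    SagaState.PAYMENT_REFUNDED to setOf(SagaState.CANCELLED),
    SagaState.CANCELLING to setOf(SagaState.CANCELLED, SagaState.FAILED)
)

fun canTransition(from: SagaState, to: SagaState): Boolean =
    to in allowedTransitions[from].orEmpty()

// Returns the new state, or throws IllegalStateException on an illegal move.
fun transition(from: SagaState, to: SagaState): SagaState {
    check(canTransition(from, to)) { "Illegal saga transition: $from -> $to" }
    return to
}
```

Keeping the table in one place means a duplicate or out-of-order event that would, say, move a `CONFIRMED` saga into `CANCELLING` is rejected instead of silently corrupting the stored state.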
Idempotency
Messages can be delivered more than once. Every step must be idempotent:
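@KafkaListener(topics = ["order-events"])
fun onOrderCreated(event: OrderCreated) {
    // Skip the event if this order has already moved past STARTED
    val existing = sagaRepository.findByOrderId(event.orderId)
    if (existing != null && existing.state != SagaState.STARTED) {
        return // already processed or in progress, skip
    }
    // Record the attempt before charging, so a redelivery that arrives
    // mid-charge is detected. The check and this update should run
    // atomically, e.g. as a conditional update in one local transaction.
    sagaRepository.updateState(event.orderId, SagaState.PAYMENT_PENDING)
    paymentGateway.charge(event.customerId, event.total)
    sagaRepository.updateState(event.orderId, SagaState.PAYMENT_COMPLETED)
}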
Use the saga ID or order ID as an idempotency key. Store processed events to detect duplicates.
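Storing processed events can be sketched as follows. This is an in-memory illustration under assumed names (`ProcessedEvents` and `handleOnce` are hypothetical); it combines the order ID with the event type, on the assumption that one order produces several different events over the life of a saga.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory store. A real service would more likely use a table
// with a unique key, written in the same local transaction as the state change.
class ProcessedEvents {
    private val seen = ConcurrentHashMap.newKeySet<String>()

    // Atomically records the key; returns true only the first time it is seen.
    fun markIfNew(orderId: String, eventType: String): Boolean =
        seen.add("$orderId:$eventType")
}

// Runs the handler at most once per (orderId, eventType) pair.
// Returns false when the event was a duplicate and the handler was skipped.
fun handleOnce(
    store: ProcessedEvents,
    orderId: String,
    eventType: String,
    handler: () -> Unit
): Boolean {
    if (!store.markIfNew(orderId, eventType)) return false
    handler()
    return true
}
```

Note that this sketch marks the event before running the handler, so a handler that throws would not be retried; doing both in one database transaction avoids that gap.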
Timeouts
What if a service never responds? Add timeouts:
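import java.time.Duration
import java.time.Instant
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

// Must be a Spring bean, with scheduling enabled (@EnableScheduling),
// for @Scheduled to fire.
@Component
class SagaTimeoutChecker(
    private val sagaRepository: SagaRepository,
    private val compensator: SagaCompensator
) {
    @Scheduled(fixedRate = 60_000) // check every minute
    fun checkTimeouts() {
        val stuckSagas = sagaRepository.findStuckSagas(
            olderThan = Instant.now().minus(Duration.ofMinutes(5))
        )
        stuckSagas.forEach { saga ->
            when (saga.state) {
                SagaState.PAYMENT_PENDING -> compensator.cancelOrder(saga)
                SagaState.INVENTORY_PENDING -> compensator.refundAndCancel(saga)
                else -> { /* no action needed */ }
            }
        }
    }
}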
If a saga is stuck in a pending state for more than 5 minutes, trigger compensation.
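The stuck-saga lookup itself might look like the sketch below. It is written against an in-memory list so that it runs on its own; `SagaRecord`, `stuckCutoff`, and this version of `findStuckSagas` are hypothetical, and a real repository would express the same filter as a database query on state and `updatedAt`.

```kotlin
import java.time.Duration
import java.time.Instant

// Trimmed-down stand-ins for the saga types, so that the sketch is self-contained.
enum class SagaState { PAYMENT_PENDING, INVENTORY_PENDING, CONFIRMED, CANCELLED }
data class SagaRecord(val orderId: String, val state: SagaState, val updatedAt: Instant)

val pendingStates = setOf(SagaState.PAYMENT_PENDING, SagaState.INVENTORY_PENDING)

// The oldest acceptable last-update time for a pending saga.
fun stuckCutoff(now: Instant, timeout: Duration = Duration.ofMinutes(5)): Instant =
    now.minus(timeout)

// A saga is stuck when it is still pending and was last updated before the cutoff.
fun findStuckSagas(sagas: List<SagaRecord>, olderThan: Instant): List<SagaRecord> =
    sagas.filter { it.state in pendingStates && it.updatedAt.isBefore(olderThan) }
```

Filtering on the last update time rather than the creation time is a design choice of this sketch: a saga that is slow overall but still making progress is not compensated prematurely.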
Monitoring
Track saga metrics:
// Completion rate
val successRate = sagaRepository.countByState(SagaState.CONFIRMED).toDouble() /
    sagaRepository.countAll()
// Average duration
val avgDuration = sagaRepository.averageDuration()
// Failure reasons
val failureReasons = sagaRepository.groupByFailureReason()
Alert on:
- High failure rate (> 5%)
- Stuck sagas (pending for too long)
- Compensation failures (refunds that didn’t go through)
- Increasing average duration
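The failure-rate alert can be sketched as a small calculation. The names here (`SagaCounts`, `failureRate`, `shouldAlert`) are hypothetical, and counting cancelled sagas as failures is an assumption; some teams may prefer to count only sagas that ended in `FAILED`.

```kotlin
// Hypothetical snapshot of finished saga counts, e.g. read from the saga table.
data class SagaCounts(val confirmed: Long, val cancelled: Long, val failed: Long)

// Share of finished sagas that did not end in CONFIRMED. Returns 0.0 when
// nothing has finished yet, which avoids dividing by zero.
fun failureRate(counts: SagaCounts): Double {
    val finished = counts.confirmed + counts.cancelled + counts.failed
    if (finished == 0L) return 0.0
    return (counts.cancelled + counts.failed).toDouble() / finished
}

// True when the failure rate exceeds the threshold (5% by default).
fun shouldAlert(counts: SagaCounts, threshold: Double = 0.05): Boolean =
    failureRate(counts) > threshold
```

Only finished sagas are counted, so a burst of in-flight sagas does not dilute the rate.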
Common mistakes
1. Treating saga as synchronous
Don’t wait for the entire saga to complete before responding to the user. Start the saga, return “Order placed” immediately, and update the user via push notifications or polling.
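A minimal sketch of this start-then-poll shape, under assumed names: `OrderApi` is hypothetical, and `startSaga` stands in for publishing `OrderCreated` or handing the order to an orchestrator.

```kotlin
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

enum class OrderStatus { PENDING, CONFIRMED, CANCELLED }

// Hypothetical API facade for illustration; statuses would normally live in
// the Order Service's database rather than in memory.
class OrderApi(private val startSaga: (orderId: String) -> Unit) {
    private val statuses = ConcurrentHashMap<String, OrderStatus>()

    // Returns immediately with an ID; the saga continues in the background.
    fun placeOrder(): String {
        val orderId = UUID.randomUUID().toString()
        statuses[orderId] = OrderStatus.PENDING
        startSaga(orderId)
        return orderId
    }

    // Called when the saga reaches a final state.
    fun complete(orderId: String, status: OrderStatus) {
        statuses[orderId] = status
    }

    // Polled by the client to learn the outcome; null for unknown orders.
    fun status(orderId: String): OrderStatus? = statuses[orderId]
}
```

The client receives the order ID right away and later sees `CONFIRMED` or `CANCELLED` through `status`, or through a push notification triggered from `complete`.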
2. Missing compensation for a step
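Every forward action that can be undone needs a compensation. If you add a new step to the saga (e.g., “notify warehouse”), you must also add its compensation (“cancel warehouse notification”). Steps that cannot be undone belong at the end of the saga, after everything that can fail.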
3. Not persisting saga state
If the orchestrator crashes, it needs to resume. Persist saga state to a database. Without persistence, in-flight sagas are lost.
4. Ignoring partial failures in compensation
If the refund fails, the saga is in an inconsistent state. You need alerting, retries with backoff, and eventually manual intervention for irrecoverable failures.
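Retries with backoff can be sketched as below. `retryCompensation` is a hypothetical helper, and the `sleep` parameter exists only so the delay can be replaced in tests; the compensation passed in must be idempotent, since it may run several times.

```kotlin
// Hypothetical helper: retries a compensation with exponential backoff.
// Returns true on success, false once all attempts are used up.
fun retryCompensation(
    maxAttempts: Int,
    initialDelayMs: Long,
    sleep: (Long) -> Unit = { Thread.sleep(it) },  // injectable for tests
    compensation: () -> Unit
): Boolean {
    var delayMs = initialDelayMs
    for (attempt in 1..maxAttempts) {
        try {
            compensation()
            return true
        } catch (e: Exception) {
            if (attempt == maxAttempts) break
            sleep(delayMs)
            delayMs *= 2  // double the wait before the next attempt
        }
    }
    // The caller should alert and queue the saga for manual intervention.
    return false
}
```

A call such as `retryCompensation(maxAttempts = 5, initialDelayMs = 200) { refund() }`, where `refund` is whatever compensation applies, would wait 200 ms, 400 ms, 800 ms, and 1600 ms between attempts before giving up.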
Summary
The Saga pattern replaces distributed transactions with a sequence of local transactions + compensations:
- Choreography — services react to events independently (simple sagas)
- Orchestration — central coordinator manages the flow (complex sagas)
- Compensation — reverse previous steps on failure
- Idempotency — every step must handle duplicate messages
- State machine — track progress, resume on failure, detect timeouts
Sagas add complexity. Only use them when you genuinely have services with separate databases that need coordinated updates. If you can put it in one transaction, do that instead.