Saga Pattern — Managing Distributed Transactions

In a monolith, you wrap everything in a database transaction:

@Transactional
fun placeOrder(order: Order) {
    orderRepository.save(order)
    paymentService.charge(order.total)
    inventoryService.reserve(order.items)
    emailService.sendConfirmation(order)
}

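If any step throws, the transaction rolls back every database write made inside it. Simple, safe, consistent (the one exception is a side effect like the email, which can’t be unsent).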

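In microservices, there’s no shared transaction. Each service has its own database, so you can’t atomically update Order DB + Payment DB + Inventory DB without a distributed commit protocol such as two-phase commit, which ties every participant’s availability together and is rarely practical across independently deployed services. The Saga pattern is the usual alternative: a sequence of local transactions coordinated through events or commands, with compensation logic to handle failures.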

What is a Saga?

A Saga is a sequence of local transactions. Each transaction updates a single service’s database and publishes an event. The next service reacts to the event and performs its own local transaction. If any step fails, the Saga executes compensating transactions to undo the previous steps.

Happy path:
  Order Service: Create order (PENDING)
  → Payment Service: Charge customer
  → Inventory Service: Reserve stock
  → Order Service: Confirm order (CONFIRMED)

Failure path (payment fails):
  Order Service: Create order (PENDING)
  → Payment Service: Charge fails
  → Order Service: Cancel order (CANCELLED)

Failure path (inventory fails):
  Order Service: Create order (PENDING)
  → Payment Service: Charge customer
  → Inventory Service: Reserve fails
  → Payment Service: Refund customer (compensation)
  → Order Service: Cancel order (CANCELLED)
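
The three paths above share one mechanism: run the steps in order and, on the first failure, undo the completed ones in reverse. The following is a minimal in-memory sketch of that idea, not a production implementation. `SagaStep`, `runSaga` and `inventoryFailureDemo` are hypothetical names, and the log entries stand in for real service calls.

```kotlin
// A saga step pairs a forward action with the compensation that undoes it.
class SagaStep(
    val name: String,
    val action: () -> Unit,
    val compensation: () -> Unit
)

// Runs steps in order. On the first failure, compensates the already-completed
// steps in reverse order and returns the failed step's name (null on success).
fun runSaga(steps: List<SagaStep>): String? {
    val completed = mutableListOf<SagaStep>()
    for (step in steps) {
        try {
            step.action()
            completed.add(step)
        } catch (e: Exception) {
            completed.asReversed().forEach { it.compensation() }
            return step.name
        }
    }
    return null
}

// Hypothetical wiring for the "inventory fails" path shown above.
fun inventoryFailureDemo(): Pair<String?, List<String>> {
    val log = mutableListOf<String>()
    val steps = listOf(
        SagaStep("create-order", { log += "order created" }, { log += "order cancelled" }),
        SagaStep("charge-payment", { log += "payment charged" }, { log += "payment refunded" }),
        SagaStep("reserve-stock", { throw IllegalStateException("out of stock") }, { log += "stock released" })
    )
    return runSaga(steps) to log
}
```

The failed step itself is not compensated, because its local transaction never committed. Only the steps that completed before it are undone.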

Two approaches: Choreography vs Orchestration

Choreography — Event-driven

Each service listens for events and decides what to do. No central coordinator.

Order Service → publishes OrderCreated
                     ↓
Payment Service → processes payment → publishes PaymentProcessed
                                            ↓
Inventory Service → reserves stock → publishes StockReserved
                                            ↓
Order Service → confirms order → publishes OrderConfirmed

Each service reacts independently:

// Payment Service
@KafkaListener(topics = ["order-events"])
fun onOrderCreated(event: OrderCreated) {
    try {
        val result = paymentGateway.charge(event.customerId, event.total)
        kafkaTemplate.send("payment-events", PaymentProcessed(
            orderId = event.orderId,
            transactionId = result.transactionId,
            // assumes OrderCreated carries the items; forwarded so the
            // Inventory Service knows what to reserve
            items = event.items
        ))
    } catch (e: PaymentDeclinedException) {
        kafkaTemplate.send("payment-events", PaymentFailed(
            orderId = event.orderId,
            reason = e.message
        ))
    }
}

// Inventory Service
@KafkaListener(topics = ["payment-events"])
fun onPaymentProcessed(event: PaymentProcessed) {
    try {
        inventoryRepository.reserve(event.orderId, event.items)
        kafkaTemplate.send("inventory-events", StockReserved(
            orderId = event.orderId
        ))
    } catch (e: InsufficientStockException) {
        kafkaTemplate.send("inventory-events", StockReservationFailed(
            orderId = event.orderId,
            reason = e.message
        ))
    }
}

// Payment Service — compensation
@KafkaListener(topics = ["inventory-events"])
fun onStockReservationFailed(event: StockReservationFailed) {
    paymentGateway.refund(event.orderId)
    kafkaTemplate.send("payment-events", PaymentRefunded(
        orderId = event.orderId
    ))
}
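
The listeners above cover the Payment and Inventory services. The Order Service also has to react to the outcomes to close the loop. A minimal sketch of that decision as a pure function follows. The event classes here are simplified stand-ins for the ones published above, and `OrderStatus` and `nextStatus` are hypothetical names.

```kotlin
// Outcome events the Order Service would consume in this choreography.
sealed interface SagaEvent { val orderId: String }
data class StockReserved(override val orderId: String) : SagaEvent
data class PaymentFailed(override val orderId: String, val reason: String?) : SagaEvent
data class PaymentRefunded(override val orderId: String) : SagaEvent

enum class OrderStatus { PENDING, CONFIRMED, CANCELLED }

// Decides the order's next status. Only PENDING orders change, so a
// redelivered event leaves an already-finalised order untouched.
fun nextStatus(current: OrderStatus, event: SagaEvent): OrderStatus {
    if (current != OrderStatus.PENDING) return current
    return when (event) {
        is StockReserved -> OrderStatus.CONFIRMED
        is PaymentFailed -> OrderStatus.CANCELLED
        is PaymentRefunded -> OrderStatus.CANCELLED
    }
}
```

Keeping the decision separate from the Kafka listener makes it easy to unit-test each path without a broker.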

Pros: Simple, loosely coupled, no single point of failure. Cons: Hard to track the overall flow, scattered logic, difficult to debug.

Orchestration — Central coordinator

A central orchestrator (Saga manager) tells each service what to do:

import kotlinx.coroutines.CancellationException

class OrderSagaOrchestrator(
    private val orderService: OrderServiceClient,
    private val paymentService: PaymentServiceClient,
    private val inventoryService: InventoryServiceClient
) {
    suspend fun execute(command: PlaceOrderCommand): SagaResult {
        // Step 1: Create order
        val order = orderService.create(command)

        // Step 2: Process payment
        val paymentResult = try {
            paymentService.charge(order.id, command.total)
        } catch (e: CancellationException) {
            throw e // coroutine cancellation is not a business failure; don't swallow it
        } catch (e: Exception) {
            // Compensate step 1
            orderService.cancel(order.id)
            return SagaResult.Failed("Payment failed: ${e.message}")
        }

        // Step 3: Reserve inventory
        try {
            inventoryService.reserve(order.id, command.items)
        } catch (e: CancellationException) {
            throw e
        } catch (e: Exception) {
            // Compensate step 2 then step 1
            paymentService.refund(paymentResult.transactionId)
            orderService.cancel(order.id)
            return SagaResult.Failed("Inventory failed: ${e.message}")
        }

        // All steps succeeded
        orderService.confirm(order.id)
        return SagaResult.Success(order.id)
    }
}

Pros: Clear flow, easy to understand, centralized error handling. Cons: Single point of failure (orchestrator), tighter coupling, orchestrator becomes complex.
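
The orchestrator returns a `SagaResult` that is never defined in the text. One plausible shape is a sealed class with the two cases the code uses. The definition below is an assumption, and `describe` is a hypothetical helper showing how a caller might branch on the outcome.

```kotlin
// Hypothetical shape for the SagaResult returned by execute().
sealed class SagaResult {
    data class Success(val orderId: String) : SagaResult()
    data class Failed(val reason: String) : SagaResult()
}

// A caller can branch exhaustively on the outcome; the messages are illustrative.
fun describe(result: SagaResult): String = when (result) {
    is SagaResult.Success -> "Order ${result.orderId} confirmed"
    is SagaResult.Failed -> "Order not placed: ${result.reason}"
}
```

Because the class is sealed, the compiler flags every `when` that forgets a case if a third outcome is added later.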

Which to choose?

Factor            Choreography                   Orchestration
Complexity        Distributed (many listeners)   Centralized (one coordinator)
Coupling          Loose                          Tighter
Visibility        Hard to trace                  Easy to trace
Failure handling  Distributed compensation       Centralized compensation
Best for          Simple sagas (2-3 steps)       Complex sagas (4+ steps)

For 2-3 services, choreography is fine. For complex flows with many steps and conditional logic, orchestration is clearer.

Compensation transactions

Compensation is how you undo a step when a later step fails. It’s not a rollback — it’s a new transaction that reverses the effect.

Action                  → Compensation
Create order (PENDING)  → Cancel order (CANCELLED)
Charge payment          → Refund payment
Reserve inventory       → Release inventory
Send confirmation       → Send cancellation email

Compensation rules

Compensations must be idempotent: running one twice has to be safe, because retries can trigger the same refund more than once.
Compensations can fail: if the refund itself fails, you need retry logic, alerting and, as a last resort, manual intervention.
Not all actions are compensable: you can’t unsend an email. Design around this by sending the confirmation only after the saga completes, not during it.
Execute in reverse order: compensate the most recent action first, then work backwards.
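
To make the idempotency rule concrete, here is a minimal sketch of a refund that is safe to trigger twice. `RefundGateway` and `IdempotentRefunder` are hypothetical names, and the in-memory set stands in for a durable table keyed by transaction ID.

```kotlin
// Hypothetical gateway interface; a real payment client will look different.
fun interface RefundGateway {
    fun refund(transactionId: String)
}

// Wraps the gateway so a repeated compensation for the same transaction is a no-op.
class IdempotentRefunder(private val gateway: RefundGateway) {
    private val refunded = mutableSetOf<String>()

    // Returns true if a refund was issued, false if it had already been done.
    fun refundOnce(transactionId: String): Boolean {
        if (transactionId in refunded) return false
        // If the gateway throws, the id is not recorded, so a later retry is still allowed.
        gateway.refund(transactionId)
        refunded += transactionId
        return true
    }
}
```

This sketch is not thread-safe, and it does not survive a restart. In practice, many payment providers also accept an idempotency key, which guards against duplicates on the provider side.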

Saga state machine

Track the saga’s progress to handle failures correctly:

enum class SagaState {
    STARTED,
    PAYMENT_PENDING,
    PAYMENT_COMPLETED,
    INVENTORY_PENDING,
    INVENTORY_RESERVED,
    CONFIRMED,
    // Compensation states
    PAYMENT_REFUNDING,
    PAYMENT_REFUNDED,
    CANCELLING,
    CANCELLED,
    FAILED
}

data class OrderSaga(
    val sagaId: String,
    val orderId: String,
    var state: SagaState,
    var paymentTransactionId: String? = null,
    var failureReason: String? = null,
    val createdAt: Instant,
    var updatedAt: Instant
)

The saga state machine transitions:

STARTED → PAYMENT_PENDING → PAYMENT_COMPLETED → INVENTORY_PENDING → INVENTORY_RESERVED → CONFIRMED
                    ↓                                        ↓
               CANCELLING                          PAYMENT_REFUNDING → PAYMENT_REFUNDED → CANCELLED
                    ↓
                CANCELLED

Store the saga state in a database. On restart, resume from the last known state.
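
Resuming safely is easier when illegal moves are rejected up front. Below is a sketch of the diagram as a transition table. The enum is repeated so the block stands alone, `allowedTransitions`, `canTransition` and `transition` are hypothetical names, and the two transitions into FAILED are an assumption, since the diagram does not show how FAILED is reached.

```kotlin
enum class SagaState {
    STARTED, PAYMENT_PENDING, PAYMENT_COMPLETED, INVENTORY_PENDING, INVENTORY_RESERVED, CONFIRMED,
    PAYMENT_REFUNDING, PAYMENT_REFUNDED, CANCELLING, CANCELLED, FAILED
}

// Legal moves, mirroring the diagram. Terminal states have no entry.
// Assumption: a compensation that cannot complete ends in FAILED.
val allowedTransitions: Map<SagaState, Set<SagaState>> = mapOf(
    SagaState.STARTED to setOf(SagaState.PAYMENT_PENDING),
    SagaState.PAYMENT_PENDING to setOf(SagaState.PAYMENT_COMPLETED, SagaState.CANCELLING),
    SagaState.PAYMENT_COMPLETED to setOf(SagaState.INVENTORY_PENDING),
    SagaState.INVENTORY_PENDING to setOf(SagaState.INVENTORY_RESERVED, SagaState.PAYMENT_REFUNDING),
    SagaState.INVENTORY_RESERVED to setOf(SagaState.CONFIRMED),
    SagaState.PAYMENT_REFUNDING to setOf(SagaState.PAYMENT_REFUNDED, SagaState.FAILED),
    SagaState.PAYMENT_REFUNDED to setOf(SagaState.CANCELLED),
    SagaState.CANCELLING to setOf(SagaState.CANCELLED, SagaState.FAILED)
)

fun canTransition(from: SagaState, to: SagaState): Boolean =
    to in allowedTransitions[from].orEmpty()

// Returns the new state, or throws IllegalArgumentException for an illegal move.
fun transition(from: SagaState, to: SagaState): SagaState {
    require(canTransition(from, to)) { "Illegal saga transition: $from -> $to" }
    return to
}
```

Routing every state update through `transition` turns a duplicated or out-of-order message into a loud error instead of silent state corruption.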

Idempotency

Messages can be delivered more than once. Every step must be idempotent:

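@KafkaListener(topics = ["order-events"])
fun onOrderCreated(event: OrderCreated) {
    // Check if already processed. Enum comparison uses declaration order, so
    // every state from PAYMENT_PENDING onwards counts as "already handled".
    val existing = sagaRepository.findByOrderId(event.orderId)
    if (existing != null && existing.state >= SagaState.PAYMENT_PENDING) {
        return // already processed, skip
    }

    // Record the attempt before charging, so a redelivery after a crash is
    // skipped rather than charged again (the stuck saga is then a timeout case)
    sagaRepository.updateState(event.orderId, SagaState.PAYMENT_PENDING)
    paymentGateway.charge(event.customerId, event.total)
    sagaRepository.updateState(event.orderId, SagaState.PAYMENT_COMPLETED)
}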

Use the saga ID or order ID as an idempotency key. Store processed events to detect duplicates.
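
One way to store processed events is a small deduplication guard in front of the handler. The sketch below keeps the IDs in memory. `ProcessedEvents` and `handleOnce` are hypothetical names, and a real service would more likely use a database table with a unique constraint so the check survives restarts.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Remembers which event ids have been handled.
class ProcessedEvents {
    private val seen = ConcurrentHashMap.newKeySet<String>()

    // Runs the handler only the first time an id is seen; returns whether it ran.
    fun handleOnce(eventId: String, handler: () -> Unit): Boolean {
        // add() is atomic and returns false when the id is already present
        if (!seen.add(eventId)) return false
        try {
            handler()
        } catch (e: Exception) {
            seen.remove(eventId) // let a redelivery retry the failed attempt
            throw e
        }
        return true
    }
}
```

A key such as the order ID combined with the event type works when each event type occurs at most once per order.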

Timeouts

What if a service never responds? Add timeouts:

import java.time.Duration
import java.time.Instant
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

@Component // must be a Spring bean, with scheduling enabled, for @Scheduled to fire
class SagaTimeoutChecker(
    private val sagaRepository: SagaRepository,
    private val compensator: SagaCompensator
) {
    @Scheduled(fixedRate = 60_000) // check every minute
    fun checkTimeouts() {
        val stuckSagas = sagaRepository.findStuckSagas(
            olderThan = Instant.now().minus(Duration.ofMinutes(5))
        )

        stuckSagas.forEach { saga ->
            when (saga.state) {
                SagaState.PAYMENT_PENDING -> compensator.cancelOrder(saga)
                SagaState.INVENTORY_PENDING -> compensator.refundAndCancel(saga)
                else -> { /* no action needed */ }
            }
        }
    }
}

If a saga is stuck in a pending state for more than 5 minutes, trigger compensation.
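
The `findStuckSagas` query is not shown. An in-memory equivalent of what it would need to select might look like the sketch below. `SagaRecord` is a hypothetical type, the state is a plain string only to keep the block standalone, and comparing against `updatedAt` rather than `createdAt` is an assumption about the intended semantics.

```kotlin
import java.time.Duration
import java.time.Instant

data class SagaRecord(val sagaId: String, val state: String, val updatedAt: Instant)

// Only pending states can be "stuck"; terminal states are never selected.
val pendingStates = setOf("PAYMENT_PENDING", "INVENTORY_PENDING")

// Returns sagas that are pending and whose last update is older than the timeout.
fun findStuckSagas(sagas: List<SagaRecord>, now: Instant, timeout: Duration): List<SagaRecord> {
    val cutoff = now.minus(timeout)
    return sagas.filter { it.state in pendingStates && it.updatedAt.isBefore(cutoff) }
}
```

Passing `now` in as a parameter keeps the function deterministic, which makes the timeout rule easy to test.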

Monitoring

Track saga metrics:

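// Completion rate: share of all sagas (in-flight ones included) that reached CONFIRMED
val total = sagaRepository.countAll()
val successRate = if (total > 0) {
    sagaRepository.countByState(SagaState.CONFIRMED).toDouble() / total
} else 0.0 // avoid 0/0 (NaN) before the first saga exists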

// Average duration
val avgDuration = sagaRepository.averageDuration()

// Failure reasons
val failureReasons = sagaRepository.groupByFailureReason()

Alert on:

High failure rate (> 5%)
Stuck sagas (pending for too long)
Compensation failures (refunds that didn’t go through)
Increasing average duration
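
The repository methods above are not defined, so here is a sketch of the same metrics over an in-memory list of finished sagas. `FinishedSaga` and the function names are hypothetical, and the default threshold simply mirrors the 5% figure from the alert list.

```kotlin
data class FinishedSaga(val confirmed: Boolean, val failureReason: String? = null)

// Success rate over finished sagas only; null when nothing has finished yet.
fun successRate(finished: List<FinishedSaga>): Double? =
    if (finished.isEmpty()) null
    else finished.count { it.confirmed }.toDouble() / finished.size

// Number of failed sagas per failure reason.
fun failuresByReason(finished: List<FinishedSaga>): Map<String, Int> =
    finished.filter { !it.confirmed }
        .groupingBy { it.failureReason ?: "unknown" }
        .eachCount()

// True when the failure rate exceeds the threshold.
fun shouldAlert(finished: List<FinishedSaga>, maxFailureRate: Double = 0.05): Boolean {
    val rate = successRate(finished) ?: return false
    return (1.0 - rate) > maxFailureRate
}
```

Computing the rate over finished sagas only keeps a burst of in-flight orders from looking like a drop in success rate.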

Common mistakes

1. Treating saga as synchronous

Don’t wait for the entire saga to complete before responding to the user. Start the saga, return “Order placed” immediately, and update the user via push notifications or polling.
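
A minimal sketch of that shape follows: accept the order, return an ID at once, and let the client poll for the outcome. `OrderFrontDoor` and `OrderProgress` are hypothetical names. In this sketch the saga runs elsewhere and reports back through `complete`.

```kotlin
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

enum class OrderProgress { PENDING, CONFIRMED, CANCELLED }

// Accepts the order, records it as PENDING and returns immediately.
class OrderFrontDoor {
    private val statuses = ConcurrentHashMap<String, OrderProgress>()

    fun placeOrder(): String {
        val orderId = UUID.randomUUID().toString()
        statuses[orderId] = OrderProgress.PENDING
        return orderId // the caller gets an id straight away, not a final result
    }

    // Called by the saga (or its final event handler) when the outcome is known.
    fun complete(orderId: String, outcome: OrderProgress) {
        statuses[orderId] = outcome
    }

    // What a polling endpoint would return; null for unknown ids.
    fun status(orderId: String): OrderProgress? = statuses[orderId]
}
```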

2. Missing compensation for a step

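Every forward action that changes state needs a compensation. If you add a new step to the saga (e.g., “notify warehouse”), add its compensation (“cancel warehouse notification”) in the same change. Steps that can’t be undone belong at the end of the saga, after everything that can fail.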

3. Not persisting saga state

If the orchestrator crashes, it needs to resume. Persist saga state to a database. Without persistence, in-flight sagas are lost.

4. Ignoring partial failures in compensation

If the refund fails, the saga is in an inconsistent state. You need alerting, retries with backoff, and eventually manual intervention for irrecoverable failures.
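
Retries with backoff and eventual escalation can be sketched as below. All names are hypothetical, the delays (100 ms base, doubling, capped at 10 s) are illustrative defaults, and the `sleep` parameter exists only so the wait can be replaced in tests.

```kotlin
sealed class CompensationOutcome {
    object Done : CompensationOutcome()
    data class NeedsManualIntervention(val attempts: Int, val lastError: String?) : CompensationOutcome()
}

// Delay before the retry that follows attempt n (0-based): base * 2^n, capped at maxMillis.
fun backoffMillis(attempt: Int, baseMillis: Long = 100, maxMillis: Long = 10_000): Long =
    minOf(maxMillis, baseMillis shl attempt.coerceAtMost(20))

// Tries the compensation up to maxAttempts times, sleeping between attempts.
// When every attempt fails, it reports the failure instead of throwing, so the
// caller can mark the saga for alerting and manual follow-up.
fun compensateWithRetry(
    maxAttempts: Int,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    compensation: () -> Unit
): CompensationOutcome {
    var lastError: String? = null
    for (attempt in 0 until maxAttempts) {
        try {
            compensation()
            return CompensationOutcome.Done
        } catch (e: Exception) {
            lastError = e.message
            if (attempt < maxAttempts - 1) sleep(backoffMillis(attempt))
        }
    }
    return CompensationOutcome.NeedsManualIntervention(maxAttempts, lastError)
}
```

Retrying is only safe when the compensation is idempotent, because a timed-out attempt may still have succeeded on the remote side.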

Summary

The Saga pattern replaces distributed transactions with a sequence of local transactions + compensations:

Choreography — services react to events independently (simple sagas)
Orchestration — central coordinator manages the flow (complex sagas)
Compensation — reverse previous steps on failure
Idempotency — every step must handle duplicate messages
State machine — track progress, resume on failure, detect timeouts

Sagas add complexity. Only use them when you genuinely have services with separate databases that need coordinated updates. If you can put it in one transaction, do that instead.