Saga Pattern

A user places an order. Behind the scenes you need to create the order in the Order service, charge the card via the Payment service, reserve stock in the Inventory service, and dispatch a shipment via the Shipping service. If any step fails partway through, you need to undo the earlier steps so the user isn’t charged for an order that never ships. In a monolith, one ACID transaction handles this. Across microservices that each own their database, 2PC is the wrong answer: it requires XA support everywhere, holds cross-service locks during slow operations, and re-couples the services you decoupled. The Saga pattern is what you reach for instead: a chain of local transactions, each with a compensating action, that either runs through to success or is rolled back through compensation.

A Saga is a sequence of local transactions across multiple services, where each step has a compensating transaction that undoes its effect if a later step fails. It replaces distributed transactions (2PC) in microservice architectures where holding cross-service locks is impractical.

Why Not 2PC Across Services?

In a monolith, a single database transaction handles atomicity. In microservices, each service owns its database — there is no shared transaction manager.

Approach | Problem
2PC across services | Requires all services to support XA. Locks held across network boundaries. One slow service blocks all others. Tight coupling, the opposite of why you chose microservices.
“Just use one database” | Defeats the purpose of service autonomy. Schema coupling across teams.
Saga | Each service commits locally. Failures trigger compensating transactions. No distributed locks.

How a Saga Works

A saga breaks a business transaction into a series of steps T1 → T2 → … → Tn, where each Ti is a local ACID transaction within one service. For each Ti, there is a compensating transaction Ci that semantically reverses Ti’s effect.

Happy path:     T1 → T2 → T3 → T4 → SUCCESS

Failure at T3:  T1 → T2 → T3 ✗ → C2 → C1 → ABORTED
                              (compensate in reverse order)

Compensating transactions are not database rollbacks. They are new forward transactions that undo the business effect: a “refund” is a new credit transaction that offsets the original charge, not a DELETE of the original payment record.
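
The T1 → … → Tn flow with reverse-order compensation can be sketched as a small in-process executor. This is a minimal illustration, not a production orchestrator: the `Step` type, `run_saga`, and the `record`/`fail_shipping` stand-ins are hypothetical names, and the sketch assumes a failed step leaves no effect of its own to compensate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # T_i: local transaction in one service
    compensate: Callable[[dict], None]  # C_i: semantic reversal of T_i


def run_saga(steps: List[Step], ctx: dict) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["error"] = f"{step.name}: {exc}"
            # Assumption: the failed step had no effect, so only earlier steps
            # are compensated, most recent first.
            for done in reversed(completed):
                done.compensate(ctx)
            return "ABORTED"
    return "SUCCESS"


# Hypothetical stand-ins for service calls; they only record what happened.
trace: List[str] = []


def record(label: str) -> Callable[[dict], None]:
    return lambda ctx: trace.append(label)


def fail_shipping(ctx: dict) -> None:
    raise RuntimeError("out of delivery zone")


steps = [
    Step("order", record("T1"), record("C1")),
    Step("payment", record("T2"), record("C2")),
    Step("inventory", record("T3"), record("C3")),
    Step("shipping", fail_shipping, record("C4")),
]
ctx: dict = {}
result = run_saga(steps, ctx)
print(result, trace)  # ABORTED ['T1', 'T2', 'T3', 'C3', 'C2', 'C1']
```

A real orchestrator would also persist each state change and retry compensations; those concerns are left out here to keep the control flow visible.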

Orchestration vs Choreography

Orchestration (Recommended)

A central Saga Orchestrator directs each step, sends commands to services, and manages the compensation flow.

sequenceDiagram
    participant O as Saga Orchestrator
    participant OS as Order Service
    participant PS as Payment Service
    participant IS as Inventory Service
    participant SS as Shipping Service

    Note over O: Saga started — persisted to execution log

    O->>OS: CreateOrder
    OS->>O: OrderCreated (orderId=42)
    Note over O: Step 1 complete ✓

    O->>PS: ChargePayment ($99, orderId=42)
    PS->>O: PaymentCharged (paymentId=7)
    Note over O: Step 2 complete ✓

    O->>IS: ReserveInventory (SKU=ABC, qty=1)
    IS->>O: InventoryReserved (reservationId=15)
    Note over O: Step 3 complete ✓

    O->>SS: CreateShipment (orderId=42, address=...)
    SS->>O: ShipmentCreated
    Note over O: Step 4 complete ✓

    Note over O: Saga COMPLETED — all steps succeeded

When a step fails — compensation kicks in:

sequenceDiagram
    participant O as Saga Orchestrator
    participant OS as Order Service
    participant PS as Payment Service
    participant IS as Inventory Service
    participant SS as Shipping Service

    O->>OS: CreateOrder
    OS->>O: OrderCreated ✓

    O->>PS: ChargePayment ($99)
    PS->>O: PaymentCharged ✓

    O->>IS: ReserveInventory (SKU=ABC)
    IS->>O: InventoryReserved ✓

    O->>SS: CreateShipment
    SS->>O: ❌ FAILED (out of delivery zone)

    Note over O: Step 4 failed — begin compensation in reverse

    O->>IS: ReleaseInventory (reservationId=15)
    IS->>O: Released ✓

    O->>PS: RefundPayment (paymentId=7)
    PS->>O: Refunded ✓

    O->>OS: CancelOrder (orderId=42)
    OS->>O: Cancelled ✓

    Note over O: Saga COMPENSATED — all effects reversed

Advantages of orchestration:

  • Clear, linear flow — easy to understand, test, and debug
  • Single place to see the saga’s state and step history
  • Adding or reordering steps is straightforward
  • Timeout and retry logic centralized

Choreography (Event-Driven)

No central coordinator. Each service publishes events after completing its step, and other services react.

flowchart LR
    OS[Order Service] -->|OrderCreated| PS[Payment Service]
    PS -->|PaymentCharged| IS[Inventory Service]
    IS -->|InventoryReserved| SS[Shipping Service]
    SS -->|ShipmentFailed| IS
    IS -->|InventoryReleased| PS
    PS -->|PaymentRefunded| OS
    OS -->|OrderCancelled| X[Done]

Aspect | Orchestration | Choreography
Coupling | Services coupled to orchestrator | Services coupled to each other’s events
Visibility | Orchestrator has full state | No single view; requires distributed tracing
Complexity | Grows linearly with steps | Grows combinatorially (each service must handle all event types)
Cyclic deps | Impossible (orchestrator is the hub) | Possible (A listens to B, B listens to A)
Best for | Complex multi-step business processes | Simple 2-3 step workflows
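
To make the choreographed flow concrete, here is a small sketch in which each service is just a subscriber on an in-memory event bus. The `EventBus` class and the `in_zone` flag are assumptions made for illustration; a real system would use a message broker, and each handler would run a local transaction before publishing.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a message broker (illustrative assumption)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.history.append(event)
        for handler in self.handlers[event]:
            handler(payload)


bus = EventBus()

# Forward path: each service reacts to the previous event and emits its own.
bus.subscribe("OrderCreated", lambda p: bus.publish("PaymentCharged", p))
bus.subscribe("PaymentCharged", lambda p: bus.publish("InventoryReserved", p))


def shipping_on_reserved(p: dict) -> None:
    # Simulated business failure: the address is outside the delivery zone.
    bus.publish("ShipmentCreated" if p["in_zone"] else "ShipmentFailed", p)


bus.subscribe("InventoryReserved", shipping_on_reserved)

# Compensation path: failure events travel back through the same services.
bus.subscribe("ShipmentFailed", lambda p: bus.publish("InventoryReleased", p))
bus.subscribe("InventoryReleased", lambda p: bus.publish("PaymentRefunded", p))
bus.subscribe("PaymentRefunded", lambda p: bus.publish("OrderCancelled", p))

bus.publish("OrderCreated", {"orderId": 42, "in_zone": False})
print(bus.history)
```

Note that the only place the whole flow is visible is the list of subscriptions, which in practice is spread across several codebases. That is the visibility cost the comparison above describes.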
ℹ️

In system design interviews, default to orchestration. Say: “We’ll use a saga orchestrator because it gives us a single place to track the transaction state, handle retries, and trigger compensation. Choreography works for simple cases but becomes hard to reason about when the number of services grows.”

Compensating Transaction Rules

Compensating transactions are the hardest part of implementing sagas. They must follow strict rules:

1. Idempotent

A compensation may be retried if the orchestrator crashes and restarts. Calling RefundPayment(paymentId=7) twice must not issue two refunds.

First call:   RefundPayment(7) → creates refund, returns OK
Second call:  RefundPayment(7) → checks: refund already exists → returns OK (no-op)
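
One way to get this behaviour is to key refunds by the payment they reverse and check for an existing refund first. The sketch below uses a hypothetical in-memory `PaymentService`; in a real service the check and the insert would need to be atomic, for example through a unique constraint on the payment id.

```python
from typing import Dict


class PaymentService:
    """Hypothetical in-memory payment service, for illustration only."""

    def __init__(self) -> None:
        self.payments: Dict[int, int] = {7: 9900}  # paymentId -> amount in cents
        self.refunds: Dict[int, int] = {}          # paymentId -> refunded amount
        self.credits_issued = 0                    # how many refunds were actually created

    def refund_payment(self, payment_id: int) -> str:
        if payment_id in self.refunds:
            return "OK"  # refund already exists: no-op
        # New credit transaction; the original payment record is left untouched.
        self.refunds[payment_id] = self.payments[payment_id]
        self.credits_issued += 1
        return "OK"


service = PaymentService()
first = service.refund_payment(7)
second = service.refund_payment(7)  # retry after a simulated orchestrator restart
print(first, second, service.credits_issued)  # OK OK 1
```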

2. Always Succeed (Eventually)

A compensation cannot fail permanently. If it does, the saga is stuck in an inconsistent state — some effects were applied but not all were reversed.

Design compensations to retry with exponential backoff. If a service is down, the orchestrator holds the compensation in a retry queue until the service recovers.
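
A retry loop with exponential backoff might look like the following sketch. The function name, the delay parameters, and the injectable `sleep` are assumptions for illustration; a production version would typically add jitter, persist the pending compensation, and raise an alert after prolonged failure.

```python
import time
from typing import Callable, Dict, List


def retry_until_success(
    compensate: Callable[[], None],
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call compensate() until it succeeds; return the number of attempts."""
    attempt = 0
    while True:
        attempt += 1
        try:
            compensate()
            return attempt
        except Exception:
            # Delay doubles on each failure, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


# Simulated service that is down for the first two calls.
calls: Dict[str, int] = {"n": 0}


def flaky_refund() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("payment service unavailable")


delays: List[float] = []
attempts = retry_until_success(flaky_refund, sleep=delays.append)
print(attempts, delays)  # 3 [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; with the default, the loop really waits between attempts.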

3. Semantically Reverse, Not Undo

Step | Compensation | NOT
Charge $99 | Refund $99 (new credit transaction) | DELETE FROM payments
Reserve 1 unit | Release reservation | DELETE FROM inventory
Send confirmation email | Send cancellation email | You can’t unsend email
Ship package | Create return label + notify carrier | You can’t un-ship

Some effects are not reversible (email sent, SMS sent, physical shipment). In these cases, compensations are corrective — they issue a follow-up action rather than undoing the original.

4. Compensation Order

Compensate in reverse order of execution. This ensures that downstream services don’t see inconsistencies — you undo the most recent effect first, just like unwinding a call stack.

Saga Execution Log

The orchestrator persists its state to a durable store so it can recover after a crash:

saga_execution_log:
┌──────┬────────┬───────────┬──────────┬──────────────────────┐
│ saga │ step   │ service   │ status   │ response             │
├──────┼────────┼───────────┼──────────┼──────────────────────┤
│ S-42 │ 1      │ Order     │ DONE     │ orderId=42           │
│ S-42 │ 2      │ Payment   │ DONE     │ paymentId=7          │
│ S-42 │ 3      │ Inventory │ DONE     │ reservationId=15     │
│ S-42 │ 4      │ Shipping  │ FAILED   │ "out of zone"        │
│ S-42 │ C3     │ Inventory │ DONE     │ released             │
│ S-42 │ C2     │ Payment   │ PENDING  │ (orchestrator crash) │
└──────┴────────┴───────────┴──────────┴──────────────────────┘

On recovery: orchestrator reads log → sees C2 is PENDING → retries RefundPayment
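
The recovery step can be sketched as a function that replays the log and runs whatever compensations are not yet marked DONE. The row layout mirrors the table above in simplified form, while `recover` and the `compensations` mapping are hypothetical names. The sketch only covers a saga that is already compensating, and it relies on every compensation being idempotent.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def recover(log: List[Row], compensations: Dict[str, Callable[[], None]]) -> List[str]:
    """Finish the compensation of a failed saga after an orchestrator restart."""
    if not any(r["status"] == "FAILED" for r in log):
        return []  # resuming the forward path is out of scope for this sketch
    forward_done = sorted(
        (int(r["step"]) for r in log
         if r["status"] == "DONE" and not r["step"].startswith("C")),
        reverse=True,
    )
    comp_done = {r["step"] for r in log
                 if r["status"] == "DONE" and r["step"].startswith("C")}
    executed: List[str] = []
    for n in forward_done:  # reverse order of execution
        label = f"C{n}"
        if label in comp_done:
            continue
        compensations[label]()  # safe to repeat because it is idempotent
        for row in log:
            if row["step"] == label:
                row["status"] = "DONE"
                break
        else:
            log.append({"saga": log[0]["saga"], "step": label, "status": "DONE"})
        executed.append(label)
    return executed


log: List[Row] = [
    {"saga": "S-42", "step": "1", "status": "DONE"},
    {"saga": "S-42", "step": "2", "status": "DONE"},
    {"saga": "S-42", "step": "3", "status": "DONE"},
    {"saga": "S-42", "step": "4", "status": "FAILED"},
    {"saga": "S-42", "step": "C3", "status": "DONE"},
    {"saga": "S-42", "step": "C2", "status": "PENDING"},
]
called: List[str] = []
compensations = {
    "C3": lambda: called.append("ReleaseInventory"),
    "C2": lambda: called.append("RefundPayment"),
    "C1": lambda: called.append("CancelOrder"),
}
executed = recover(log, compensations)
print(executed, called)  # ['C2', 'C1'] ['RefundPayment', 'CancelOrder']
```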

The Isolation Problem

Unlike 2PC, sagas do not provide isolation. Intermediate states are visible to concurrent transactions:

Timeline:
  T=0  Order created (status=PENDING)          ← other queries see PENDING order
  T=1  Payment charged                         ← money is gone
  T=2  Inventory reserved
  T=3  Shipping fails → begin compensation
  T=4  Inventory released                      ← but order still shows as PENDING
  T=5  Payment refunded
  T=6  Order cancelled

  Between T=1 and T=5, the user's payment is charged but the order isn't fulfilled.
  Between T=0 and T=6, another service querying orders sees an order that will be cancelled.

Countermeasures

Technique | How it works | Example
Semantic lock | Mark resources as “pending” during the saga | Order status = PENDING_FULFILLMENT until saga completes
Commutative updates | Design operations so order doesn’t matter | Counter increments are commutative, so saga rollback just decrements
Pessimistic view | Reorder saga steps so the riskiest intermediate state is exposed for the shortest time | Run the step whose dirty read would be most costly (for example, releasing funds or credit) as late as possible
Reread value | Reread the record before acting and verify it hasn’t changed since the saga read it | Before refunding, check that the payment still exists (it might have been separately voided); include a version number in commands and reject on mismatch
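
Two of these countermeasures, the semantic lock and the version check, can be sketched together. The `OrderStore` class, its method names, and the `VersionMismatch` exception are hypothetical; the sketch assumes readers are willing to filter on the pending status and that every command carries the version it last read.

```python
from typing import Dict, List


class VersionMismatch(Exception):
    """Raised when a command carries a stale version number."""


class OrderStore:
    """Hypothetical in-memory order store, for illustration only."""

    def __init__(self) -> None:
        self.orders: Dict[int, dict] = {
            42: {"status": "PENDING_FULFILLMENT", "version": 1},
        }

    def visible_orders(self) -> List[int]:
        # Semantic lock: hide orders that a running saga still owns.
        return [oid for oid, o in self.orders.items()
                if o["status"] != "PENDING_FULFILLMENT"]

    def set_status(self, order_id: int, status: str, expected_version: int) -> int:
        order = self.orders[order_id]
        # Version check: reject if the record changed since the caller read it.
        if order["version"] != expected_version:
            raise VersionMismatch(
                f"expected v{expected_version}, found v{order['version']}"
            )
        order["status"] = status
        order["version"] += 1
        return order["version"]


store = OrderStore()
hidden_during_saga = store.visible_orders()  # [] while the saga is running
new_version = store.set_status(42, "CONFIRMED", expected_version=1)

try:
    store.set_status(42, "CANCELLED", expected_version=1)  # stale command
    stale_rejected = False
except VersionMismatch:
    stale_rejected = True

print(hidden_during_saga, new_version, stale_rejected, store.visible_orders())
```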

Saga vs 2PC — Decision Guide

Does the transaction span multiple services?
├── No → use a local ACID transaction
└── Yes
    ├── Are all services in the same database? → consider 2PC (XA)
    └── Different databases / different teams?
        ├── Can you tolerate temporary inconsistency? → Saga
        └── Need strict atomicity? → Redesign: merge services or use a shared database

Aspect | 2PC | Saga
Atomicity | All-or-nothing, immediate | All-or-compensate, eventual
Isolation | Full (locks held) | None (intermediate states visible)
Lock duration | Entire protocol (seconds) | None (each step is local)
Failure mode | Blocking (coordinator crash) | Non-blocking (orchestrator recovers from log)
Best for | Same-database cross-shard | Cross-service business transactions
⚠️

Common interview mistake: Candidates often say “we’ll use a saga to make this atomic.” Sagas are not atomic in the traditional sense — they provide eventual consistency through compensation. The correct framing is: “We’ll use a saga to coordinate this cross-service flow. The trade-off is temporary inconsistency during the saga execution window, which we mitigate with semantic locks and idempotent compensations.”

ℹ️

Interview tip: When a workflow spans multiple services with their own databases, I reach for a Saga and explicitly not 2PC — distributed transactions across microservices recouple the services you intentionally decoupled. I’d default to orchestration over choreography because a central orchestrator gives you a single place to track saga state, retry, and trigger compensations; choreography only stays simple for 2–3 step flows before the implicit event graph becomes impossible to reason about. The two correctness rules I’d state unprompted: compensating transactions must be idempotent (orchestrator may retry them after crashes, so RefundPayment(id) called twice must produce one refund), and they must always eventually succeed — design with retry queues, not give-up paths. And I’d flag the tradeoff honestly: sagas give eventual consistency, not atomicity, so I’d use semantic locks (status = PENDING_FULFILLMENT) to hide intermediate states from concurrent readers.