Two-Phase Commit (2PC)
A user transfers $100 between two accounts that happen to live on different database shards. The debit succeeds on shard A; the credit on shard B fails because the network blips just before commit. Now $100 has vanished — left one account but never arrived at the other. Reorder the writes and you get a different inconsistency. This is the atomic commit problem: how do you guarantee that all participants in a distributed transaction either commit together or abort together, even when any of them can crash at any moment? Two-Phase Commit is the canonical answer — used inside Spanner and CockroachDB for cross-shard transactions, and via XA for heterogeneous resources.
Two-Phase Commit is a protocol that ensures all participants in a distributed transaction either all commit or all abort. It solves the atomic commit problem: when a transaction spans multiple nodes, how do you guarantee that no node commits while another aborts?
The Atomic Commit Problem
A single database handles atomicity with a local transaction log — write the commit record, then it’s committed. But when a transaction touches data on multiple nodes (e.g., transferring money between two banks), a local commit on one node says nothing about the other.
The challenge: any participant can fail at any time. If Node A commits and Node B crashes before committing, the system is in an inconsistent state — the money left one account but never arrived at the other.
2PC solves this by splitting the commit into two phases, with a coordinator orchestrating the decision.
The Protocol
Phase 1 — Prepare
sequenceDiagram
participant C as Coordinator
participant A as Participant A (Bank 1)
participant B as Participant B (Bank 2)
C->>C: Write BEGIN to transaction log
par Prepare in parallel
C->>A: PREPARE (txn=T1, debit $100 from account X)
C->>B: PREPARE (txn=T1, credit $100 to account Y)
end
Note over A: Acquire row lock on account X
Write undo log (balance=500)
Write redo log (balance=400)
fsync to disk
Note over B: Acquire row lock on account Y
Write undo log (balance=200)
Write redo log (balance=300)
fsync to disk
A->>C: YES — I promise I can commit
B->>C: YES — I promise I can commitA YES vote is an irrevocable promise: the participant guarantees it can commit the transaction at any point in the future, even after a crash and recovery. This is why it must fsync its redo log before voting — the data must survive a reboot.
A NO vote means the participant cannot commit (constraint violation, deadlock, disk failure). The coordinator will abort the entire transaction.
Phase 2 — Commit or Abort
sequenceDiagram
participant C as Coordinator
participant A as Participant A
participant B as Participant B
Note over C: All voted YES
C->>C: Write COMMIT to transaction log (fsync) — the commit point
par Commit in parallel
C->>A: COMMIT (txn=T1)
C->>B: COMMIT (txn=T1)
end
Note over A: Apply redo log
Release row lock
Note over B: Apply redo log
Release row lock
A->>C: ACK
B->>C: ACK
C->>C: Write END to transaction logThe commit point is the moment the coordinator writes COMMIT to its durable log — not when participants acknowledge. Once that record is fsync’d, the transaction is committed regardless of subsequent crashes.
If any participant voted NO:
sequenceDiagram
participant C as Coordinator
participant A as Participant A
participant B as Participant B
A->>C: YES
B->>C: NO (constraint violation)
C->>C: Write ABORT to transaction log
par Abort in parallel
C->>A: ABORT (txn=T1)
C->>B: ABORT (txn=T1)
end
Note over A: Apply undo log
Release row lock
Note over B: Release lock (never modified data)Failure Modes
This is where 2PC gets interesting — and where it earns the label “blocking protocol.”
Participant Crash Before Voting
Coordinator sends PREPARE to A and B
A crashes before responding
Coordinator waits for timeout → no response from A
Coordinator writes ABORT → sends ABORT to B
B rolls back, releases locks
A recovers: checks with coordinator → "transaction was aborted" → rolls backImpact: The coordinator simply aborts. No blocking.
Participant Crash After Voting YES
A votes YES, then crashes immediately
Coordinator received YES from both A and B
Coordinator writes COMMIT, sends COMMIT to both
B commits successfully, A is down
A recovers: reads its redo log → "I voted YES for T1"
A asks coordinator: "What happened to T1?"
Coordinator: "T1 was committed"
A applies redo log → commitsImpact: A is temporarily unavailable but recovers correctly. Its YES vote guaranteed it could commit later.
Coordinator Crash — The Blocking Problem
This is the critical failure mode that defines 2PC’s weakness:
sequenceDiagram
participant C as Coordinator
participant A as Participant A
participant B as Participant B
par Prepare
C->>A: PREPARE
C->>B: PREPARE
end
A->>C: YES (locks held, redo log written)
B->>C: YES (locks held, redo log written)
Note over C: 💥 Coordinator crashes BEFORE
writing COMMIT or ABORT
Note over A: Voted YES — cannot abort unilaterally
Cannot commit — don't know if B voted YES
BLOCKED: holding locks, waiting for coordinator
Note over B: Same situation — BLOCKED
Note over A,B: Both participants stuck holding locks
until coordinator recovers and replays decisionWhy this is devastating:
- Participants hold row-level locks during the entire wait
- Other transactions that need those rows are blocked too
- If the coordinator’s disk is destroyed, participants may be stuck indefinitely
- No participant can safely make a unilateral decision — committing when the coordinator intended to abort (or vice versa) would violate atomicity
Coordinator Recovery
The coordinator’s durable transaction log enables recovery:
| Log State | Recovery Action |
|---|---|
| No record of T1 | Abort (never started Phase 2) |
BEGIN but no decision | Abort (PREPARE may have been sent, but no commit decided) |
COMMIT written | Re-send COMMIT to all participants |
ABORT written | Re-send ABORT to all participants |
END written | Nothing to do — transaction fully completed |
Participants that time out waiting for a decision can ask the coordinator (or, in cooperative recovery protocols, ask other participants) for the outcome.
Performance Cost
Timeline of a 2PC transaction:
Client ──── BEGIN TXN ──── PREPARE ──── COMMIT ──── END
│ │
Locks acquired ─┘ └─ Locks released
Lock hold time = network RTT (prepare) + coordinator fsync
+ network RTT (commit) + participant fsync
≈ 4-20ms locally, 100-400ms cross-datacenter| Cost | Detail |
|---|---|
| Network round trips | 2 (prepare + commit), 4 messages minimum |
| Forced disk writes | Coordinator: 2 (commit record, end record). Each participant: 1 (redo log before YES vote) |
| Lock duration | Entire protocol — not just the local operation |
| Latency | Bounded by the slowest participant |
Cross-datacenter 2PC is particularly painful: a 100ms RTT between datacenters means locks are held for 200-400ms, dramatically reducing throughput.
XA Transactions
XA is the industry standard API for 2PC across heterogeneous resources (e.g., PostgreSQL + Oracle + a message queue in the same transaction).
Application (Transaction Manager)
├── xa_start() → PostgreSQL (Resource Manager 1)
├── xa_start() → Oracle (Resource Manager 2)
│
│ ... execute SQL on both databases ...
│
├── xa_prepare() → PostgreSQL → YES
├── xa_prepare() → Oracle → YES
├── xa_commit() → PostgreSQL → OK
└── xa_commit() → Oracle → OKWhy XA is rare in practice:
- Each resource manager holds an open connection and locks during the entire protocol
- The transaction manager (coordinator) is a single point of failure
- Performance is 2-10x worse than local transactions
- Most modern applications use microservices with separate databases — XA doesn’t work across independent services
When to Use 2PC vs Alternatives
| Scenario | Use | Why |
|---|---|---|
| Cross-shard commit within one database (Spanner, CockroachDB) | 2PC | Controlled environment, low latency between shards, database manages coordinator |
| Cross-service business transaction (Order → Payment → Inventory) | Saga | Services are independent, locks across services are impractical |
| Database + message queue atomicity | Outbox pattern | Avoids distributed transaction entirely |
| Read-only cross-node query | Neither | No atomicity needed for reads |
Interview red flag: If a candidate proposes 2PC across microservices (“we’ll use a distributed transaction to ensure the order and payment are atomic”), that signals a misunderstanding of microservice architecture. 2PC requires tight coupling and shared failure domains — the opposite of why you chose microservices. The correct answer is the Saga pattern with compensating transactions.
Three-Phase Commit (3PC)
3PC adds a pre-commit phase between prepare and commit to eliminate the blocking problem:
Phase 1: PREPARE → YES/NO (same as 2PC)
Phase 2: PRE-COMMIT → ACK (participants know commit is coming)
Phase 3: COMMIT → ACKIf the coordinator crashes after pre-commit, participants know the decision was to commit (they received pre-commit) and can proceed without the coordinator. If they never received pre-commit, they can safely abort.
Why 3PC is not used in practice:
- Requires a synchronous network (bounded message delay) — unrealistic in real networks
- Network partitions can still cause inconsistency (one side commits, the other aborts)
- The extra round trip adds latency without solving the real-world problem
- Paxos/Raft-based commit protocols (e.g., Spanner) solve the problem properly
Interview tip: I treat 2PC as the right tool for cross-shard transactions inside one database product — Spanner and CockroachDB both layer 2PC on top of Paxos/Raft, which makes the coordinator itself fault-tolerant and eliminates the classical blocking-on-coordinator-crash problem. Across microservices, I’d never reach for 2PC: it requires XA support everywhere, holds locks across the network, and tightly couples services that we deliberately decoupled — the correct answer there is the Saga pattern with idempotent compensating transactions. The only time I’d consider XA in microservices is the narrow case of a single transaction spanning a database and a message broker — and even then I’d prefer the outbox pattern to avoid distributed transactions entirely.
Test Your Understanding
In 2PC, the coordinator sends PREPARE, all participants vote YES, and then the coordinator crashes before sending COMMIT. What happens to the participants?
They’re blocked. Each participant has voted YES and locked its resources (rows, tables), waiting for the coordinator’s decision. They can’t commit (no COMMIT received) or abort (they voted YES — the coordinator might send COMMIT when it recovers). This is the blocking problem — the fundamental weakness of 2PC.
Participants wait indefinitely or until a timeout triggers a heuristic abort (which risks inconsistency if the coordinator eventually sends COMMIT to other participants).
Fix in modern systems: Spanner and CockroachDB layer 2PC on top of Paxos/Raft, making the coordinator itself fault-tolerant — its decision is replicated to a quorum before any participant is told.
One participant votes NO during the PREPARE phase. What happens to participants that already voted YES?
The coordinator sends ABORT to all participants. The YES-voters release their locks and roll back their prepared changes using undo logs. The NO-voter also rolls back.
This is safe because no participant has committed — voting YES only means “I’m prepared to commit if you tell me to.” The coordinator’s decision is the single point of serialization. Since one participant said NO, the global decision is ABORT.
Why is 2PC acceptable within a single database (e.g., Spanner cross-shard transactions) but unacceptable across microservices?
Within a database: All participants (shard leaders) are controlled by the same system, speak the same protocol, and the coordinator can be made fault-tolerant (Paxos/Raft-backed). Lock durations are short (milliseconds). The database manages recovery internally.
Across microservices: Each service is independently deployed, may use different databases, and the XA protocol required for cross-service 2PC is poorly supported. Holding locks across network boundaries means a slow service blocks resources in every other service. If the coordinator crashes, all services are blocked until manual intervention. And 2PC tightly couples services that were deliberately decoupled.
The microservice alternative: Saga pattern with compensating transactions. Each step is a local transaction that’s independently committed. If step 3 fails, steps 1 and 2 are compensated (reversed). No distributed locks, no blocking, but requires designing idempotent compensating actions.