Leader-Based Replication
Most production databases replicate data using a single leader (primary) that handles all writes. Changes flow from the leader to one or more followers (replicas) that serve read traffic. This is the default replication model in PostgreSQL, MySQL, MongoDB, and most managed database services (RDS, Cloud SQL, Atlas).
How It Works
flowchart LR
C([Client]) -->|"all writes"| L[(Leader)]
L -->|"replication stream"| F1[(Follower 1)]
L -->|"replication stream"| F2[(Follower 2)]
C2([Client]) -->|"reads"| F1
C3([Client]) -->|"reads"| F2
C4([Client]) -->|"reads"| LEvery write goes to the leader. The leader persists the change locally (WAL fsync), then streams it to followers via a replication log. Followers apply changes in order and serve read traffic. The leader can also serve reads, but in read-heavy systems, offloading reads to followers reduces primary load.
Setting Up New Followers
Adding a new follower to a running cluster cannot be done by simply copying the data directory — clients are writing continuously, so a naive copy would be inconsistent.
Take a consistent snapshot
Capture a point-in-time snapshot of the leader’s data without locking the database for writes. Most databases provide this: pg_basebackup (PostgreSQL), mysqldump --single-transaction (MySQL), or filesystem-level snapshots.
Transfer the snapshot
Copy the snapshot to the new follower node. This can be hundreds of gigabytes for large databases — tools like rsync or cloud snapshot restore speed this up.
Connect and catch up
The follower connects to the leader and requests all changes since the snapshot’s position in the replication log (LSN in PostgreSQL, binlog coordinates in MySQL). The leader streams the backlog.
Follower is caught up
Once the follower has processed all pending changes, it is in sync and can serve read traffic. It continues receiving the live replication stream.
Handling Follower Failure
When a follower crashes and restarts, it knows the last transaction it applied (recorded in its local replication state). It reconnects to the leader and requests all changes from that point onward. The leader streams the missed transactions, and the follower catches up. No manual intervention needed — this is automatic in all modern databases.
Handling Leader Failure (Failover)
When the leader fails, a follower must be promoted to become the new leader. This process is called failover.
sequenceDiagram
participant C as Client
participant L as Leader (fails)
participant F1 as Follower 1 (promoted)
participant F2 as Follower 2
C->>L: write W1, W2, W3
L-->>F1: replicate W1, W2, W3
L-->>F2: replicate W1, W2
Note over L: Leader crashes
Note over F1,F2: Failover: F1 has most recent data → promoted to leader
C->>F1: writes now go to new leader
F1-->>F2: F2 reconfigured to replicate from F1Detect leader failure
Monitoring systems or follower nodes detect that the leader is unresponsive (missed heartbeats, failed health checks). Typical timeout: 10–30 seconds.
Select the new leader
The follower with the most recent replication state is the best candidate — it has the least data loss. In consensus-based systems (Raft), an election determines the winner.
Reconfigure the cluster
Clients must direct writes to the new leader. Other followers must replicate from the new leader. DNS updates, virtual IP failover, or proxy reconfiguration handles the routing change.
Challenges in Failover
| Challenge | Risk | Mitigation |
|---|---|---|
| Data loss | With async replication, the new leader may be missing writes that the old leader accepted but hadn’t replicated yet | Use semi-synchronous replication — at least one follower confirms each write before the leader ACKs |
| Split brain | Network partition causes two nodes to both believe they are the leader, accepting conflicting writes | Fencing tokens, epoch numbers, or STONITH (“Shoot The Other Node In The Head”) — forcibly power off the old leader |
| Client redirection | Clients continue writing to the old leader if DNS TTL hasn’t expired or connection pools are stale | Use a proxy layer (HAProxy, PgBouncer) with fast health checks; virtual IP failover for sub-second redirection |
Split brain is the most dangerous failure mode. If two nodes both accept writes as leader, those writes are conflicting and at least one set will be lost during reconciliation. Automated failover tools (Patroni for PostgreSQL, Orchestrator for MySQL) implement fencing to prevent this — they ensure the old leader is shut down before the new leader starts accepting writes.
Replication Topologies
Single leader, direct replicas (standard):
Leader → Follower 1
→ Follower 2
→ Follower 3
Chained (relay):
Leader → Follower 1 (relay) → Follower 2 → Follower 3
↑ reduces leader fan-out, but adds cumulative lag
Multi-region:
Leader (us-east) ─── async ───► Regional Follower (eu-west)
─── async ───► Regional Follower (ap-south)| Topology | Leader network load | Lag profile | Use case |
|---|---|---|---|
| Direct replicas | Fan-out to all | Lowest (no relay) | ≤5 replicas |
| Relay chain | Low (one stream) | Cumulative per hop | Many replicas, read-heavy |
| Multi-region async | Low (one stream per region) | 50–200ms cross-region | Geographic read locality |
Why Not Multi-Leader?
In leader-based replication, only one node accepts writes. This avoids the write conflict problem entirely — there’s a single serialization point for all mutations. Multi-leader replication (where multiple nodes accept writes) introduces write conflicts that require resolution strategies (last-write-wins, CRDTs, application-level merge). The complexity is significant and only justified when write latency from a single region is unacceptable — covered in Multi-Region Design.
What This Means in Practice
| Database | Default mode | Leader election | Replication log |
|---|---|---|---|
| PostgreSQL | Async streaming | Manual or Patroni | WAL (physical) or logical |
| MySQL | Async | Manual or Orchestrator/Group Replication | Binlog (statement or row) |
| MongoDB | Replica set (async, priority-based election) | Automatic (Raft-like) | Oplog |
| Amazon RDS | Multi-AZ (sync standby) + read replicas (async) | Automatic | WAL/binlog |
Interview tip: When discussing replication, say: “I’d use single-leader replication — the leader handles all writes, and followers replicate asynchronously for read scaling. For durability, I’d configure semi-synchronous replication so at least one follower confirms each write before the client gets an ACK — this means zero data loss on leader failure as long as the confirmed follower is promoted. Failover is handled by Patroni (PostgreSQL) or Orchestrator (MySQL) with fencing to prevent split brain. Reads that can tolerate seconds of lag go to followers; reads that must be current go to the leader.” This covers the write path, durability guarantee, failover safety, and read routing — the four things interviewers evaluate.
Test Your Understanding
The leader crashes. A follower with the latest data is promoted. But the old leader comes back online and has a write the new leader doesn’t have. What happens?
Split-brain resolution. The old leader’s unreplicated write conflicts with the new leader’s timeline. The old leader must truncate its log to the point where it diverged from the new leader’s timeline, then replicate forward from the new leader.
In PostgreSQL with Patroni, the old leader is fenced (prevented from accepting writes) via its HA proxy being redirected. It then uses pg_rewind to resync with the new leader. The unreplicated write is lost — this is expected with async replication. Semi-synchronous replication prevents this by ensuring at least one follower confirms every write before the leader acknowledges it.
You configure fully synchronous replication — the leader waits for ALL followers to confirm. A single slow follower makes all writes slow. Is there a middle ground?
Semi-synchronous replication. The leader waits for one follower to confirm, then acknowledges the client. Other followers replicate asynchronously.
This gives zero data loss on single-node failure (the confirmed follower has every committed write) while keeping write latency close to async (one network round-trip, not N). PostgreSQL calls this synchronous_commit = on with synchronous_standby_names set to one replica. MySQL has rpl_semi_sync_master_wait_for_slave_count=1.
If the synchronous follower crashes, the leader either blocks writes (safest) or downgrades to async (available but risks data loss). This is the availability-durability trade-off.