Consistent Hashing
Consistent hashing solves a specific problem: how to distribute keys across a dynamic cluster of nodes such that adding or removing a node moves the minimum possible number of keys. It is the foundation of distributed caches, Cassandra’s storage layer, and any system where the set of servers is not fixed.
The Problem with Naive Modulo Hashing
The intuitive approach to distributing keys across N nodes is:
node = hash(key) % NThis works perfectly when N never changes. When N changes — a node fails, or you add capacity — it breaks catastrophically.
Why? Each key’s node assignment is hash(key) % N. When N changes to N+1, % N and % (N+1) produce different results for almost all keys.
N = 3 nodes. hash(key) % 3:
key_A → hash=10 → 10%3 = 1 → Node 1
key_B → hash=11 → 11%3 = 2 → Node 2
key_C → hash=12 → 12%3 = 0 → Node 0
key_D → hash=13 → 13%3 = 1 → Node 1
Add 1 node. N = 4. hash(key) % 4:
key_A → hash=10 → 10%4 = 2 → Node 2 ← MOVED
key_B → hash=11 → 11%4 = 3 → Node 3 ← MOVED
key_C → hash=12 → 12%4 = 0 → Node 0 ← same
key_D → hash=13 → 13%4 = 1 → Node 1 ← sameIn this example, 2 of 4 keys moved. With uniformly distributed hashes and N nodes, adding one node remaps approximately (N−1) / N of all keys to different nodes:
| N (before) | Fraction of keys remapped |
|---|---|
| 3 → 4 | ~75% |
| 9 → 10 | ~90% |
| 99 → 100 | ~99% |
For a distributed cache with millions of keys, a 90% remapping rate on a scale-up event means a near-complete cache invalidation — every remapped key becomes a cache miss that hits the origin simultaneously. This is a thundering herd that can take down the origin.
Consistent hashing reduces the fraction of keys remapped to approximately 1 / (N+1) — only the keys that belong to the new node’s portion of the key space.
The Hash Ring
Map the hash output space onto a circle (ring). Both nodes and keys are hashed to positions on the ring.
Hash space: [0, 2^32) arranged as a ring
0
●
/ \
Node A Node C
● ●
| |
2^32/4 3*2^32/4
| |
Node B
●
\ /
2^32/2Node placement: Hash each node’s identifier (e.g., IP address or node_name) to a position on the ring.
Key lookup rule: To find which node owns a key, hash the key to a ring position and walk clockwise until you find a node. That node owns the key.
Ring positions (simplified to 0–99):
Node A: position 10
Node B: position 40
Node C: position 70
Key "foo" → hash=25 → walk clockwise → hits Node B at 40 → owned by Node B
Key "bar" → hash=55 → walk clockwise → hits Node C at 70 → owned by Node C
Key "baz" → hash=85 → walk clockwise → wraps around → hits Node A at 10 → owned by Node ANode arc ownership: Each node owns the arc of the ring between itself and the previous node (counter-clockwise). Node B at position 40 owns keys that hash to positions [11, 40].
Adding and Removing Nodes
Adding a Node
New node D is placed at position 55 on the ring.
Before: After adding Node D at 55:
...Node B(40)...Node C(70)... ...Node B(40)...Node D(55)...Node C(70)...
Keys previously owned by Node C → Keys in arc [41, 55] now owned by Node D
that hash to [41, 55] Keys in arc [56, 70] remain with Node COnly keys in the arc [41, 55] need to move — from Node C to Node D. No other keys are affected. With N existing nodes of equal size, the new node takes approximately 1/(N+1) of the total keys. For a 10-node cluster, adding one node moves ~9% of keys, not 90%.
Removing a Node (or Node Failure)
Node B at position 40 is removed.
Before: ...Node A(10)...Node B(40)...Node C(70)...
After: ...Node A(10)...Node C(70)...
Keys owned by Node B [11, 40] → reassigned to Node C (next clockwise node)Only keys that were on Node B move. All other keys are unaffected.
Virtual Nodes
Plain consistent hashing with one ring position per physical node has a critical flaw: non-uniform distribution.
With 3 nodes, the arcs are not guaranteed to be equal. If Node A lands at position 10, Node B at 11, and Node C at 90, then:
- Node A owns arc [91, 10] = 20 positions (20% of keys)
- Node B owns arc [11, 11] = 1 position (1% of keys) ← massively under-loaded
- Node C owns arc [12, 90] = 79 positions (79% of keys) ← massively over-loaded
Real hash functions produce roughly uniform positions across many nodes, but with small clusters the variance is high.
Virtual nodes (vnodes): Each physical node is assigned multiple positions on the ring instead of one. Each position is a virtual node — a separate hash of the physical node identity.
Physical node A → Virtual positions: A_1(8), A_2(35), A_3(67), A_4(89)
Physical node B → Virtual positions: B_1(22), B_2(48), B_3(72), B_4(95)
Physical node C → Virtual positions: C_1(14), C_2(55), C_3(80), C_4(99)
Ring (positions 0–99):
8 14 22 35 48 55 67 72 80 89 95 99
[A1][C1][B1][A2][B2][C2][A3][B3][C3][A4][B4][C4]Benefits of vnodes:
| Benefit | Why |
|---|---|
| Even distribution | With enough vnodes (128–256 per node), statistical variance in load averages out |
| Smoother rebalancing | When a node is added, it takes small arcs from many existing nodes — load spreads evenly rather than burdening one neighbor |
| Heterogeneous hardware | Assign more vnodes to more powerful machines — they own more arcs and more keys proportionally |
| Faster recovery | A failed node’s keys are redistributed to many nodes (each taking one small arc) rather than one successor node taking all the load |
Vnode count tradeoff: More vnodes = better distribution, but also larger routing metadata (each node must know all other nodes’ vnode positions). Cassandra defaults to 256 vnodes per node. For very large clusters (1000+ nodes), 8–16 vnodes is a pragmatic compromise.
Replication on the Ring
In a distributed storage system, each key is replicated to R nodes for fault tolerance. The standard approach: store the key on the R successive nodes clockwise from the key’s position.
Replication factor R = 3:
key "foo" → position 25 → clockwise: Node B(40), Node C(70), Node A(10+wrap)
↑primary ↑replica1 ↑replica2Why this interacts well with consistent hashing: When Node B fails, Node C (already a replica) promotes to primary for those keys. The system reads from the next available replica with no data loss — the replica was already there.
With vnodes: the R successive clockwise virtual nodes must belong to R different physical nodes — otherwise a single machine failure loses all replicas. Cassandra enforces this by skipping vnodes that belong to a physical node already in the replica set.
Real-World Applications
Cassandra Token Ring
Cassandra uses consistent hashing as its core partitioning mechanism. The ring positions are called tokens.
Row key → Murmur3 hash → token (64-bit integer)
Token → mapped to the node that owns that token rangeIn Cassandra 3.0+, vnodes (called “token ranges”) are the default with 256 vnodes per node. The coordinator node (the node that receives a client request) routes the request to the primary replica using the ring, then waits for quorum acknowledgement from the replicas.
Memcached (libketama)
Consistent hashing for Memcached was popularized by libketama (Last.fm, 2007), which added client-side consistent hashing to Memcached’s simpler modulo-based approach. The client hashes both the server list and each key to a ring, routing each key to the correct server without a central coordinator.
Load Balancer Session Affinity
Some L7 load balancers use consistent hashing on a client attribute (source IP, session cookie, user ID) to route requests from the same client to the same backend server. Unlike IP hash (which breaks when backends are added), consistent hashing moves only 1/N of existing sessions on a scale event — most sessions stay with their current backend.
Redis Cluster (Hash Slots — a related approach)
Redis Cluster uses 16,384 fixed hash slots rather than a continuous ring:
slot = CRC16(key) % 16384
slot 0–5460 → Node A
slot 5461–10922 → Node B
slot 10923–16383 → Node CThis is conceptually similar to virtual nodes on a ring but with a fixed discrete number of slots. When resharding, individual slots are moved between nodes. The slot count (16,384) is chosen to allow fine-grained redistribution while keeping the slot map small enough to gossip between nodes (2 KB bitmap).
Hotspot Mitigation
Virtual nodes reduce statistical load imbalance, but they cannot solve application-level hotspots — a single key receiving disproportionate traffic.
Example: A viral tweet with tweet_id=99 receives 1 million requests per second. Consistent hashing assigns it to one node. Every request for that tweet hits the same node regardless of how many vnodes exist — vnodes distribute different keys, not traffic for the same key.
Strategies:
| Strategy | How | When to use |
|---|---|---|
| Application-level cache | Cache the hot object in local in-process memory — requests never reach the distributed cache | Hot objects known ahead of time (product launches, featured content) |
| Key replication with random suffix | Store the hot key as N copies: tweet:99:shard_0 through tweet:99:shard_7 — reads pick a random shard | Hot keys not predictable; read-heavy |
| Explicit shard splitting | Manually assign a hot partition to its own dedicated node | Known-hot partitions in Cassandra (override token assignment) |
| Read replicas with scatter | Direct reads for hot keys to replicas; LB distributes across R replicas | Works with consistent hashing replication |
| CDN / edge cache | Cache the hot object at CDN edge nodes — requests never reach the origin cluster | Public, cacheable content |
The fundamental limit: Consistent hashing distributes a keyspace across nodes. It cannot distribute a single key’s traffic across nodes — that requires application-level routing logic above the hashing layer.
Summary: Why Consistent Hashing Works
| Property | Naive mod hashing | Consistent hashing |
|---|---|---|
| Keys remapped on node add | ~(N-1)/N ≈ 90% for large N | ~1/(N+1) ≈ 9% for N=10 |
| Keys remapped on node remove | ~(N-1)/N ≈ 90% | Only keys on removed node (~1/N) |
| Load distribution | Uniform (by math) | Skewed without vnodes; uniform with vnodes |
| Hot key handling | Same problem | Same problem — needs application layer |
| Complexity | O(1) lookup | O(log N) lookup (binary search on sorted ring) |