Wide-Column Stores (Cassandra)
Cassandra is a leaderless, distributed wide-column store built for high write throughput and linear horizontal scale. It uses LSM trees for storage and consistent hashing for data distribution. Every architectural decision in Cassandra is a deliberate tradeoff: writes are always fast, reads are bounded, and consistency is tunable per operation.
Data Model
Cassandra’s data model looks like SQL tables but behaves fundamentally differently. The schema is a query specification — you design tables around access patterns, not around entities.
Every table has three components:
CREATE TABLE messages (
user_id UUID, -- partition key: determines which node stores this row
sent_at TIMESTAMP, -- clustering key: sort order within the partition
msg_id UUID, -- clustering key (secondary sort)
body TEXT, -- column value
PRIMARY KEY ((user_id), sent_at, msg_id)
)
WITH CLUSTERING ORDER BY (sent_at DESC, msg_id ASC);| Component | Role | Analogy |
|---|---|---|
| Partition key | Routes to the correct node(s); all rows with the same partition key live on the same node | Shard key |
| Clustering key | Sorts rows within a partition; enables range queries within a partition | ORDER BY + row grouping |
| Columns | The actual values; sparse — different rows can omit different columns | Column values |
What this enables: SELECT * FROM messages WHERE user_id = ? ORDER BY sent_at DESC LIMIT 50 is a single-node, sequential disk read — extremely fast. SELECT * FROM messages WHERE body LIKE '%hello%' is not supported — there is no index and no cross-partition scan.
Composite partition key: Spread data across more partitions by adding a column to the partition key:
PRIMARY KEY ((user_id, year_month), sent_at, msg_id)
-- messages for user_id=42 in month 2024-01 form one partition
-- messages for user_id=42 in month 2024-02 form another partitionDistribution: Token Ring and Virtual Nodes
Cassandra places nodes on a circular hash ring. Each key’s partition key is hashed to a 64-bit token (-2^63 to 2^63 - 1). The key is assigned to the first node clockwise on the ring that “owns” that token range.
Node A
(0 → 3000)
↗ ↘
Node D ────────── Node B
(9000 → 0) (3000 → 6000)
↖ ↙
Node C
(6000 → 9000)
key "user:42" → hash → token 4200 → owned by Node BVirtual nodes (vnodes): Instead of each physical node owning one contiguous arc, each node owns ~256 small token ranges scattered around the ring. Benefits:
- Adding a node takes small ranges from many existing nodes — no single node is overwhelmed
- Node failure redistributes load across all remaining nodes instead of one neighbor
- Heterogeneous hardware: assign more vnodes to larger machines
Replication factor (RF): Each partition is stored on RF nodes — the owning node plus the next RF-1 nodes clockwise. RF=3 is standard: one primary replica and two additional replicas.
RF=3: key → Node B (primary), Node C (replica 1), Node D (replica 2)Write Path
Cassandra writes are always fast because they are sequential I/O operations — the same LSM tree write path.
sequenceDiagram
participant C as Client
participant Co as Coordinator (any node)
participant B as Replica Node B
participant Cv as Replica Node C
participant D as Replica Node D
C->>Co: Write (user_id=42, sent_at=now, body="hello")
par Parallel writes to all RF=3 replicas
Co->>B: commit log fsync → memtable
Co->>Cv: commit log fsync → memtable
Co->>D: commit log fsync → memtable
end
B-->>Co: ACK
Cv-->>Co: ACK
Note over Co: QUORUM met (2 of 3 ACKed)
Co->>C: Write acknowledged
Note over B,D: Background: memtable → SSTable flush → compactionCoordinator: Any node in the cluster can act as the coordinator for a request. The coordinator routes writes to the correct replica nodes and waits for acknowledgements according to the consistency level.
Hinted handoff: If a replica node is temporarily down during a write, the coordinator stores the write locally as a “hint.” When the down node comes back online, the coordinator delivers the pending hints, ensuring the replica eventually catches up. Hints expire (default 3 hours) — if a node is down longer, manual repair is needed.
Read Path
Reads are more expensive than writes. Cassandra must merge data from multiple SSTables and check multiple replicas depending on the consistency level.
sequenceDiagram
participant C as Client
participant Co as Coordinator
participant B as Replica Node B
participant Cv as Replica Node C
C->>Co: Read (user_id=42)
Co->>B: Fetch data (bloom filter → SSTable lookup)
Co->>Cv: Fetch digest only
B-->>Co: Full data + digest
Cv-->>Co: Digest only
alt Digests match
Co->>C: Return data
else Digests differ (stale replica)
Co->>Cv: Fetch full data for read repair
Cv-->>Co: Stale data
Note over Co: LWW reconciliation — latest timestamp wins
Co-->>Cv: Write corrected value back (async read repair)
Co->>C: Return latest data
endBloom filter: Before touching any SSTable, Cassandra checks a per-SSTable Bloom filter. If the filter says “definitely not here,” that SSTable is skipped. This is the critical optimization that prevents reading every SSTable on every query.
Read repair: When replicas return different data (stale replica due to missed write), the coordinator reconciles using last-write-wins (LWW) timestamp ordering and asynchronously writes the correct value back to the stale replica.
Consistency Levels
Cassandra’s consistency is tunable per operation. The W + R > RF formula determines whether a read always sees the latest write (strong consistency).
| Level | Nodes that must acknowledge | Strong consistency (RF=3)? | Latency |
|---|---|---|---|
ONE | 1 | ❌ (may miss latest write) | Lowest |
LOCAL_ONE | 1 in local datacenter | ❌ | Lowest |
QUORUM | Majority: floor(RF/2) + 1 = 2 of 3 | ✅ (W=QUORUM + R=QUORUM → 2+2>3) | Medium |
LOCAL_QUORUM | Majority in local DC only | ✅ within DC | Medium |
ALL | All RF nodes | ✅ | Highest — loses availability if any replica is down |
The practical choice: LOCAL_QUORUM for both reads and writes is the standard production setting. It provides strong consistency within a datacenter while tolerating one replica failure, and avoids the cross-datacenter latency of QUORUM.
QUORUM write + QUORUM read = strong consistency:
Write acknowledged by 2/3 nodes → at least one node has the latest write
Read from 2/3 nodes → that read set always overlaps the write setConsistency vs availability tradeoff:
ONE: maximum availability, minimum consistency — fine for append-only event loggingQUORUM: balanced — the default for most OLTP-like accessALL: maximum consistency, minimum availability — rarely used (one node failure = unavailable)
Compaction Strategies
Cassandra accumulates SSTables from memtable flushes. Compaction merges them, removes tombstones, and maintains read performance. The strategy choice depends on the workload.
| SizeTieredCompaction (STCS) | LeveledCompaction (LCS) | |
|---|---|---|
| How it works | Merge SSTables of similar size into larger ones | Organize into levels; L1 and above have non-overlapping key ranges |
| Read amplification | Higher — many SSTables may contain the same key | Low — at most one SSTable per level contains a given key |
| Write amplification | Low — each key rewritten fewer times | High — keys cascades through levels |
| Space amplification | High — up to 2× temp space during compaction | Low |
| Best for | Write-heavy (time-series, logging, high ingest) | Read-heavy, latency-sensitive, space-constrained |
| Default | Yes | No (opt-in per table) |
CREATE TABLE sensor_readings (...)
WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};Tombstone warning: Deleted rows leave tombstone markers rather than immediate removal. Tombstones are only purged during compaction after the gc_grace_seconds period (default 10 days — allows hinted handoff to deliver missed deletes before they’re gone). A table with heavy deletions accumulates tombstones that slow reads until compaction runs. Monitor tombstone_scanned in query traces.
Schema Design Rules
Cassandra enforces a discipline that SQL does not: design one table per query pattern. This is not a workaround — it is the intended approach.
Rule 1: Know your queries before designing your schema.
What are the exact WHERE clauses? Those become partition keys and clustering keys.
Rule 2: Denormalize intentionally. The same data may exist in multiple tables serving different query shapes. This is correct — Cassandra trades storage for query efficiency.
-- Query 1: "get all messages for a user, newest first"
CREATE TABLE messages_by_user (
user_id UUID, sent_at TIMESTAMP, msg_id UUID, body TEXT,
PRIMARY KEY ((user_id), sent_at, msg_id)
) WITH CLUSTERING ORDER BY (sent_at DESC);
-- Query 2: "get all messages in a conversation, newest first"
CREATE TABLE messages_by_conversation (
conv_id UUID, sent_at TIMESTAMP, msg_id UUID, body TEXT,
PRIMARY KEY ((conv_id), sent_at, msg_id)
) WITH CLUSTERING ORDER BY (sent_at DESC);Both tables store the same message — written to twice, stored twice. Reads are single-partition, sequential, and fast on both.
Rule 3: Bound your partition size.
An unbounded partition grows forever and becomes a hot spot. A messages_by_user table where one user sends 10 billion messages creates a partition that no single node can serve efficiently.
Fix: add a time bucket to the partition key to cap partition size.
PRIMARY KEY ((user_id, bucket), sent_at, msg_id)
-- bucket = YYYY-MM (month bucket) or YYYY-WW (week bucket)
-- each partition holds at most one month of messages per userHot Partition Problem
A hot partition is a partition key that receives disproportionately more traffic than others — one node becomes the bottleneck regardless of cluster size. This is the most common production performance problem in Cassandra.
Causes:
- Partition key with low cardinality (e.g.,
status = 'pending'— all pending orders in one partition) - A “celebrity” entity (a popular user, a viral post) whose partition receives orders of magnitude more reads/writes
- Time-based partition key with current time bucket receiving all writes
Mitigation strategies:
| Strategy | How | When to use |
|---|---|---|
| Random suffix (write-time salting) | Append a random suffix to partition key: user_id + random(0,N) → write to N partitions | Write hot spots; reads must scatter-gather across all N partitions |
| Time bucketing | Split monotonic time key into (entity_id, YYYY-MM) or (entity_id, YYYY-WW) | Unbounded append-only partitions (time-series, logs) |
| Composite partition key | Add a dimension to spread keys: (region, user_id) instead of just user_id | When a secondary dimension provides natural spread |
| Pre-computed aggregates | Instead of reading a hot partition, write aggregates to a summary table | High-read hot partitions (leaderboards, counters) |
Secondary indexes in Cassandra (CREATE INDEX) scatter queries across all nodes — every node is queried and results are merged. They appear convenient but are a performance trap for high-cardinality columns or high-traffic queries. Use materialized views or explicit denormalized tables instead.
When to Use Cassandra
| Fits well | Poor fit |
|---|---|
| Time-series data (IoT, metrics, events) | Ad-hoc queries (unknown WHERE clauses at schema time) |
| Write-heavy workloads (millions of writes/sec) | Multi-entity ACID transactions |
| Append-only logs and activity feeds | Complex joins across entities |
| Multi-datacenter active-active replication | Small datasets (operational overhead not justified) |
| Linear horizontal scale with no single point of failure | Frequent deletes (tombstone accumulation) |
Interview tip: When the question is “how would you store a billion messages a day?”, I’d reach for Cassandra and immediately explain the schema is query-driven: “I’d design one table per query pattern — messages_by_user partitioned on user_id, messages_by_conversation partitioned on conv_id — and accept that I’m storing each message twice. Denormalization isn’t a workaround in Cassandra, it’s the design.” I’d bound partition size with a time bucket like (user_id, year_month) to avoid unbounded partitions, run LOCAL_QUORUM reads and writes for strong consistency within a DC at one-replica-failure tolerance, and explicitly call out the hot-partition problem as the most common production failure mode — fixing it with random suffixing or sharded counters when needed. I’d avoid Cassandra secondary indexes; they scatter queries across every node and become a performance trap.
Test Your Understanding
A Cassandra table is partitioned by user_id. One user has 50 million messages. Reads for this user time out. What’s wrong and how do you fix it?
Unbounded partition. All 50M messages live in a single partition (one node). Reading requires loading all rows into memory — Cassandra partitions should ideally stay under 100MB / 100K rows.
Fix: Add a time bucket to the partition key: PRIMARY KEY ((user_id, year_month), created_at). Each partition holds at most one month. A query for recent messages hits one small partition; full history requires scatter-gather across month buckets and client-side merge.
You write with CL=ONE and read with CL=ONE (RF=3). Can a read return stale data?
Always possible. R + W = 2, which is ≤ N (3) — the quorum rule isn’t satisfied. The write may go to Node A, the read to Node B, and B hasn’t replicated yet.
Even after replication, LWW (Last Write Wins — Cassandra resolves conflicts by timestamp) means a clock-skewed node could overwrite a newer value with an older one.
For strong consistency within a DC: Use LOCAL_QUORUM for both reads and writes. With RF=3, that requires 2/3 replicas — guaranteed overlap between read and write sets.
You model a ’likes’ counter as a regular column and increment it with SET likes = likes + 1. Under concurrent writes, increments are lost. Why?
Cassandra has no read-modify-write atomicity. The client reads likes=100, computes 101, writes 101. Two concurrent clients both read 100 and both write 101 — one increment lost.
Fix: Use a sharded counter pattern. Write increments to N random partition shards (likes:post_123:shard_0 through shard_9). Read by summing all shards. This distributes write contention and eliminates the lost-update problem. Cassandra’s built-in counter columns exist but have eventual consistency issues under contention.
A Cassandra cluster has RF=3 with NetworkTopologyStrategy across 3 DCs (RF=3 per DC). You write with LOCAL_QUORUM. The entire US DC goes down. Is data lost?
No data is lost. With RF=3 per DC, each DC has 3 independent replicas. The US DC is down but EU and APAC each have all the data.
However, LOCAL_QUORUM only counts replicas in the local DC. Clients in the US DC can’t write (no local replicas to form quorum) and must be rerouted to another DC. Clients in EU/APAC continue operating normally.
If RF=3 total (1 per DC), losing the US DC means 1 of 3 replicas is lost. Data survives on the other 2, but LOCAL_QUORUM in the US fails — no local replicas at all.