Document Stores (MongoDB)
MongoDB stores data as BSON documents in schemaless collections. The central design decision is not which queries to support — it is whether to embed related data inside a document or reference it from a separate collection. Getting that decision right eliminates most of the pain that comes from using MongoDB at scale.
Document Model
A document is a BSON object — a superset of JSON with typed fields.
{
"_id": ObjectId("65a3f2c1e4b09d3f1a2b3c4d"), // 12-byte: timestamp + machine + pid + counter
"user_id": 42,
"email": "alice@example.com",
"created_at": ISODate("2024-01-14T10:00:00Z"), // BSON Date, not a string
"address": { // embedded sub-document
"street": "123 Main St",
"city": "Seattle",
"zip": "98101"
},
"tags": ["premium", "beta"], // array
"balance": NumberDecimal("9999.99") // Decimal128 — exact for money
}BSON types that matter: ObjectId (auto-generated unique ID with embedded timestamp), Date (millisecond precision, not string), NumberDecimal (exact decimal for financial values — use this, not double), BinData (binary blobs).
Embedding vs Referencing
This is the core schema decision. There is no JOIN in MongoDB — related data must either be embedded in the same document or fetched with a separate query.
| Scenario | Embed | Reference |
|---|---|---|
| Data always fetched together | ✅ | ❌ (extra round trip) |
| Sub-document has bounded growth | ✅ | — |
| Sub-document grows unboundedly | ❌ (16 MB document limit) | ✅ |
| Data shared by many documents | ❌ (duplicated across all) | ✅ (one copy, referenced) |
| Sub-document queried independently | ❌ (must load parent) | ✅ |
| Write contention on sub-document | ❌ (all writers contend on parent) | ✅ |
Example — embed comments in a post:
// Good if comments are bounded (e.g., max 100 per post)
{ "_id": 1, "title": "Post", "comments": [ {"author": "bob", "text": "..."} ] }
// Bad if a post can have 100,000 comments — document hits 16 MB limit
// Fix: separate "comments" collection, reference post_idThe embedding trap: When you embed an array expecting it to stay small and it grows without bound, the document eventually exceeds MongoDB’s 16 MB limit — at which point you must migrate to a reference model under load.
Indexes
MongoDB builds indexes on a mongod collection level. All index types are B-trees except geospatial (2d/2dsphere use geohash).
| Index type | Declaration | Behavior |
|---|---|---|
| Single-field | {field: 1} | Sort ascending; -1 for descending |
| Compound | {a: 1, b: -1} | Leftmost prefix rule identical to SQL compound indexes |
| Multikey | {tags: 1} | Automatically created when field is an array; one index entry per array element |
| Text | {body: "text"} | Tokenized full-text; supports $text queries with relevance scoring |
| 2dsphere | {loc: "2dsphere"} | GeoJSON points/polygons; $near, $geoWithin, $geoIntersects |
| Hashed | {user_id: "hashed"} | Used for hash-based sharding; equality only, no range |
| Wildcard | {"$**": 1} | Index all fields or a subtree; useful for dynamic/user-defined schemas |
| Partial | {status: 1} with partialFilterExpression | Index only documents matching a filter — smaller, faster |
| TTL | {created_at: 1} with expireAfterSeconds | Automatically delete documents after TTL — useful for sessions, logs |
Multikey index gotcha: A compound index cannot have more than one multikey field. {tags: 1, categories: 1} fails if both are arrays. Also, a multikey index on an array field cannot be used as a shard key.
Covered query: If all projected fields are in the index, MongoDB returns results directly from the index without touching the collection — identical to PostgreSQL’s index-only scan.
// Index: { user_id: 1, status: 1, created_at: 1 }
db.orders.find(
{ user_id: 42, status: "pending" },
{ _id: 0, created_at: 1 } // project only indexed field
)
// → "indexOnly: true" in explain() — no document fetch
Aggregation Pipeline
The aggregation pipeline transforms documents through a sequence of stages. Each stage receives documents from the previous stage and passes its output to the next.
db.orders.aggregate([
{ $match: { status: "completed", created_at: { $gte: ISODate("2024-01-01") } } },
{ $lookup: { from: "customers", localField: "customer_id",
foreignField: "_id", as: "customer" } },
{ $unwind: "$customer" },
{ $group: { _id: "$customer.region",
total_revenue: { $sum: "$amount" },
order_count: { $sum: 1 } } },
{ $sort: { total_revenue: -1 } },
{ $limit: 10 },
{ $project: { _id: 0, region: "$_id", total_revenue: 1, order_count: 1 } }
])Key stages:
| Stage | Purpose |
|---|---|
$match | Filter documents — place early to reduce pipeline input; uses indexes if first stage |
$group | Aggregate: $sum, $avg, $min, $max, $push (array), $addToSet (distinct array) |
$lookup | Left outer join to another collection — expensive without an index on the foreign field |
$unwind | Deconstruct an array field into one document per element |
$project | Reshape documents: include/exclude/rename/compute fields |
$addFields | Add computed fields without removing existing ones |
$sort + $limit | Top-N pattern — MongoDB optimizes $sort + $limit as a single top-N heap |
$facet | Run multiple sub-pipelines on the same input — for faceted search results |
$bucket | Group documents into ranges (histogram) |
$out / $merge | Write pipeline results to a collection — for pre-computed aggregations |
$lookup performance trap: $lookup performs an in-memory join. Without an index on the foreignField, it scans the entire foreign collection for every input document — an O(n × m) operation. Always create an index on the foreign field before using $lookup in production.
Memory limit: Aggregation pipelines have a 100 MB memory limit per stage. Add { allowDiskUse: true } for large aggregations, or use $out/$merge to materialize intermediate results.
Replication: Replica Set
A replica set is a group of mongod instances that maintain the same data. One node is the primary (accepts writes); the others are secondaries (replicate from primary).
┌──────────────┐
Writes ─► Primary │◄── Heartbeat (every 2s)
└──────┬───────┘
oplog │ async replication
┌────────┴────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│Secondary │ │Secondary │ (or Arbiter — votes but holds no data)
└──────────┘ └──────────┘Oplog (operations log): A capped collection on the primary that records every write as an idempotent operation. Secondaries tail the oplog and replay operations to stay in sync. The oplog window (how far back a secondary can lag before needing full resync) is determined by oplog size — typically configured to hold 24–72 hours of writes.
Automatic failover:
- Primary becomes unreachable (heartbeat misses)
- Secondaries detect failure after
electionTimeoutMillis(default 10s) - An eligible secondary calls an election; majority vote required to become primary
- Clients with replica set connection strings auto-discover the new primary
Read preferences:
| Preference | Reads from | Use case |
|---|---|---|
primary (default) | Primary only | Strong consistency — always latest data |
primaryPreferred | Primary; secondary if primary unavailable | Slight availability improvement |
secondary | Any secondary | Offload reads; accepts replication lag |
secondaryPreferred | Secondary; primary if no secondary available | Maximize read scale |
nearest | Lowest latency node | Geo-distributed reads; may read stale data |
Write concern:
db.orders.insertOne(doc, { writeConcern: { w: "majority", j: true, wtimeout: 5000 } })
// w: "majority" — acknowledged by majority of voting nodes
// j: true — journaled (fsynced) before acknowledgement
// wtimeout — timeout in ms; returns error if not satisfied in time
| Write concern | Durability | Latency |
|---|---|---|
w: 0 | None — fire and forget | Minimal |
w: 1 | Primary ack only — lost if primary fails before replicating | Low |
w: "majority" | Survives primary failure — majority has the write | Medium |
j: true | Fsynced to journal — survives process crash | Adds fsync latency |
Change streams: Applications can subscribe to a real-time stream of changes (inserts, updates, deletes) on a collection, database, or cluster. Built on top of the oplog; provides a resumable cursor using a resume token — if the consumer disconnects, it can resume from where it left off.
const stream = db.orders.watch([{ $match: { "fullDocument.status": "shipped" } }]);
stream.on("change", change => console.log(change.fullDocument));Sharding
Sharding horizontally partitions a collection across multiple shard servers. Each shard is itself a replica set.
mongos (query router)
/ | \
Shard 1 Shard 2 Shard 3 ← each shard is a replica set
(RS: 3 nodes) (RS: 3) (RS: 3)
\ | /
Config Servers
(RS: stores metadata,
chunk→shard mapping)Shard Key Selection
The shard key determines how documents are distributed. It is immutable after collection creation — a bad shard key requires re-sharding the entire collection.
Three properties to evaluate:
| Property | Goal | Bad example | Good example |
|---|---|---|---|
| Cardinality | High — enough distinct values to fill all chunks | status (3 values → max 3 shards) | user_id (millions of distinct values) |
| Frequency | Even — no single value dominates | country_code (50% US traffic → US chunk is hot) | user_id (roughly even distribution) |
| Monotonic rate of change | Avoid — always-increasing keys route all writes to one chunk | created_at, ObjectId | hashed(user_id), (user_id, created_at) |
Range sharding: Documents with adjacent shard key values are co-located in the same chunk. Supports efficient range queries on the shard key. Risk: monotonically increasing keys (ObjectId, timestamp) create insert hotspots on the last chunk.
Hashed sharding: MongoDB hashes the shard key value before distributing. Writes are evenly spread across shards. Range queries on the shard key require scatter-gather (fan out to all shards).
// Range sharding — supports range queries, risk of hotspot on monotonic key
sh.shardCollection("db.orders", { customer_id: 1 })
// Hashed sharding — even distribution, no range query support on shard key
sh.shardCollection("db.orders", { customer_id: "hashed" })
// Compound shard key — common pattern: entity ID + time bucket
sh.shardCollection("db.events", { user_id: 1, day: 1 })Chunk Migration
MongoDB divides each shard’s key range into chunks (default 128 MB). The balancer automatically migrates chunks between shards to maintain even distribution.
Migration cost: Chunk migration moves data between shards over the network while still serving live traffic. Heavy migrations during peak traffic impact query latency. Schedule balancer windows during off-peak hours:
sh.setBalancerState(false) // disable balancer
sh.startBalancer() // re-enable
db.settings.updateOne(
{ _id: "balancer" },
{ $set: { activeWindow: { start: "01:00", stop: "05:00" } } }
)Scatter-Gather
If a query does not include the shard key, mongos fans it out to all shards and merges results. For a 10-shard cluster, one query becomes 10 queries. Scatter-gather is acceptable for infrequent admin queries but is a performance anti-pattern on hot paths.
Always include the shard key in query predicates on high-traffic paths.
Transactions
Single-document operations are always atomic — an update to a document with embedded sub-documents and arrays is atomic without any transaction overhead. This is why embedding related data (when appropriate) is preferable to referencing.
Multi-document transactions (MongoDB 4.0+):
const session = client.startSession();
session.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } });
try {
db.accounts.updateOne({ _id: A }, { $inc: { balance: -100 } }, { session });
db.accounts.updateOne({ _id: B }, { $inc: { balance: +100 } }, { session });
await session.commitTransaction();
} catch (e) {
await session.abortTransaction();
}Multi-document transactions use the same MVCC snapshot isolation as the replica set oplog. They add coordination overhead — prefer single-document atomicity via embedding where possible. Transactions across shards (distributed transactions) add further latency due to two-phase commit coordination across shard replica sets.