Recommendation System Design

Recommendation System Design

A user opens Netflix. From a catalog of 100,000 titles, the home page shows 30 recommendations — and 80% of what people watch comes from these recommendations. The system must decide which 30 of 100,000 in under 200ms, personalized to this specific user, against this specific time of day, on this specific device. Scoring all 100,000 titles with a heavy ML model on every page load is impossible — even at 1ms per scoring call, that’s 100 seconds. The fundamental architectural pattern that makes recommendation feasible is the two-stage funnel: retrieval narrows the corpus, ranking sorts the survivors.

The Two-Stage Architecture

flowchart LR
    Corpus[(Item corpus
100K – 100M items)] Corpus -->|retrieval
~10ms
cheap features| Cands[Candidates
~500 items] Cands -->|ranking
~50ms
rich features| Ranked[Ranked top-K
~30 items] Ranked -->|re-ranking
diversity, business rules| Final[Final list
shown to user]
StageInput sizeOutput sizeModel complexityLatency budgetOptimizes for
Retrieval (candidate gen)Millions of items~500–1000Cheap: ANN lookup, embedding similarity, simple rules<20msRecall — don’t miss good items
Ranking~500 candidates~30–100Heavy: gradient-boosted trees or DNN with hundreds of features<100msPrecision — order them well
Re-ranking / policy~30 ranked~30 finalRules + light models<10msDiversity, freshness, fairness, business constraints

Why two stages? A heavy ranking model with rich features (user history, item metadata, context) is too slow to apply to millions of items. A cheap retrieval model can shortlist reasonable candidates in milliseconds, leaving the heavy model to do precise ordering on a small set. This is the same divide-and-conquer pattern as search: an inverted index does retrieval, BM25 + learning-to-rank does ordering.

Retrieval Stage

The job of retrieval is to take a user (and context) and produce a few hundred candidate items from a corpus that may be millions or billions of items, with high recall and very low latency.

Approach 1: Two-Tower Embedding Model (Modern Default)

Train a neural network with two towers — one that maps users to a 128-dimensional embedding, another that maps items to the same 128-dimensional space — such that the dot product of (user, item) embeddings predicts engagement (click, watch, purchase).

flowchart TB
    subgraph "Training (offline)"
        UF[User features
history, demographics] --> UT[User Tower
DNN] IF[Item features
category, text, image] --> IT[Item Tower
DNN] UT -->|user embedding
128-dim| Dot[Dot product] IT -->|item embedding
128-dim| Dot Dot -->|score| Loss[Loss vs. label
clicked / not clicked] end subgraph "Serving (online)" UReq[User request] --> UTServe[User tower
~5ms] UTServe -->|user embedding| ANN[ANN index
FAISS / ScaNN] Items[(All item embeddings
pre-computed offline)] --> ANN ANN -->|top-500 nearest items
~10ms| Cand[Candidates] end

Why this is fast:

  • Item embeddings are pre-computed offline for all 100M items and loaded into an Approximate Nearest Neighbor (ANN) index — a few GB in memory.
  • At request time, only the user tower is run (one forward pass, ~5ms).
  • The user embedding is then used to query the ANN index for the top-500 nearest items in ~5–10ms.
  • Total: ~15ms instead of running 100M dot products.

Approximate Nearest Neighbor (ANN) algorithms — HNSW (Hierarchical Navigable Small World), IVF (Inverted File), product quantization — give 95%+ recall at 100–1000× the speed of exact search.

ANN LibraryApproachBest For
FAISS (Meta)IVF + product quantizationLargest scale, GPU-friendly, batch search
ScaNN (Google)Anisotropic quantizationBest recall-vs-speed trade-off for inner product
HNSW (open source)Graph-based navigationBest latency for moderate corpora; used in Pinecone, Weaviate, OpenSearch
Annoy (Spotify)Random projection treesSimple, file-based, easy to rebuild

Approach 2: Multiple Retrieval Sources (Production Reality)

In production, retrieval is rarely one model — it’s a union of multiple sources, each capturing a different signal:

Final candidate set = union of:
  - Two-tower ANN: ~300 items (personalized to user history)
  - Collaborative filtering: ~100 items ("users who liked X also liked Y")
  - Trending / popular: ~50 items (cold-start safety net)
  - Content-based: ~50 items (similar to items user recently engaged with)
  - Editorial / business rules: ~20 items (curated, sponsored, must-show)

Deduplicate → ~500 unique candidates → send to ranker

This redundancy is intentional — each source has different failure modes, and the ranker decides which actually win.

Ranking Stage

Given ~500 candidates, score each one with a model that uses many more features than retrieval could afford.

Feature Categories

CategoryExamplesWhere it comes from
User featuresAge, country, language, subscription tier, account age, device typeUser profile DB / feature store
User historyLast 100 items watched, average watch time, genre preferences, time-of-day patternsFeature store (computed offline + streaming)
Item featuresCategory, tags, popularity, freshness (hours since published), creator, embeddingItem metadata + offline batch
Context featuresTime of day, day of week, current device, network speed, page surfaceRequest payload
User × item interactions“Has user watched anything from this creator?”, “How many items in this category has user clicked this week?”Computed at request time from user + item features
Cross featurescategory × time_of_day, device × content_typeFeature engineering

Model Choices

ModelStrengthsWeaknessesUsed by
Gradient-boosted trees (XGBoost, LightGBM)Strong baselines, handle mixed dense/sparse features, fast inference, easy to debugDon’t easily use raw embeddings; harder to scale beyond ~1000 featuresMany production rankers; common starting point
Deep & Wide (Google)Combines memorization (wide linear) with generalization (deep DNN)More infra; longer trainingGoogle Play, YouTube
DLRM (Meta)Embeddings for sparse features + DNN; designed for ranking at scaleComplex to trainMeta Ads, Instagram
Transformer-based (BERT4Rec, SASRec)Capture sequential user behaviorHigher latency; need careful servingTikTok, modern recommenders

Production starting point: Gradient-boosted trees on hundreds of features. Simple to operate, strong baseline, easy to interpret. Move to DNN only when GBT plateaus.

Training Objective: Click-Through Rate Is Not Enough

A naive model trained on predict(clicked) will recommend clickbait — content that gets clicks but disappoints. Real systems use multi-objective learning:

Final score = w₁ × P(click) + w₂ × P(complete watch) + w₃ × P(rating ≥ 4) - w₄ × P(skip in <10s)

Each P(...) is a head of a multi-task model. Weights are tuned via A/B tests against a long-term north-star metric (subscriber retention, revenue, daily active users) — not the proxy metric the model directly optimizes.

⚠️

Beware proxy metrics. Optimizing CTR aggressively often reduces long-term engagement because clickbait erodes user trust. Always validate model improvements against a north-star metric (D7 retention, sessions per week) over a long enough A/B window — typically 2–4 weeks.

Re-ranking and Business Rules

Even a perfectly trained ranker outputs a list that may need adjustments before showing to the user:

AdjustmentWhyExample
DiversityAll-similar list feels boring and risky (one bad recommendation looks like all bad)MMR (Maximal Marginal Relevance) — penalize items too similar to already-selected items
FreshnessBoost newly added content to seed engagement signalMultiplicative boost decaying over 7 days from publish
Already-seen filteringDon’t re-recommend items the user has already consumedFilter against user’s recent watch history
Fairness / exposureLong-tail items need impressions to gather signal; small creators need reachInject N items per page from underexposed slate
Business rulesSponsored content, regional licensing, age restrictionsHard filter for licensing; soft slot for promoted content
Position bias correctionSlot 1 gets clicks regardless of quality; training labels are biasedInverse propensity weighting in training

End-to-End Request Flow

sequenceDiagram
    participant U as User
    participant API as Reco API
    participant FS as Feature Store
    participant ANN as ANN Index
(item embeddings) participant UT as User Tower
Service participant R as Ranker participant RR as Re-ranker U->>API: GET /home (user_id=42, device=mobile) API->>FS: Fetch user features + recent history FS-->>API: features API->>UT: Compute user embedding UT-->>API: user_emb (128-dim) API->>ANN: Top-500 nearest items to user_emb ANN-->>API: 500 candidate item_ids API->>FS: Fetch item features for 500 candidates FS-->>API: item features (batch) API->>R: Score 500 candidates R-->>API: scored list API->>RR: Apply diversity, freshness, business rules RR-->>API: top-30 API-->>U: 30 recommendations

Latency budget for ~200ms total:

  • Feature fetch (user + 500 items): ~30ms (parallel batched reads)
  • User tower forward pass: ~5ms
  • ANN lookup: ~10ms
  • Ranker scoring 500 items: ~50ms (batched)
  • Re-ranking: ~5ms
  • Network + serialization: ~50ms

A/B Testing and Online Evaluation

Offline metrics (AUC, NDCG) only correlate weakly with real impact. Every model change must be validated by an online A/B test.

Experiment Setup

Hash(user_id) % 100 →
  bucket 0–49:  control (current model)
  bucket 50–99: treatment (new model)

Run for at least 1–2 weeks
Track: CTR, watch time, D1/D7 retention, revenue
Decide: ship if treatment wins on north-star metric AND no regression on guardrails

Common Pitfalls

PitfallWhat goes wrongMitigation
Novelty effectUsers click new things just because they’re new; effect fades after 1–2 weeksRun experiments for ≥2 weeks; look at trends, not just averages
Selection biasOnly engaged users see new model often enough to reactStratify by activity tier; report effect per tier
Network effectsTreatment users influence control users (e.g., “trending” lists shared across buckets)Cluster-randomized experiments — bucket by household, region, or community
Multiple comparisonsRun 50 experiments, ~2-3 will look significant by chanceBonferroni correction or sequential testing (e.g., always-valid p-values)
Guardrail metricsModel improves CTR but tanks watch-completionMonitor a basket of metrics; require no regression on key guardrails
Sample ratio mismatchTreatment bucket has 51% of users, not 50% — invalidates comparisonRun SRM check before reading any results

Statistical Significance

For a binary metric (CTR), required sample size to detect a relative lift Δ at baseline p:

n ≈ 16 × p × (1-p) / (Δ × p)²   per arm

Example: baseline CTR = 5%, want to detect 2% relative lift (5.0% → 5.1%)
n ≈ 16 × 0.05 × 0.95 / (0.001)² ≈ 760,000 users per arm

Practical takeaway: detecting small but real lifts requires millions of impressions.

Cold Start

The hardest production problem in recommendation. Two flavors:

New User Cold Start

A user with no interaction history breaks the personalization model — there’s no behavior to learn from.

StrategyHow it worksWhen to use
Popular / trendingShow globally popular items as defaultAlways — strong fallback for the first session
Onboarding surveyAsk user to pick 3–5 interests on signupHigh-friction surfaces; high-value users
Demographic priorsUse country, age, language to seed preferencesLow-friction; respect privacy
Contextual banditTreat first sessions as exploration; arms = content categoriesWhen you have enough cold-start traffic to learn
Implicit signalsUse referrer, device, time of signup as weak featuresAlways — costs nothing

New Item Cold Start

A newly uploaded item has no clicks, watches, or ratings — collaborative filtering can’t place it.

StrategyHow it works
Content-based featuresUse the item’s text, image, audio embeddings (e.g., CLIP, BERT) — these don’t require interaction data
Creator priorsNew item from a popular creator inherits a prior based on creator’s average performance
Forced explorationReserve a slot in every user’s feed for new items; bandit-style allocation to gather signal
Two-tower with content featuresItem tower includes text/image embeddings, so new items get reasonable embeddings before any interactions

Offline Pipeline

Real-time serving sits on top of a heavy offline data pipeline:

flowchart TB
    Logs[Click / Watch logs
Kafka] --> DL[(Data Lake
Parquet on S3)] DL --> FE[Feature Engineering
Spark / dbt] FE --> FS[(Feature Store
online + offline)] DL --> Train[Training Pipeline
nightly / weekly] FS --> Train Train --> MR[Model Registry
versioned models] MR -->|deploy| RankSvc[Ranker Service] MR -->|deploy| TwoTower[Two-Tower Service] TwoTower -->|export item
embeddings| ANNBuild[Build ANN Index] ANNBuild --> ANNServe[ANN Serving] Logs -->|streaming| FS

Refresh cadence:

ComponentCadenceWhy
Item embeddingsDaily–weeklyItems don’t change quickly; rebuild index nightly
User embeddingsReal-time or hourlyUser behavior shifts within a session — recent activity matters
Ranker modelWeeklyMulti-day training; A/B test before promotion
Trending lists5–15 minutesCaptures news cycles, viral content
User history featuresStreamingLast-action-aware features need sub-minute freshness

Production Examples

CompanyNotable approach
YouTubeTwo-tower retrieval + DNN ranker with multi-task heads (CTR, watch time, satisfaction surveys); session-based features
NetflixHeavy content-based features for cold start; row-level personalization (which row appears at top + what’s in it); offline-online consistency
TikTokSequential models on watch behavior; very fast feedback loop — model retrains on hours-old data; aggressive exploration for new content
SpotifyTwo-tower for retrieval; collaborative filtering for “Discover Weekly”; audio embeddings for cold-start track placement
Meta (Instagram, Reels)DLRM-style ranker; multi-objective with explicit user controls (close friends, “see less of this”)
Amazon“Customers who bought X also bought Y” was the original collaborative filtering at scale; now neural rankers with rich session features
ℹ️

Interview tip: When asked to design a recommendation system, lead with the two-stage architecture: “I’d split this into a retrieval stage that narrows millions of items to ~500 candidates in under 20ms — using a two-tower model where item embeddings are pre-computed offline and served from an ANN index like FAISS or ScaNN — followed by a heavy ranking stage that scores those 500 candidates with hundreds of features in ~50ms using a gradient-boosted tree or DNN. I’d train on multi-objective labels — clicks alone optimize for clickbait, so I’d combine click, dwell time, and explicit signals like ratings. Cold start is handled with content-based features for new items and popularity fallbacks for new users. Every change is validated by a 1–2 week A/B test against a north-star metric like retention, with guardrails on watch completion to catch clickbait regressions. Behind serving, a feature store keeps online and offline features consistent, and a daily training pipeline retrains the model with the latest interaction data.”

Test Your Understanding

Your recommendation system optimizes for click-through rate (CTR) and achieves 15% CTR in A/B testing — up from 10%. But average watch time drops from 8 minutes to 3 minutes, and user retention falls 5% over the next month. What went wrong?

You optimized for the wrong metric. High CTR with low watch time means the model learned to recommend clickbait — thumbnails and titles that attract clicks but disappoint users. Users click more but watch less and eventually churn.

Fix: Use a multi-objective loss function that combines CTR, watch time, and explicit signals (likes, shares, “not interested”). Weight the objectives to reflect long-term user value:

score = 0.2 * P(click) + 0.5 * E(watch_time) + 0.3 * P(positive_engagement)

Guardrail metrics: In every A/B test, track watch completion rate and 7-day retention as guardrails. If the primary metric (CTR) improves but guardrails regress, reject the experiment. TikTok and YouTube both use multi-objective ranking with retention as the north-star guardrail.

A brand-new user opens your app for the first time. They have zero interaction history. Your collaborative filtering model can’t generate any signal because it relies on user-item co-occurrence. What do you recommend, and how do you quickly escape the cold start?

Cold start strategy (layered):

  1. Immediate (0 interactions): Show globally popular items, filtered by coarse signals available at signup (country, language, device type, referral source). If the user came from a cooking blog, bias toward food content.
  2. After 1-5 interactions: Use content-based features. If the user watched a Python tutorial, recommend other programming content using item embeddings (not user history).
  3. After 10-20 interactions: Collaborative filtering kicks in — enough co-occurrence signal to find similar users.

Accelerate cold start with explicit onboarding: Ask the user to select 3-5 interests on first launch (Netflix, Spotify, Pinterest all do this). These selections seed the user embedding directly, skipping the “blind popularity” phase.

The two-tower model handles cold start elegantly: the user tower uses available features (demographics, device) even without history, producing a reasonable embedding that retrieves content-based candidates from the item tower’s ANN index.

Your two-tower retrieval model pre-computes item embeddings offline and stores them in a FAISS index. A new item is uploaded at 3 PM. It won’t appear in recommendations until the nightly embedding rebuild at 2 AM. How do you reduce this cold-start-for-items latency?

Three approaches, in increasing complexity:

  1. Popularity injection. Maintain a “new items” pool. For a configurable window (e.g., 24 hours), randomly insert new items into the retrieval candidates at a low rate (1-5% of slots). Gather interaction data, then fold into the next embedding rebuild.

  2. Online item embedding. Compute the new item’s embedding at upload time using the item tower model (content features: title, description, category, image embeddings). Insert it into the FAISS index immediately. The embedding is approximate (no collaborative signal yet) but better than nothing.

  3. Incremental index updates. Use a FAISS IndexIVF with add() support, or use ScaNN’s online serving mode. New items are added to the index within minutes of upload. A full rebuild runs nightly to optimize the index structure.

Option 2 is the standard production approach — content features give a reasonable initial embedding, and collaborative signal improves it over subsequent rebuilds.