
How to Design a Recommendation System That Actually Works

LDS Team
Let's Data Science

Netflix credits its recommendation engine with saving over **$1 billion** annually in reduced churn. TikTok built a global empire not on social graphs but on an algorithmic feed that predicts your taste better than you can describe it. Behind both platforms sits the same fundamental architecture: a multi-stage funnel that narrows millions of items to a handful of perfect suggestions in under 200 milliseconds.

This guide walks through how to design that system from scratch. We'll build a movie and show recommendation platform (think Netflix or Spotify) that serves 100 million daily active users, covering every layer from data ingestion to the final ranked feed. The focus is on the architecture decisions that separate production systems from academic prototypes.

System Requirements and Scale Estimation

Every recommendation system starts with constraints. For our streaming platform, here are the numbers that shape every architectural choice.

| Requirement | Value | Implication |
| --- | --- | --- |
| Daily Active Users | 100M | ~50,000 QPS at peak |
| Content Library | 10M titles | Too large to score exhaustively per request |
| P99 Latency | < 200ms | Must separate fast retrieval from slow ranking |
| Interaction Logs | ~500GB/day | Requires distributed streaming (Kafka), not direct DB writes |
| Availability | 99.99% | Feed must load even with stale recommendations |

Key Insight: The 200ms latency budget is the single constraint that forces the funnel architecture. You cannot run a deep neural network over 10 million items in 200ms. You must filter first, then rank.

At this scale, the architecture splits into three layers: an online serving layer (sub-200ms responses), a nearline layer (minute-level feature updates via Flink or Spark Streaming), and an offline training layer (daily or weekly model retraining on GPU clusters).

High-level recommendation system architecture showing online serving, nearline, and offline training layers

The Funnel Architecture: Retrieval, Ranking, Re-Ranking

The funnel is the most important concept in recommendation system design. It works like a hiring pipeline: a broad sourcing pass, then increasingly selective interviews.

Stage 1: Retrieval (Candidate Generation)

Retrieval eliminates 99.95% of the catalog in under 10ms. The goal is high recall, not precision. Missing a good title here means it never reaches the user.

Common retrieval strategies run in parallel and merge their candidates:

| Strategy | How It Works | Strengths |
| --- | --- | --- |
| Two-Tower Embeddings | Encode user and item into the same vector space, find nearest neighbors | Captures collaborative signals, fast ANN lookup |
| Content-Based Filtering | Match item features (genre, director, actors) to user preferences | Works for new items with no interaction history |
| Popularity-Based | Surface trending or globally popular titles | Reliable cold-start fallback |
| Graph-Based | Random walks on user-item interaction graphs (like Twitter's GraphJet) | Discovers indirect connections |

Each source contributes ~100-500 candidates, producing a merged pool of roughly 500-1,000 titles.
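The merge step above can be sketched in a few lines. This is a minimal illustration, not a production implementation: function and variable names (`merge_candidates`, `sources`) are my own, and real systems also carry per-source scores and provenance metadata through the merge.

```python
def merge_candidates(sources, max_pool=1000):
    """Merge candidate lists from parallel retrieval sources,
    deduplicating by item id while preserving source order."""
    seen = set()
    pool = []
    for source_name, items in sources.items():
        for item_id in items:
            if item_id not in seen:
                seen.add(item_id)
                pool.append(item_id)
            if len(pool) >= max_pool:
                return pool
    return pool

# Three sources each contribute candidates; duplicates are dropped.
sources = {
    "two_tower": [101, 102, 103],
    "content_based": [102, 104],
    "popularity": [105, 101],
}
merged = merge_candidates(sources)  # [101, 102, 103, 104, 105]
```

In practice the merged pool also records which source produced each candidate, since that becomes a useful ranking feature downstream.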

Stage 2: Ranking (Scoring)

The ranker applies a heavier model to the candidate pool. Since it only scores ~500 items (not 10 million), it can afford complex feature interactions and cross-attention.

Production rankers typically use architectures like Wide & Deep or DLRM (Deep Learning Recommendation Model). The wide component memorizes specific user-item associations ("users who watched Movie A also clicked Movie B"), while the deep component generalizes across features ("sci-fi fans tend to enjoy tech documentaries").

The ranking stage also combines multiple objectives:

$$\text{Score} = w_1 \cdot P(\text{Click}) + w_2 \cdot P(\text{Complete}) + w_3 \cdot P(\text{Like}) - w_4 \cdot P(\text{Regret})$$

Where:

  • $P(\text{Click})$ is the predicted click-through probability
  • $P(\text{Complete})$ is the probability the user finishes the title
  • $P(\text{Like})$ is the probability of an explicit positive signal (thumbs up, add to list)
  • $P(\text{Regret})$ is the probability the user feels the recommendation was poor
  • $w_1, w_2, w_3, w_4$ are business-tuned weights from A/B tests

In Plain English: If you only optimize for clicks, you get clickbait thumbnails. If you only optimize for watch time, you get addictive but low-quality content. The weighted combination balances immediate engagement with long-term user satisfaction. YouTube learned this the hard way and now uses Multi-gate Mixture-of-Experts (MMoE) to train separate expert networks for each objective.
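Here is a toy numerical version of that trade-off. The weights below are made up for illustration; real values come out of A/B tests.

```python
def ranking_score(p_click, p_complete, p_like, p_regret,
                  w=(0.3, 0.4, 0.2, 0.5)):
    """Weighted multi-objective score. Weights are illustrative,
    not tuned values from any real system."""
    w1, w2, w3, w4 = w
    return w1 * p_click + w2 * p_complete + w3 * p_like - w4 * p_regret

# A clickbait title (high click, high regret) loses to a solid one.
clickbait = ranking_score(p_click=0.9, p_complete=0.2, p_like=0.05, p_regret=0.6)
solid     = ranking_score(p_click=0.4, p_complete=0.7, p_like=0.30, p_regret=0.05)
```

With these weights, `clickbait` scores about 0.06 while `solid` scores about 0.44: the regret penalty does the work that click prediction alone cannot.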

Stage 3: Re-Ranking (Business Logic)

The final stage applies rules that no ML model should learn implicitly:

  • Diversity: Don't show 10 crime dramas in a row. Force category variety.
  • Freshness: Boost newly released content to give it initial signal.
  • Deduplication: Remove already-watched titles.
  • Content policy: Filter restricted content based on user profile.

The output is the final 10-20 titles displayed in the user's feed.
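The re-ranking rules above are plain imperative logic, not ML. A minimal sketch, with hypothetical names (`rerank`, `max_per_genre`) and only two of the four rules (deduplication and diversity) implemented:

```python
def rerank(ranked, watched, max_per_genre=3, feed_size=10):
    """Apply business rules to a score-ordered candidate list:
    drop already-watched titles, cap titles per genre for diversity."""
    genre_counts = {}
    feed = []
    for item_id, genre in ranked:  # best score first
        if item_id in watched:
            continue  # deduplication rule
        if genre_counts.get(genre, 0) >= max_per_genre:
            continue  # diversity rule
        genre_counts[genre] = genre_counts.get(genre, 0) + 1
        feed.append(item_id)
        if len(feed) == feed_size:
            break
    return feed

ranked = [(1, "crime"), (2, "crime"), (3, "crime"), (4, "crime"),
          (5, "comedy"), (6, "crime"), (7, "docu")]
feed = rerank(ranked, watched={2}, max_per_genre=2, feed_size=5)  # [1, 3, 5, 7]
```

Freshness boosts and content-policy filters slot into the same loop as additional score adjustments and skip conditions.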

Funnel architecture showing items narrowing from 10M through retrieval, ranking, and re-ranking stages

Two-Tower Models and Embedding Retrieval

The two-tower model is the workhorse of modern retrieval. It encodes users and items into the same dense vector space, where proximity indicates relevance. The concept builds directly on the principles covered in Text Embeddings Explained.

Architecture

One neural network (the user tower) processes user features (watch history, demographics, session context) into a d-dimensional vector. A separate network (the item tower) processes item features (title, genre, cast, description) into a vector of the same dimensionality.

The similarity score between a user uu and item vv is their dot product:

$$S(u, v) = u^T v = \sum_{i=1}^{d} u_i \cdot v_i$$

Where:

  • $u \in \mathbb{R}^d$ is the user embedding vector
  • $v \in \mathbb{R}^d$ is the item embedding vector
  • $d$ is the embedding dimension (typically 64-256)

In Plain English: Both towers map their inputs into a shared geometric space. A user who loves Christopher Nolan thrillers ends up near Inception, Tenet, and The Prestige in this space. The dot product measures alignment: if both vectors point in the same direction, the score is high.

The critical advantage: item embeddings are computed offline and indexed. Only the user tower runs at request time, producing a query vector that searches against the pre-built index. This is what makes retrieval fast.
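The serving-time asymmetry looks like this in miniature. Random vectors stand in for tower outputs here (real towers are trained neural networks), and the brute-force matrix product stands in for the ANN index:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # embedding dimension (64-256 in production)
n_items = 1000

# Offline: item-tower outputs, pre-computed once and indexed.
item_embeddings = rng.standard_normal((n_items, d)).astype(np.float32)

# Online: one user-tower forward pass per request yields a query vector.
user_embedding = rng.standard_normal(d).astype(np.float32)

# Dot-product scoring against the whole catalog; production replaces
# this O(N) scan with an ANN index (Faiss, ScaNN) for sub-ms lookup.
scores = item_embeddings @ user_embedding
top_k = np.argsort(-scores)[:5]   # indices of the 5 highest-scoring items
```

The key point is in the shapes: at request time you compute one $d$-dimensional vector, never $N$ of them.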

Training with Negative Sampling

The model trains on positive pairs (user watched a title) and negative pairs (user did not interact). Without negatives, the model collapses to predicting everything as relevant.

| Strategy | Description | Trade-off |
| --- | --- | --- |
| Random Negatives | Sample random items as negatives | Easy to implement, but too easy for the model |
| In-Batch Negatives | Other users' positives become your negatives | Free compute, industry standard at Google and Pinterest |
| Hard Negatives | Items the user almost engaged with but didn't | Strong learning signal, can destabilize training |
| Mixed (80/20) | 80% random + 20% hard negatives | Best balance in practice |

Pro Tip: In-batch negatives introduce popularity bias because frequently interacted items appear more often as negatives. Correct for this by log-scaling the sampling probability: items with more interactions get downweighted as negatives.
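The in-batch scheme and the logQ-style popularity correction can both be seen in a small sampled-softmax loss. This is a sketch under simplifying assumptions (pre-computed embeddings, no temperature, NumPy instead of an autodiff framework); the function name is my own.

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb, item_log_freq=None):
    """Sampled-softmax loss where row i's positive item (column i)
    competes against every other user's positive in the batch.
    Optional correction subtracts log sampling frequency so popular
    items are downweighted as negatives."""
    logits = user_emb @ item_emb.T            # (B, B) similarity matrix
    if item_log_freq is not None:
        logits = logits - item_log_freq       # popularity-bias correction
    # Numerically stable row-wise log-softmax; label for row i is column i.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
B, d = 4, 8
loss = in_batch_softmax_loss(rng.standard_normal((B, d)),
                             rng.standard_normal((B, d)))
```

In a real trainer the same computation runs on GPU batches of thousands, which is why in-batch negatives are effectively free: the `(B, B)` logit matrix is one matmul you were going to pay for anyway.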

With 10 million item vectors indexed, brute-force search ($O(N)$) is too slow. Production systems use Approximate Nearest Neighbor (ANN) algorithms that trade a small accuracy loss for $O(\log N)$ search time.

| Technology | Latency | Best For |
| --- | --- | --- |
| Faiss (Meta) | < 1ms | Self-hosted, billions of vectors, best performance |
| ScaNN (Google) | < 1ms | CPU-optimized, excellent accuracy/speed trade-off |
| Milvus / Pinecone | ~5-10ms | Managed services, faster time-to-production |

HNSW (Hierarchical Navigable Small World) is the most common index type. It builds a multi-layer graph where the top layers provide coarse navigation (highways) and bottom layers provide fine-grained search (local roads). Pinterest uses HNSW to search billions of image embeddings in milliseconds.

The Feature Store: Why Pre-Computation Wins

During a 200ms request, you cannot compute "user's average watch time over 30 days" from raw logs. The feature store solves this by pre-computing aggregate features offline and serving them from Redis at request time.

| Feature Category | Example | Update Frequency |
| --- | --- | --- |
| User long-term | Top 3 genres, average session length | Hourly batch |
| User short-term | Last 5 watched titles, current session genre mix | Near real-time via Kafka |
| Item static | Genre, cast, director, release year | On content ingestion |
| Item dynamic | Trending score, recent CTR, completion rate | Hourly batch |
| Cross features | User-genre affinity score, time-of-day preferences | Hourly batch |

Common Pitfall: The most frequent production failure isn't a bad model. It's a broken data pipeline. A stalled Kafka consumer means stale features, which means yesterday's preferences drive today's recommendations. Monitor feature freshness as aggressively as model accuracy.

This infrastructure pattern mirrors what Retrieval-Augmented Generation (RAG) systems use: pre-computed embeddings stored in vector databases for fast retrieval at inference time.
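The request-time read path is a key-value lookup plus staleness check. The sketch below uses an in-memory dict standing in for Redis (in production these would be `GET`/`HGETALL` calls against a Redis cluster), and all key names and fields are hypothetical:

```python
import json
import time

# Stand-in for Redis: keys follow a "user:{id}:{window}" convention.
feature_store = {
    "user:42:long_term":  json.dumps({"top_genres": ["sci-fi", "thriller"],
                                      "avg_session_min": 48}),
    "user:42:short_term": json.dumps({"last_watched": [9021, 8833],
                                      "updated_at": time.time()}),
}

def fetch_user_features(user_id, max_staleness_sec=3600):
    """Assemble the ranker's feature dict from pre-computed entries.
    Flags stale short-term features instead of failing the request,
    matching the 99.99% availability requirement."""
    long_term = json.loads(feature_store[f"user:{user_id}:long_term"])
    short_term = json.loads(feature_store[f"user:{user_id}:short_term"])
    stale = (time.time() - short_term["updated_at"]) > max_staleness_sec
    return {**long_term, **short_term, "short_term_stale": stale}

features = fetch_user_features(42)
```

The `short_term_stale` flag is the monitoring hook the pitfall above calls for: the serving layer can degrade gracefully while alerting on freshness.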

Cold Start Strategies

Cold start is the Achilles' heel of collaborative filtering. New users have no watch history. New titles have no engagement data. Both need special handling.

New User Strategies

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Global Popularity | Show trending titles | Default fallback, always works |
| Demographic Similarity | Match users by age, region, language | When registration data is available |
| Onboarding Quiz | Ask explicit genre/mood preferences | When user friction is acceptable |
| Exploration via Bandits | Thompson Sampling balances showing known-good vs. discovering preferences | Best theoretical approach, used at Netflix and Spotify |

New Item Strategies

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Content-Based Features | Generate embeddings from title, description, cast using the item tower | Immediate, no interactions needed |
| Creator Transfer | New title inherits its director's/studio's audience profile | Works well for sequels and established creators |
| Exploration Budget | Reserve 5-10% of impressions for new content | Gathers signal fast, slight quality trade-off |

Key Insight: Cold start isn't one problem; it's a decision tree. A new user watching a new item is the hardest case, requiring pure content-based signals. A returning user encountering a new item can blend collaborative and content approaches. Design your fallback chain explicitly.
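An explicit fallback chain is just ordered conditionals. A minimal sketch — strategy names, field names, and the 10-interaction threshold are all illustrative choices, not fixed industry values:

```python
def choose_strategy(user, has_content_model=True):
    """Decide which recommendation strategy applies, falling back
    from richest signal to weakest. Thresholds are illustrative."""
    if len(user.get("history", [])) >= 10:
        return "collaborative"      # enough interactions for the full pipeline
    if user.get("onboarding_genres") and has_content_model:
        return "content_based"      # explicit preferences, no history yet
    if user.get("region"):
        return "demographic"        # registration data only
    return "popularity"             # always-works final fallback

regular   = {"history": list(range(25))}
quiz_user = {"onboarding_genres": ["comedy"]}
new_user  = {}
```

Making the chain a single function also makes it testable: you can assert that every possible user profile resolves to some strategy, so the feed never comes back empty.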

Evaluation: Offline Metrics and Online A/B Tests

You cannot improve what you cannot measure. Recommendation systems require both offline evaluation (before deployment) and online A/B testing (after deployment). For a deeper treatment of evaluation methodology, see The Ultimate Guide to ML Metrics.

Offline Metrics

| Metric | What It Measures | Target |
| --- | --- | --- |
| Recall@K | Of all relevant items, what fraction appears in the top K? | > 0.5 for retrieval |
| NDCG@K | Ranking quality, penalizing relevant items placed too low | > 0.7 for ranking |
| AUC | Click prediction accuracy across all thresholds | > 0.8 |
| Coverage | What fraction of the catalog gets recommended? | > 30% (prevents popularity collapse) |

Online Metrics

| Metric | What It Measures | Why It Matters |
| --- | --- | --- |
| Watch Time | Total minutes consumed | North star for video platforms |
| D7 Retention | Users returning after 7 days | Long-term platform health |
| Diversity | Unique categories per feed | Prevents filter bubbles |
| CTR | Click-through rate | Leading indicator, but gameable with clickbait |

Common Pitfall: High offline NDCG does not guarantee improved online watch time. The offline evaluation uses historical data where user behavior was shaped by the old model. Always A/B test. Netflix maintains long-term holdout groups even after shipping a change, and they distrust suspiciously large effect sizes, which usually indicate bugs rather than breakthroughs.

LLM-Augmented Recommendations (March 2026)

The integration of large language models into recommendation systems is the most significant architectural shift since two-tower models. As of early 2026, three patterns have emerged in production systems.

LLMs as item encoders. Instead of training custom content encoders from scratch, teams are feeding item metadata (titles, descriptions, reviews) through pre-trained LLM embedding models. The resulting vectors capture semantic nuance that tag-based systems miss entirely. A documentary about "the psychological toll of competitive chess" and one about "mental health in professional gaming" land close together in embedding space, even though they share zero tags. Research frameworks like LEARN employ LLM-powered user and item encoders within two-tower structures for richer representations.

Conversational discovery. Spotify's DJ feature and Netflix's experimental chat interfaces let users describe what they want in natural language: "something light and funny for a rainy Sunday." An LLM interprets intent and maps it to retrieval queries. This complements the algorithmic feed rather than replacing it.

Reasoning over user history. LLMs can analyze a user's watch history and generate explanations: "Because you watched three space documentaries this week, you might enjoy this new series about the Mars mission." This adds interpretability that pure embedding models lack.

Pro Tip: LLMs are too slow and expensive for the retrieval stage (50K QPS on 10M items). The production pattern is using traditional two-tower models for high-throughput retrieval, then using LLMs for re-ranking the final 20-50 candidates or powering conversational features that sit alongside the main feed.
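Structurally, the LLM re-rank stage is a thin wrapper over the funnel's final candidates. In the sketch below `llm_score` is a stub standing in for a call to an actual LLM scoring endpoint; the function names and the toy scoring rule are entirely hypothetical.

```python
def llm_rerank(candidates, user_context, llm_score):
    """Re-rank only the funnel's short list of survivors (20-50 items),
    so the expensive scorer is called a bounded number of times."""
    scored = [(llm_score(user_context, title), title) for title in candidates]
    scored.sort(reverse=True)           # highest LLM score first
    return [title for _, title in scored]

# Stub: pretend the model prefers documentaries for this context.
def fake_llm_score(context, title):
    return 1.0 if "documentary" in title else 0.5

final = llm_rerank(
    ["space documentary", "action movie", "chess documentary"],
    "light and curious Sunday mood",
    fake_llm_score,
)
```

The design point is the bounded fan-out: the LLM sees tens of titles per request, never the 10M-item catalog, which keeps per-request cost and latency predictable.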

LLM-augmented recommendation pipeline showing traditional retrieval with LLM-powered re-ranking and conversational layer

When to Use Each Approach

Not every product needs the full architecture described above. The right design depends on your scale and signal density.

| Scale | Signal Density | Recommended Architecture |
| --- | --- | --- |
| < 100K users | Low (e-commerce, infrequent visits) | Content-based filtering + popularity baseline |
| 100K - 10M users | Medium | Two-tower retrieval + lightweight ranker (XGBoost or small DNN) |
| 10M - 100M users | High (streaming, social) | Full funnel with ANN retrieval, deep ranker, re-ranking rules |
| 100M+ users | Very high (TikTok-scale) | Add real-time training (TikTok's Monolith), sequential models (SASRec), MMoE multi-objective ranking |

When NOT to Use Collaborative Filtering

Collaborative filtering breaks down in specific scenarios. Recognizing these saves months of engineering:

  1. Extreme sparsity. If 99.9%+ of user-item pairs have no interaction, matrix factorization produces meaningless embeddings. Switch to content-based.
  2. Rapidly changing catalog. If items expire quickly (news, flash sales), item embeddings go stale before enough users interact. Use recency-weighted features instead.
  3. Privacy constraints. If you cannot store user interaction history, collaborative filtering is impossible. Content-based with session-only signals is the alternative.

Conclusion

Building a recommendation system that works at scale comes down to one architectural insight: separate fast approximate retrieval from slow precise ranking. The two-tower model handles retrieval by encoding users and items into the same vector space, ANN indexes like HNSW make that search sub-millisecond, and deep rankers like Wide & Deep score the surviving candidates against multiple business objectives.

The infrastructure matters as much as the algorithms. A feature store (Redis) pre-computes user signals so the serving layer never touches raw logs. Kafka ingests billions of daily events that flow through Spark into training pipelines. Model serving on GPU clusters (Triton) handles 50K QPS with tight latency budgets. When something breaks in production, it's almost always the data pipeline, not the model.

For readers building their first system, start with the feature selection fundamentals that underpin effective feature engineering, then move to cross-validation techniques for evaluating your models offline before committing to A/B tests. The gap between a demo and a production system is mostly infrastructure and monitoring, not smarter algorithms.

Frequently Asked Interview Questions

Q: Why do recommendation systems use a multi-stage funnel instead of a single model?

Scoring 10 million items with a complex deep neural network would take seconds, far exceeding the 200ms latency budget. The funnel uses a fast, lightweight retrieval model to narrow 10 million items to ~500 candidates, then applies an expensive ranking model only to those survivors. This is the same trade-off between recall and precision you see in search engines.

Q: Explain the two-tower architecture and why the towers are separate.

The user tower and item tower are separate neural networks that produce embeddings in the same vector space. Separation is key because item embeddings can be computed offline and indexed once, while the user tower runs online to capture current session context. This asymmetry makes retrieval fast: you only compute one user vector per request and search against a pre-built index.

Q: How would you handle a new user with no watch history?

Start with global popularity (trending titles), then layer in demographic signals (region, language, device) if available. Use Thompson Sampling to balance exploitation of known-good content with exploration to learn the new user's preferences. After 5-10 interactions, switch to the standard collaborative filtering pipeline. The key is having an explicit fallback chain, not a single strategy.

Q: What's the difference between in-batch negatives and hard negatives?

In-batch negatives reuse other users' positive examples as your negatives within the same training batch. It's computationally free and works well at scale (used by Google, YouTube, Pinterest). Hard negatives are items the user almost engaged with but didn't, providing a stronger learning signal but potentially destabilizing training. Most production systems use a mix: 80% in-batch, 20% hard negatives.

Q: A model improved offline NDCG by 5% but online watch time dropped. What happened?

Offline metrics use historical data shaped by the previous model's behavior, creating a distribution mismatch with live traffic. The new model might rank items that look good on paper but perform poorly when actually shown to users. This is why A/B testing is non-negotiable. Also check for position bias: the offline evaluation might not account for users' tendency to click whatever appears first.

Q: How do you prevent filter bubbles in a recommendation system?

Apply diversity constraints in the re-ranking stage, forcing the final feed to include titles from at least N different categories. Use exploration budgets (5-10% of impressions reserved for diverse content). Monitor coverage metrics: if 1% of your catalog gets 90% of impressions, popularity bias is dominating. Some teams also use Determinantal Point Processes (DPPs) to maximize diversity while maintaining relevance.

Q: Where do LLMs fit in a recommendation pipeline as of 2026?

LLMs are too slow and expensive for retrieval (50K QPS). They fit in three places: as item encoders that generate richer content embeddings offline, as re-rankers that score the final 20-50 candidates with deeper contextual understanding, and as conversational interfaces for natural language discovery ("something funny for a rainy Sunday"). The two-tower retrieval and deep ranking models still handle the high-throughput core.

Q: How would you estimate the infrastructure cost for a 100M-user recommendation system?

The major cost drivers are GPU model serving ($5K-10K/month for Triton clusters), event ingestion ($3K-5K/month for Kafka), and the feature store ($2K-4K/month for Redis). Total is roughly $17K-37K/month on AWS. The biggest optimization lever is model quantization (FP32 to INT8), which cuts GPU serving costs by 50-70%. At our scale, self-hosted Faiss saves ~$30K/month over managed vector databases like Pinecone.

Explore all career paths