Netflix credits its recommendation engine with saving over **$1 billion** annually in reduced churn. TikTok built a global empire not on social graphs but on an algorithmic feed that predicts your taste better than you can describe it. Behind both platforms sits the same fundamental architecture: a multi-stage funnel that narrows millions of items to a handful of perfect suggestions in under 200 milliseconds.
This guide walks through how to design that system from scratch. We'll build a movie and show recommendation platform (think Netflix or Spotify) that serves 100 million daily active users, covering every layer from data ingestion to the final ranked feed. The focus is on the architecture decisions that separate production systems from academic prototypes.
## System Requirements and Scale Estimation
Every recommendation system starts with constraints. For our streaming platform, here are the numbers that shape every architectural choice.
| Requirement | Value | Implication |
|---|---|---|
| Daily Active Users | 100M | ~50,000 QPS at peak |
| Content Library | 10M titles | Too large to score exhaustively per request |
| P99 Latency | < 200ms | Must separate fast retrieval from slow ranking |
| Interaction Logs | ~500GB/day | Requires distributed streaming (Kafka), not direct DB writes |
| Availability | 99.99% | Feed must load even with stale recommendations |
**Key Insight:** The 200ms latency budget is the single constraint that forces the funnel architecture. You cannot run a deep neural network over 10 million items in 200ms. You must filter first, then rank.
At this scale, the architecture splits into three layers: an online serving layer (sub-200ms responses), a nearline layer (minute-level feature updates via Flink or Spark Streaming), and an offline training layer (daily or weekly model retraining on GPU clusters).
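A quick back-of-envelope check of the peak QPS figure in the table, assuming roughly 10 feed requests per user per day and a 4x peak-to-average ratio (both assumptions, not stated requirements):

```python
# Back-of-envelope QPS estimate for a 100M-DAU platform.
# Assumptions: ~10 feed requests per user per day, peak traffic ~4x average.
DAU = 100_000_000
requests_per_user_per_day = 10   # assumed
peak_factor = 4                  # assumed

avg_qps = DAU * requests_per_user_per_day / 86_400  # seconds per day
peak_qps = avg_qps * peak_factor
print(f"average QPS: {avg_qps:,.0f}")  # ~11,600
print(f"peak QPS:    {peak_qps:,.0f}")  # ~46,300, in line with the ~50K figure
```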
*Figure: High-level recommendation system architecture showing online serving, nearline, and offline training layers*

## The Funnel Architecture: Retrieval, Ranking, Re-Ranking
The funnel is the most important concept in recommendation system design. It works like a hiring pipeline: a broad sourcing pass, then increasingly selective interviews.
### Stage 1: Retrieval (Candidate Generation)
Retrieval eliminates over 99.99% of the catalog in under 10ms, narrowing 10 million titles to roughly a thousand. The goal is high recall, not precision. Missing a good title here means it never reaches the user.
Common retrieval strategies run in parallel and merge their candidates:
| Strategy | How It Works | Strengths |
|---|---|---|
| Two-Tower Embeddings | Encode user and item into the same vector space, find nearest neighbors | Captures collaborative signals, fast ANN lookup |
| Content-Based Filtering | Match item features (genre, director, actors) to user preferences | Works for new items with no interaction history |
| Popularity-Based | Surface trending or globally popular titles | Reliable cold-start fallback |
| Graph-Based | Random walks on user-item interaction graphs (like Twitter's GraphJet) | Discovers indirect connections |
Each source contributes ~100-500 candidates, producing a merged pool of roughly 500-1,000 titles.
### Stage 2: Ranking (Scoring)
The ranker applies a heavier model to the candidate pool. Since it only scores ~500 items (not 10 million), it can afford complex feature interactions and cross-attention.
Production rankers typically use architectures like Wide & Deep or DLRM (Deep Learning Recommendation Model). The wide component memorizes specific user-item associations ("users who watched Movie A also clicked Movie B"), while the deep component generalizes across features ("sci-fi fans tend to enjoy tech documentaries").
The ranking stage also combines multiple objectives:
$$\text{score} = w_1\,P(\text{click}) + w_2\,P(\text{complete}) + w_3\,P(\text{like}) - w_4\,P(\text{dissatisfied})$$

Where:
- $P(\text{click})$ is the predicted click-through probability
- $P(\text{complete})$ is the probability the user finishes the title
- $P(\text{like})$ is the probability of an explicit positive signal (thumbs up, add to list)
- $P(\text{dissatisfied})$ is the probability the user feels the recommendation was poor
- $w_1, \dots, w_4$ are business-tuned weights from A/B tests
**In Plain English:** If you only optimize for clicks, you get clickbait thumbnails. If you only optimize for watch time, you get addictive but low-quality content. The weighted combination balances immediate engagement with long-term user satisfaction. YouTube learned this the hard way and now uses Multi-gate Mixture-of-Experts (MMoE) to train separate expert networks for each objective.
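A minimal sketch of this weighted combination. The weight values and probability inputs below are illustrative placeholders; real weights come out of A/B tests.

```python
# Multi-objective scoring sketch. Weights are illustrative, not tuned.
def rank_score(p_click, p_complete, p_like, p_dissatisfied,
               w=(0.2, 0.5, 0.2, 0.3)):
    """Weighted combination of per-objective model predictions."""
    w_click, w_complete, w_like, w_neg = w
    return (w_click * p_click
            + w_complete * p_complete
            + w_like * p_like
            - w_neg * p_dissatisfied)

# Clickbait: high click probability but poor completion and satisfaction.
clickbait = rank_score(0.9, 0.1, 0.05, 0.4)
# Genuinely good match: moderate clicks, high completion and likes.
good_match = rank_score(0.5, 0.8, 0.3, 0.05)
print(good_match > clickbait)  # True: the balanced objective prefers it
```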
### Stage 3: Re-Ranking (Business Logic)
The final stage applies rules that no ML model should learn implicitly:
- Diversity: Don't show 10 crime dramas in a row. Force category variety.
- Freshness: Boost newly released content to give it initial signal.
- Deduplication: Remove already-watched titles.
- Content policy: Filter restricted content based on user profile.
The output is the final 10-20 titles displayed in the user's feed.
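These rules can be sketched as a single re-ranking pass. The candidate fields (`id`, `genre`, `score`, `is_new`) and the thresholds are illustrative assumptions:

```python
from collections import Counter

# Sketch of the re-ranking rules: freshness boost, dedup, diversity cap.
def rerank(candidates, watched_ids, max_per_genre=3, freshness_boost=0.1,
           feed_size=10):
    """candidates: list of dicts with 'id', 'genre', 'score', 'is_new'."""
    genre_counts = Counter()
    feed = []
    # Freshness: boost newly released titles before sorting.
    scored = sorted(candidates,
                    key=lambda c: c["score"] + (freshness_boost if c["is_new"] else 0),
                    reverse=True)
    for c in scored:
        if c["id"] in watched_ids:                      # deduplication
            continue
        if genre_counts[c["genre"]] >= max_per_genre:   # diversity
            continue
        genre_counts[c["genre"]] += 1
        feed.append(c)
        if len(feed) == feed_size:
            break
    return feed

demo = [
    {"id": 1, "genre": "crime", "score": 0.9, "is_new": False},
    {"id": 2, "genre": "crime", "score": 0.8, "is_new": False},
    {"id": 3, "genre": "comedy", "score": 0.5, "is_new": True},
]
print([c["id"] for c in rerank(demo, watched_ids={2})])  # [1, 3]
```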
*Figure: Funnel architecture showing items narrowing from 10M through retrieval, ranking, and re-ranking stages*

## Two-Tower Models and Embedding Retrieval
The two-tower model is the workhorse of modern retrieval. It encodes users and items into the same dense vector space, where proximity indicates relevance. The concept builds directly on the principles covered in Text Embeddings Explained.
### Architecture
One neural network (the user tower) processes user features (watch history, demographics, session context) into a d-dimensional vector. A separate network (the item tower) processes item features (title, genre, cast, description) into a vector of the same dimensionality.
The similarity score between a user and item is their dot product:
$$s(u, i) = \mathbf{u} \cdot \mathbf{i} = \sum_{k=1}^{d} u_k\, i_k$$

Where:
- $\mathbf{u} \in \mathbb{R}^d$ is the user embedding vector
- $\mathbf{i} \in \mathbb{R}^d$ is the item embedding vector
- $d$ is the embedding dimension (typically 64-256)
**In Plain English:** Both towers map their inputs into a shared geometric space. A user who loves Christopher Nolan thrillers ends up near Inception, Tenet, and The Prestige in this space. The dot product measures alignment: if both vectors point in the same direction, the score is high.
The critical advantage: item embeddings are computed offline and indexed. Only the user tower runs at request time, producing a query vector that searches against the pre-built index. This is what makes retrieval fast.
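A minimal sketch of this serving pattern, with random vectors standing in for trained towers and the `user_tower` body left as a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 64, 10_000

# Item embeddings are computed offline by the item tower and indexed once.
item_index = rng.normal(size=(n_items, d)).astype(np.float32)

def user_tower(features):
    # Placeholder for the online user network: returns a d-dim query vector.
    return rng.normal(size=d).astype(np.float32)

def retrieve(features, k=500):
    query = user_tower(features)             # the only model call per request
    scores = item_index @ query              # dot-product relevance, all items
    top_k = np.argpartition(-scores, k)[:k]  # unordered top-k
    return top_k[np.argsort(-scores[top_k])] # sorted best-first

candidates = retrieve({"watch_history": ["t1", "t9"]}, k=500)
print(len(candidates))  # 500
```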
### Training with Negative Sampling
The model trains on positive pairs (user watched a title) and negative pairs (user did not interact). Without negatives, the model collapses to predicting everything as relevant.
| Strategy | Description | Trade-off |
|---|---|---|
| Random Negatives | Sample random items as negatives | Easy to implement, but too easy for the model |
| In-Batch Negatives | Other users' positives become your negatives | Free compute, industry standard at Google and Pinterest |
| Hard Negatives | Items the user almost engaged with but didn't | Strong learning signal, can destabilize training |
| Mixed (80/20) | 80% random + 20% hard negatives | Best balance in practice |
**Pro Tip:** In-batch negatives introduce popularity bias because frequently interacted items appear more often as negatives. Correct for this by log-scaling the sampling probability: items with more interactions get downweighted as negatives.
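A sketch of an in-batch softmax loss with this logQ correction applied. Shapes, the toy counts, and the function name are illustrative:

```python
import numpy as np

def in_batch_loss(user_emb, item_emb, item_counts, total_interactions):
    """user_emb, item_emb: (B, d) paired positives; row i matches row i.
    item_counts: how often each in-batch item appears in the training log."""
    logits = user_emb @ item_emb.T                     # (B, B) similarities
    log_q = np.log(item_counts / total_interactions)   # sampling probability
    logits = logits - log_q[None, :]                   # logQ correction
    # Softmax cross-entropy with the diagonal as the positive class.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
B, d = 8, 16
loss = in_batch_loss(rng.normal(size=(B, d)), rng.normal(size=(B, d)),
                     item_counts=rng.integers(1, 1000, size=B),
                     total_interactions=10_000)
print(float(loss) > 0)  # cross-entropy is non-negative
```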
### Approximate Nearest Neighbor Search
With 10 million item vectors indexed, brute-force search ($O(N \cdot d)$ per query) is too slow. Production systems use Approximate Nearest Neighbor (ANN) algorithms that trade a small accuracy loss for sublinear search time.
| Technology | Latency | Best For |
|---|---|---|
| Faiss (Meta) | < 1ms | Self-hosted, billions of vectors, best performance |
| ScaNN (Google) | < 1ms | CPU-optimized, excellent accuracy/speed trade-off |
| Milvus / Pinecone | ~5-10ms | Managed services, faster time-to-production |
HNSW (Hierarchical Navigable Small World) is the most common index type. It builds a multi-layer graph where the top layers provide coarse navigation (highways) and bottom layers provide fine-grained search (local roads). Pinterest uses HNSW to search billions of image embeddings in milliseconds.
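To make the accuracy-for-speed trade concrete, here is a toy IVF-style (cluster-and-probe) search. This is a simpler scheme than HNSW, but it exposes the same knob: probing more clusters raises both recall and cost.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_clusters = 5_000, 32, 50
vectors = rng.normal(size=(n, d)).astype(np.float32)

# "Train" the index: pick coarse centroids, assign each vector to one.
centroids = vectors[rng.choice(n, n_clusters, replace=False)]
assignments = np.argmax(vectors @ centroids.T, axis=1)

def ann_search(query, k=10, n_probe=5):
    # Probe only the n_probe closest clusters: the accuracy/speed knob.
    probe = np.argsort(-(centroids @ query))[:n_probe]
    ids = np.where(np.isin(assignments, probe))[0]
    scores = vectors[ids] @ query
    return ids[np.argsort(-scores)[:k]]

query = rng.normal(size=d).astype(np.float32)
approx = set(ann_search(query, k=10))
exact = set(np.argsort(-(vectors @ query))[:10])
print(f"{len(approx & exact)}/10 of the true top-10 recovered")
```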
## The Feature Store: Why Pre-Computation Wins
During a 200ms request, you cannot compute "user's average watch time over 30 days" from raw logs. The feature store solves this by pre-computing aggregate features offline and serving them from Redis at request time.
| Feature Category | Example | Update Frequency |
|---|---|---|
| User long-term | Top 3 genres, average session length | Hourly batch |
| User short-term | Last 5 watched titles, current session genre mix | Near real-time via Kafka |
| Item static | Genre, cast, director, release year | On content ingestion |
| Item dynamic | Trending score, recent CTR, completion rate | Hourly batch |
| Cross features | User-genre affinity score, time-of-day preferences | Hourly batch |
**Common Pitfall:** The most frequent production failure isn't a bad model. It's a broken data pipeline. A stalled Kafka consumer means stale features, which means yesterday's preferences drive today's recommendations. Monitor feature freshness as aggressively as model accuracy.
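A minimal sketch of reading features with a freshness check, using a plain dict in place of Redis; the key names and SLA values are assumptions:

```python
import time

# Freshness SLAs per feature category (values are assumptions).
FRESHNESS_SLA = {"user_short_term": 300, "user_long_term": 7200}  # seconds

# A plain dict standing in for Redis.
store = {
    "user:42:short_term": {"value": {"last_watched": ["t1", "t2"]},
                           "updated_at": time.time() - 60},
    "user:42:long_term":  {"value": {"top_genres": ["sci-fi"]},
                           "updated_at": time.time() - 4 * 3600},
}

def get_features(user_id, category):
    """Return (features, is_fresh); callers should alert on stale reads."""
    key = f"user:{user_id}:{category.removeprefix('user_')}"
    entry = store.get(key)
    if entry is None:
        return None, False
    fresh = time.time() - entry["updated_at"] <= FRESHNESS_SLA[category]
    return entry["value"], fresh

short, short_fresh = get_features(42, "user_short_term")
long_, long_fresh = get_features(42, "user_long_term")
print(short_fresh, long_fresh)  # True False: the long-term pipeline is stale
```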
This infrastructure pattern mirrors what Retrieval-Augmented Generation (RAG) systems use: pre-computed embeddings stored in vector databases for fast retrieval at inference time.
## Cold Start Strategies
Cold start is the Achilles' heel of collaborative filtering. New users have no watch history. New titles have no engagement data. Both need special handling.
### New User Strategies
| Strategy | How It Works | When to Use |
|---|---|---|
| Global Popularity | Show trending titles | Default fallback, always works |
| Demographic Similarity | Match users by age, region, language | When registration data is available |
| Onboarding Quiz | Ask explicit genre/mood preferences | When user friction is acceptable |
| Exploration via Bandits | Thompson Sampling balances showing known-good vs. discovering preferences | Best theoretical approach, used at Netflix and Spotify |
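A minimal Thompson Sampling sketch for the bandit row above. The titles and their true click rates are simulated; the bandit only sees clicks:

```python
import random

random.seed(7)
# Simulated true click rates, unknown to the bandit.
true_ctr = {"trending_a": 0.12, "trending_b": 0.30, "niche_c": 0.05}
stats = {t: {"wins": 0, "losses": 0} for t in true_ctr}

def pick():
    # Sample a plausible CTR from each title's Beta posterior, show the max.
    samples = {t: random.betavariate(s["wins"] + 1, s["losses"] + 1)
               for t, s in stats.items()}
    return max(samples, key=samples.get)

for _ in range(2000):
    title = pick()
    clicked = random.random() < true_ctr[title]
    stats[title]["wins" if clicked else "losses"] += 1

impressions = {t: s["wins"] + s["losses"] for t, s in stats.items()}
print(max(impressions, key=impressions.get))  # converges on the best title
```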
### New Item Strategies
| Strategy | How It Works | When to Use |
|---|---|---|
| Content-Based Features | Generate embeddings from title, description, cast using the item tower | Immediate, no interactions needed |
| Creator Transfer | New title inherits its director's/studio's audience profile | Works well for sequels and established creators |
| Exploration Budget | Reserve 5-10% of impressions for new content | Gathers signal fast, slight quality trade-off |
**Key Insight:** Cold start isn't one problem; it's a decision tree. A new user watching a new item is the hardest case, requiring pure content-based signals. A returning user encountering a new item can blend collaborative and content approaches. Design your fallback chain explicitly.
## Evaluation: Offline Metrics and Online A/B Tests
You cannot improve what you cannot measure. Recommendation systems require both offline evaluation (before deployment) and online A/B testing (after deployment). For a deeper treatment of evaluation methodology, see The Ultimate Guide to ML Metrics.
### Offline Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Recall@K | Of all relevant items, what fraction appears in the top K? | > 0.5 for retrieval |
| NDCG@K | Ranking quality, penalizing relevant items placed too low | > 0.7 for ranking |
| AUC | Click prediction accuracy across all thresholds | > 0.8 |
| Coverage | What fraction of the catalog gets recommended? | > 30% (prevents popularity collapse) |
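Recall@K and NDCG@K (with binary relevance) can be implemented in a few lines:

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """DCG of the ranking divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

recs = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(recall_at_k(recs, relevant, 5))           # 2 of 3 relevant retrieved
print(round(ndcg_at_k(recs, relevant, 5), 3))   # 0.704
```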
### Online Metrics
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Watch Time | Total minutes consumed | North star for video platforms |
| D7 Retention | Users returning after 7 days | Long-term platform health |
| Diversity | Unique categories per feed | Prevents filter bubbles |
| CTR | Click-through rate | Leading indicator, but gameable with clickbait |
**Common Pitfall:** High offline NDCG does not guarantee improved online watch time. The offline evaluation uses historical data where user behavior was shaped by the old model. Always A/B test. Netflix maintains long-term holdout groups even after shipping a change, and they distrust suspiciously large effect sizes, which usually indicate bugs rather than breakthroughs.
## LLM-Augmented Recommendations (March 2026)
The integration of large language models into recommendation systems is the most significant architectural shift since two-tower models. As of early 2026, three patterns have emerged in production systems.
LLMs as item encoders. Instead of training custom content encoders from scratch, teams are feeding item metadata (titles, descriptions, reviews) through pre-trained LLM embedding models. The resulting vectors capture semantic nuance that tag-based systems miss entirely. A documentary about "the psychological toll of competitive chess" and one about "mental health in professional gaming" land close together in embedding space, even though they share zero tags. Research frameworks like LEARN employ LLM-powered user and item encoders within two-tower structures for richer representations.
Conversational discovery. Spotify's DJ feature and Netflix's experimental chat interfaces let users describe what they want in natural language: "something light and funny for a rainy Sunday." An LLM interprets intent and maps it to retrieval queries. This complements the algorithmic feed rather than replacing it.
Reasoning over user history. LLMs can analyze a user's watch history and generate explanations: "Because you watched three space documentaries this week, you might enjoy this new series about the Mars mission." This adds interpretability that pure embedding models lack.
**Pro Tip:** LLMs are too slow and expensive for the retrieval stage (50K QPS on 10M items). The production pattern is using traditional two-tower models for high-throughput retrieval, then using LLMs for re-ranking the final 20-50 candidates or powering conversational features that sit alongside the main feed.
*Figure: LLM-augmented recommendation pipeline showing traditional retrieval with LLM-powered re-ranking and conversational layer*

## When to Use Each Approach
Not every product needs the full architecture described above. The right design depends on your scale and signal density.
| Scale | Signal Density | Recommended Architecture |
|---|---|---|
| < 100K users | Low (e-commerce, infrequent visits) | Content-based filtering + popularity baseline |
| 100K - 10M users | Medium | Two-tower retrieval + lightweight ranker (XGBoost or small DNN) |
| 10M - 100M users | High (streaming, social) | Full funnel with ANN retrieval, deep ranker, re-ranking rules |
| 100M+ users | Very high (TikTok-scale) | Add real-time training (TikTok's Monolith), sequential models (SASRec), MMoE multi-objective ranking |
### When NOT to Use Collaborative Filtering
Collaborative filtering breaks down in specific scenarios. Recognizing these saves months of engineering:
- Extreme sparsity. If 99.9%+ of user-item pairs have no interaction, matrix factorization produces meaningless embeddings. Switch to content-based.
- Rapidly changing catalog. If items expire quickly (news, flash sales), item embeddings go stale before enough users interact. Use recency-weighted features instead.
- Privacy constraints. If you cannot store user interaction history, collaborative filtering is impossible. Content-based with session-only signals is the alternative.
## Conclusion
Building a recommendation system that works at scale comes down to one architectural insight: separate fast approximate retrieval from slow precise ranking. The two-tower model handles retrieval by encoding users and items into the same vector space, ANN indexes like HNSW make that search sub-millisecond, and deep rankers like Wide & Deep score the surviving candidates against multiple business objectives.
The infrastructure matters as much as the algorithms. A feature store (Redis) pre-computes user signals so the serving layer never touches raw logs. Kafka ingests billions of daily events that flow through Spark into training pipelines. Model serving on GPU clusters (Triton) handles 50K QPS with tight latency budgets. When something breaks in production, it's almost always the data pipeline, not the model.
For readers building their first system, start with the feature selection fundamentals that underpin effective feature engineering, then move to cross-validation techniques for evaluating your models offline before committing to A/B tests. The gap between a demo and a production system is mostly infrastructure and monitoring, not smarter algorithms.
## Frequently Asked Interview Questions
**Q: Why do recommendation systems use a multi-stage funnel instead of a single model?**
Scoring 10 million items with a complex deep neural network would take seconds, far exceeding the 200ms latency budget. The funnel uses a fast, lightweight retrieval model to narrow 10 million items to ~500 candidates, then applies an expensive ranking model only to those survivors. This is the same trade-off between recall and precision you see in search engines.
**Q: Explain the two-tower architecture and why the towers are separate.**
The user tower and item tower are separate neural networks that produce embeddings in the same vector space. Separation is key because item embeddings can be computed offline and indexed once, while the user tower runs online to capture current session context. This asymmetry makes retrieval fast: you only compute one user vector per request and search against a pre-built index.
**Q: How would you handle a new user with no watch history?**
Start with global popularity (trending titles), then layer in demographic signals (region, language, device) if available. Use Thompson Sampling to balance exploitation of known-good content with exploration to learn the new user's preferences. After 5-10 interactions, switch to the standard collaborative filtering pipeline. The key is having an explicit fallback chain, not a single strategy.
**Q: What's the difference between in-batch negatives and hard negatives?**
In-batch negatives reuse other users' positive examples as your negatives within the same training batch. It's computationally free and works well at scale (used by Google, YouTube, Pinterest). Hard negatives are items the user almost engaged with but didn't, providing a stronger learning signal but potentially destabilizing training. Most production systems use a mix: 80% in-batch, 20% hard negatives.
**Q: A model improved offline NDCG by 5% but online watch time dropped. What happened?**
Offline metrics use historical data shaped by the previous model's behavior, creating a distribution mismatch with live traffic. The new model might rank items that look good on paper but perform poorly when actually shown to users. This is why A/B testing is non-negotiable. Also check for position bias: the offline evaluation might not account for users' tendency to click whatever appears first.
**Q: How do you prevent filter bubbles in a recommendation system?**
Apply diversity constraints in the re-ranking stage, forcing the final feed to include titles from at least N different categories. Use exploration budgets (5-10% of impressions reserved for diverse content). Monitor coverage metrics: if 1% of your catalog gets 90% of impressions, popularity bias is dominating. Some teams also use Determinantal Point Processes (DPPs) to maximize diversity while maintaining relevance.
**Q: Where do LLMs fit in a recommendation pipeline as of 2026?**
LLMs are too slow and expensive for retrieval (50K QPS). They fit in three places: as item encoders that generate richer content embeddings offline, as re-rankers that score the final 20-50 candidates with deeper contextual understanding, and as conversational interfaces for natural language discovery ("something funny for a rainy Sunday"). The two-tower retrieval and deep ranking models still handle the high-throughput core.
**Q: How would you estimate the infrastructure cost for a 100M-user recommendation system?**
The major cost drivers are GPU model serving ($5K-10K/month for Triton clusters), event ingestion ($3K-5K/month for Kafka), and the feature store ($2K-4K/month for Redis). Total is roughly $17K-37K/month on AWS. The biggest optimization lever is model quantization (FP32 to INT8), which cuts GPU serving costs by 50-70%. At our scale, self-hosted Faiss saves ~$30K/month over managed vector databases like Pinecone.