The difference between a successful platform and a ghost town often comes down to one algorithm: the recommendation system. Netflix estimates its recommendation engine saves $1 billion annually by reducing churn. TikTok built a global empire not on social graphs, but on an algorithmic feed that knows you better than you know yourself.
But building a system that scales to 100 million users while returning results in under 200 milliseconds is a massive engineering challenge. It requires balancing mathematical rigor with heavy infrastructure constraints.
This guide moves beyond simple collaborative filtering tutorials. We will design a production-grade recommendation system capable of serving video content at the scale of YouTube or Netflix. We will cover the "funnel" architecture, vector databases, two-tower neural networks, and the critical trade-offs that senior engineers face daily.
1. Clarifying Requirements
In a system design interview or a real-world scoping meeting, you never start coding immediately. You start by defining the boundaries.
Interviewer: "Design a recommendation system for a video streaming platform."
Candidate: "Let's narrow the scope. Are we designing the homepage feed (discovery) or the 'Up Next' sidebar (related content)?"
Interviewer: "Focus on the homepage feed. We want users to discover new content they'll love."
Candidate: "What is the scale? Are we a startup or a tech giant?"
Interviewer: "Assume we are at scale. 100 million daily active users (DAU) and 10 million videos."
Candidate: "What are the latency constraints? Does the feed update in real-time as I click things?"
Interviewer: "Latency must be under 200ms. The feed should refresh near real-time, but slightly delayed updates (minutes) are acceptable for user profiles."
Candidate: "How do we handle new users with no watch history?"
Interviewer: "Good question. We need a fallback strategy for cold start scenarios."
System Requirements Summary
Functional Requirements:
- Generate Feed: Users see a personalized list of videos upon login.
- Record Interactions: System tracks clicks, likes, watches, and "not interested" signals.
- Handle Cold Start: System must handle new users and new videos gracefully.
Non-Functional Requirements (The Numbers):
- DAU: 100 Million.
- Content Library: 10 Million videos.
- QPS (Queries Per Second): 100M users × 5 visits/day ≈ 500M feed requests/day, or roughly 6,000 QPS on average; with peaks several times higher, budget for ~50,000 QPS.
- Latency: P99 latency < 200ms.
- Availability: 99.99% (The feed must always load, even if recommendations are slightly stale).
- Storage:
- Metadata: 10M videos × 1KB ≈ 10GB (Small).
- Interaction Logs: 100M users × 50 events/day × 100 bytes ≈ 500GB/day. This is the heavy lifter.
🎯 Interview Tip: Always calculate QPS and storage on the whiteboard. For 500GB/day, you immediately know you need a distributed log system like Kafka, not a direct database write.
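If you want to sanity-check those numbers, here is a quick back-of-envelope script. It only restates the arithmetic above; the 8× peak-to-average ratio is an assumption, not something the requirements specify.

```python
# Back-of-envelope check of the traffic and storage numbers above.
DAU = 100_000_000            # daily active users
VISITS_PER_USER = 5          # feed loads per user per day (from the requirements)
EVENTS_PER_USER = 50         # logged interactions per user per day
EVENT_SIZE_BYTES = 100       # average size of one event record
SECONDS_PER_DAY = 86_400

avg_qps = DAU * VISITS_PER_USER / SECONDS_PER_DAY
peak_qps = avg_qps * 8                       # assumed peak-to-average ratio of ~8x
log_gb_per_day = DAU * EVENTS_PER_USER * EVENT_SIZE_BYTES / 1e9

print(f"Average QPS:  {avg_qps:,.0f}")       # ~5,800
print(f"Peak QPS:     {peak_qps:,.0f}")      # ~46,000 -> budget for 50,000
print(f"Logs per day: {log_gb_per_day:,.0f} GB")  # 500 GB/day
```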
2. High-Level Architecture
At 50,000 QPS, you cannot simply query a database for "videos user X likes." You need a decoupled architecture that separates Online Serving (fast) from Offline Training (slow).
The Recommendation Service is the brain. It doesn't do the heavy lifting itself; it orchestrates the retrieval and ranking steps.
3. Data Model & Storage
Data is the fuel for recommendation engines. We need to store three types of data efficiently.
1. User & Video Metadata (Relational)
We need fast random access to profile data and video details. Database: PostgreSQL or Cassandra (if write-heavy).
- User: user_id, age, location, language, device_type
- Video: video_id, title, tags, duration, uploader_id, upload_time
2. User Interaction Logs (Time-Series / Stream)
Every click, scroll, and hover is a signal. Storage: Apache Kafka (ingestion) → Parquet on S3 (long-term).
{
"event_id": "evt_98765",
"user_id": "u_12345",
"video_id": "v_555",
"event_type": "watch_75_percent",
"timestamp": 1672531200,
"context": {"connection": "4g", "device": "iphone13"}
}
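As a rough sketch of the ingestion side, here is how such an event might be published with the kafka-python client. The broker address and topic name are placeholders, not values from this design.

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

# Placeholder broker address; value_serializer turns the dict into JSON bytes.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_id": "evt_98765",
    "user_id": "u_12345",
    "video_id": "v_555",
    "event_type": "watch_75_percent",
    "timestamp": int(time.time()),
    "context": {"connection": "4g", "device": "iphone13"},
}

# Key by user_id so all of a user's events land in the same partition (preserves ordering).
producer.send("user-interactions", key=event["user_id"].encode(), value=event)
producer.flush()
```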
3. Derived Features (Key-Value Store)
During inference, we can't compute "average watch time last 30 days" on the fly. We pre-compute these features. Storage: Redis or Cassandra.
- Key: u_12345
- Value: {"avg_watch_time": 450, "top_genre": "scifi", "recent_clicks": [...]}
⚠️ Common Pitfall: Beginners often query the raw interaction logs during the recommendation request. This kills latency. Always use a Feature Store (like Redis) with pre-aggregated stats for real-time inference.
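A minimal sketch of the request-time lookup against such a feature store, using redis-py. The key naming scheme and the default values for unseen users are assumptions.

```python
import json
import redis  # pip install redis

r = redis.Redis(host="feature-store", port=6379, decode_responses=True)

def get_user_features(user_id: str) -> dict:
    """Single key lookup against pre-aggregated features; never touch raw logs here."""
    raw = r.get(f"user_features:{user_id}")   # one round trip, ~1ms
    if raw is None:
        # Brand-new user: fall back to neutral defaults (see Cold Start, Section 9).
        return {"avg_watch_time": 0, "top_genre": None, "recent_clicks": []}
    return json.loads(raw)

features = get_user_features("u_12345")
```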
4. The Funnel Architecture (Serving Pipeline)
This is the most critical concept in recommendation system design. You cannot rank 10 million videos for every user request—it's computationally impossible. Instead, we use a multi-stage funnel.
Stage 1: Retrieval (Candidate Generation)
Goal: Quickly eliminate 99.99% of irrelevant videos. Method: Collaborative Filtering, Matrix Factorization, or Two-Tower Embeddings. Output: ~500 raw candidates.
Stage 2: Ranking (Scoring)
Goal: Precision. Order the 500 candidates by probability of engagement. Method: Complex Deep Learning models (e.g., Wide & Deep, DLRM) using dense features. Output: Sorted list of 500 videos with scores.
Stage 3: Re-Ranking (Business Logic)
Goal: Optimization and Policy. Logic:
- Diversity: Don't show 10 cat videos in a row.
- Freshness: Boost newly uploaded content.
- Fairness: Ensure creator diversity.
- Deduplication: Remove videos already watched. Output: Final 10-20 videos sent to the user.
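Putting the three stages together, the serving path looks roughly like the sketch below. Every function body is a stub standing in for the real component (ANN search, the ranking model, business rules); only the shape of the pipeline is the point.

```python
from typing import List, Tuple

def retrieve_candidates(user_features: dict, k: int = 500) -> List[str]:
    """Stage 1: ANN search over precomputed video embeddings (stubbed)."""
    return [f"v_{i}" for i in range(k)]

def rank_candidates(user_features: dict, candidates: List[str]) -> List[Tuple[str, float]]:
    """Stage 2: heavy model scores only the ~500 retrieved candidates (stubbed)."""
    scored = [(vid, 1.0 / (i + 1)) for i, vid in enumerate(candidates)]
    return sorted(scored, key=lambda x: x[1], reverse=True)

def rerank(scored: List[Tuple[str, float]], watched: set) -> List[Tuple[str, float]]:
    """Stage 3: business rules; here just deduplication against watch history."""
    return [(vid, s) for vid, s in scored if vid not in watched]

def get_feed(user_features: dict, watched: set, limit: int = 10) -> List[str]:
    candidates = retrieve_candidates(user_features, k=500)   # 10M -> ~500
    scored = rank_candidates(user_features, candidates)      # 500, precisely ordered
    final = rerank(scored, watched)                          # diversity / freshness / dedup
    return [vid for vid, _ in final[:limit]]                 # 10-20 to the client

feed = get_feed({"top_genre": "scifi"}, watched={"v_0", "v_3"})
```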
5. Core Algorithm: The Two-Tower Model
Modern retrieval uses "Two-Tower" Neural Networks to create embeddings. An embedding is a vector (a list of numbers, e.g., [0.1, -0.5, 0.8]) that represents the "essence" of a user or video.
Architecture
The Math
The similarity score is usually the dot product of the user vector $\mathbf{u}$ and video vector $\mathbf{v}$:

$$s(u, v) = \mathbf{u} \cdot \mathbf{v} = \sum_{i} u_i v_i$$

To turn this into a probability (e.g., probability of a click), we often use the Sigmoid function:

$$P(\text{click}) = \sigma(s) = \frac{1}{1 + e^{-s}}$$
In Plain English: The "Two Towers" effectively map users and videos into the same geometric space. The model learns to place a user close to the videos they will like. The dot product is just a mathematical ruler measuring "how aligned are these two vectors?" If the user vector points North and the video vector points North, the score is high. If they point in opposite directions, the score is low.
🎯 Interview Tip: Why two separate towers? Because the Video Tower can be run offline. We compute vectors for all 10M videos once and store them. The User Tower runs online (real-time) to capture their latest mood, producing a query vector to search against the stored video vectors.
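Here is a minimal two-tower sketch in PyTorch. The input dimensions and layer sizes are arbitrary illustration choices, not values from this design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Maps raw features to a unit-length embedding; both towers share the output dim."""
    def __init__(self, input_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)   # unit-length vectors

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim: int, video_dim: int, embed_dim: int = 64):
        super().__init__()
        self.user_tower = Tower(user_dim, embed_dim)    # runs online, per request
        self.video_tower = Tower(video_dim, embed_dim)  # run offline for all 10M videos

    def forward(self, user_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        u = self.user_tower(user_feats)
        v = self.video_tower(video_feats)
        return (u * v).sum(dim=-1)   # dot-product similarity score

model = TwoTowerModel(user_dim=32, video_dim=48)
scores = model(torch.randn(4, 32), torch.randn(4, 48))  # batch of 4 (user, video) pairs
```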
Training: Loss Function & Negative Sampling
The model learns from positive pairs (user clicked video) and negative pairs (user did NOT click). Without negatives, the model would think every video is relevant.
Loss Function: Binary Cross-Entropy

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \sigma(s_i) + (1 - y_i) \log\left(1 - \sigma(s_i)\right) \right]$$

Where:
- $y_i = 1$ for positive interactions (clicks, watches)
- $y_i = 0$ for negative samples
- $\sigma$ = sigmoid function
- $s_i$ = dot product of user and item embeddings
In Plain English: The loss function penalizes the model when it predicts "high relevance" for videos the user ignored, and "low relevance" for videos the user loved. Over millions of examples, the model learns to distinguish good recommendations from bad ones.
Negative Sampling Strategies
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Random Negatives | Sample random videos as negatives | Simple, fast | Too easy—model doesn't learn edge cases |
| Hard Negatives | Use videos user almost clicked but didn't | Strong learning signal | Can destabilize training |
| In-Batch Negatives | Other users' positives become your negatives | Computationally efficient | Popularity bias (popular items over-sampled) |
| Mixed Strategy | Combine 80% random + 20% hard | Balanced learning | More complex to implement |
🎯 Interview Tip: When discussing negative sampling, mention that "in-batch negatives" is the industry standard at scale (used by Google, YouTube, Pinterest) because it's computationally free—you reuse the other positive examples in the same training batch as negatives.
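Here is a sketch of an in-batch-negatives training objective that works with the two towers above. The temperature value is an arbitrary assumption, and the popularity (logQ) correction that production systems typically add is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb: torch.Tensor, video_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """
    user_emb, video_emb: [batch, dim]; row i is a (user, clicked-video) positive pair.
    Every other video in the batch acts as a negative for that user.
    """
    logits = user_emb @ video_emb.T / temperature    # [batch, batch] similarity matrix
    labels = torch.arange(user_emb.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)           # softmax over in-batch candidates

# Usage with the TwoTowerModel sketch:
# loss = in_batch_softmax_loss(model.user_tower(u_feats), model.video_tower(v_feats))
# loss.backward()
```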
Training Pipeline Architecture
How do embeddings get updated as user behavior changes?
Key Decisions:
- User embeddings: Update frequently (hourly or real-time) to capture current session intent.
- Video embeddings: Update daily (content doesn't change as fast as user mood).
- Full retraining: Weekly, to incorporate new videos and decay stale patterns.
6. Deep Dive: Vector Search (The Retrieval Engine)
Once we have 10 million video vectors, how do we find the 500 closest ones to our user?
A standard loop, `for video in all_videos: compute_score()`, is too slow: that is O(N) work per request, i.e., 10 million score computations for every feed load.
We use Approximate Nearest Neighbor (ANN) algorithms.
Technology Choices
| Technology | Latency | Scale | Best For |
|---|---|---|---|
| Faiss (Meta) | <1ms | Billions | High-performance, self-hosted clusters. |
| ScaNN (Google) | <1ms | Billions | State-of-the-art accuracy/speed trade-off on CPUs. |
| Pinecone/Milvus | ~10ms | 100M+ | Managed services if you don't want to maintain infrastructure. |
How It Works (HNSW Index)
HNSW (Hierarchical Navigable Small World) builds a multi-layered graph. It's like a highway system for data.
- Start at the top layer (interstates) to get to the general neighborhood.
- Drop down to local roads to find the exact destination.
- This reduces search complexity from O(N) to roughly O(log N).
📊 Real-World Example: Pinterest uses HNSW to search billions of images. When you click a photo, they convert it to a vector and find visually similar images in milliseconds using this graph traversal.
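A minimal Faiss HNSW sketch is below. The embedding dimension, the graph connectivity parameter, and the random placeholder vectors are assumptions; in production the vectors come from the video tower.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 64                     # embedding dimension from the two-tower model
num_videos = 100_000         # 10M in production; smaller here so it runs locally

# Placeholder embeddings; real ones come from the (offline) video tower.
video_vectors = np.random.rand(num_videos, dim).astype("float32")
faiss.normalize_L2(video_vectors)   # unit length, so L2 ranking matches dot-product ranking

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = HNSW graph connectivity (M)
index.add(video_vectors)

# At request time: encode the user with the user tower, then search.
user_vector = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(user_vector)
distances, candidate_ids = index.search(user_vector, 500)   # top-500 candidates
```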
7. Ranking Architecture (The Precision Layer)
After retrieval gives us ~500 candidates, we need precision. The ranking model is heavier because it only runs on 500 items, not 10 million.
Model: Deep Learning Recommendation Model (DLRM) or Wide & Deep.
- Wide Part: Memorizes specific interactions (e.g., "User 123 clicked Video 456"). Good for exceptions.
- Deep Part: Generalizes patterns (e.g., "Sci-Fi fans usually like Tech reviews"). Good for exploration.
Feature Crossing
We don't just feed raw data; we combine features.
- User_Country × Video_Language (Does the user speak the video's language?)
- Time_of_Day × Video_Category (Cartoons in the morning, Horror at night?)
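A toy sketch of what feature crossing can look like before the crossed ids are fed to the ranker; the field names and the hashing scheme are illustrative only.

```python
import hashlib

def stable_hash(value: str, buckets: int = 100_000) -> int:
    """Deterministic hash into a fixed vocabulary (Python's built-in hash() is salted)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % buckets

def cross_features(user: dict, video: dict, context: dict) -> dict:
    crossed = {
        "country_x_language": f"{user['country']}_x_{video['language']}",
        "hour_x_category": f"{context['hour_of_day']}_x_{video['category']}",
    }
    # The "wide" part of the model typically consumes hashed crosses as sparse ids.
    return {name: stable_hash(val) for name, val in crossed.items()}

ids = cross_features({"country": "BR"}, {"language": "en", "category": "horror"},
                     {"hour_of_day": 23})
```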
🔑 Key Insight: The ranking stage is where you optimize for business objectives. You might have one model predicting "Probability of Click" (CTR) and another predicting "Probability of Completion" (CVR).
If you only optimize for clicks, you get clickbait. If you optimize for watch time, you get quality.
For a deeper dive into the statistical metrics behind these predictions, check out our guide on The Bias-Variance Tradeoff.
8. API Design
The API needs to be simple but informative for the client.
GET /v1/recommendations/feed?user_id=u_12345&limit=10
Response:
{
"items": [
{
"video_id": "v_555",
"title": "System Design Interview Guide",
"thumbnail_url": "...",
"score": 0.98,
"source": "retrieval_algo_v2",
"tracking_token": "token_abc123"
},
...
],
"next_page_token": "page_2_xyz",
"latency_ms": 145
}
🎯 Interview Tip: Always include a tracking_token in the response. When the user clicks the video, the client sends this token back. This allows the backend to join the "Click" event with the exact model version and features that generated the recommendation, which is crucial for debugging and training.
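A minimal FastAPI sketch of this endpoint with the funnel stubbed out; the handler shape and the placeholder data are assumptions, not a reference implementation.

```python
import time
import uuid
from fastapi import FastAPI  # pip install fastapi uvicorn

app = FastAPI()

def get_recommendations(user_id: str, limit: int) -> list:
    """Stub for the funnel from Section 4 (retrieval -> ranking -> re-ranking)."""
    return [{"video_id": f"v_{i}", "title": "placeholder", "thumbnail_url": "...",
             "score": round(0.99 - i * 0.01, 2), "source": "retrieval_algo_v2"}
            for i in range(limit)]

@app.get("/v1/recommendations/feed")
def feed(user_id: str, limit: int = 10):
    start = time.monotonic()
    items = get_recommendations(user_id, limit)
    for item in items:
        item["tracking_token"] = str(uuid.uuid4())  # lets clicks join back to this impression
    return {
        "items": items,
        "next_page_token": None,
        "latency_ms": int((time.monotonic() - start) * 1000),
    }
```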
9. Cold Start Handling
The cold start problem is one of the most common interview questions for recommendation systems. It occurs when we have no interaction history for a user or an item.
New User Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Global Popularity | Show trending/most-watched videos | Simple, always works | No personalization |
| Demographic Similarity | Find similar users by age/location | Some personalization | Privacy concerns |
| Onboarding Quiz | Ask explicit preferences | High-quality signal | User friction |
| Exploration | Show diverse content, learn fast | Builds profile quickly | Initial experience may be poor |
New Item Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Content Features | Use title, tags, thumbnail embeddings | Immediate availability | Ignores taste |
| Creator Transfer | New video inherits creator's audience | Leverages existing data | Unfair to new creators |
| Freshness Boost | Temporarily boost new content score | Ensures visibility | May show low-quality content |
| Exploration Budget | Reserve 5-10% of impressions for new items | Gathers signal fast | Slight quality hit |
🎯 Interview Tip: When asked about cold start, don't just say "use popularity." Show you understand the trade-off: popularity works but creates a "rich get richer" problem. Mention exploration/exploitation (Multi-Armed Bandits, Thompson Sampling) to show depth.
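To illustrate the exploration/exploitation idea, here is a Beta-Bernoulli Thompson Sampling sketch for choosing which new video gets the exploration slot. The priors, video ids, and update rule shown are a simplified assumption, not a production policy.

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonExplorer:
    """One Beta(1, 1) prior per new video; sample a plausible CTR and show the best draw."""
    def __init__(self, video_ids):
        self.alpha = {v: 1.0 for v in video_ids}   # 1 + observed clicks
        self.beta = {v: 1.0 for v in video_ids}    # 1 + observed skips

    def pick(self) -> str:
        samples = {v: rng.beta(self.alpha[v], self.beta[v]) for v in self.alpha}
        return max(samples, key=samples.get)

    def update(self, video_id: str, clicked: bool) -> None:
        if clicked:
            self.alpha[video_id] += 1
        else:
            self.beta[video_id] += 1

explorer = ThompsonExplorer(["v_new_1", "v_new_2", "v_new_3"])
chosen = explorer.pick()               # fill the 5-10% exploration budget with this video
explorer.update(chosen, clicked=False)
```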
10. Evaluation Metrics
You cannot improve what you cannot measure. Recommendation systems require BOTH offline and online metrics.
Offline Metrics (Before Deployment)
These are computed on historical data before pushing a model to production.
| Metric | Formula | What It Measures | Target |
|---|---|---|---|
| Precision@K | (Relevant in Top K) / K | Of the K shown, how many were relevant? | > 0.3 |
| Recall@K | (Relevant in Top K) / Total Relevant | Of ALL relevant items, how many did we find? | > 0.5 |
| NDCG@K | DCG / IDCG | Ranking quality (rewards good items at top positions) | > 0.7 |
| AUC | Area Under ROC | Click prediction accuracy | > 0.8 |
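For concreteness, here is a small sketch of Precision@K and a binary-relevance NDCG@K; the example lists are made up.

```python
import math

def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Of the K items shown, what fraction were relevant?"""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def ndcg_at_k(recommended: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG: rewards placing relevant items near the top."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

recs = ["v_1", "v_2", "v_3", "v_4", "v_5"]
relevant = {"v_2", "v_5", "v_9"}
print(precision_at_k(recs, relevant, k=5))   # 0.4
print(ndcg_at_k(recs, relevant, k=5))        # ~0.48
```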
Online Metrics (A/B Testing in Production)
These are the real business metrics measured after deployment.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| CTR | Click-through rate | Immediate engagement signal |
| Watch Time | Total minutes watched | Quality of recommendations |
| Completion Rate | % of videos watched to end | Content-match quality |
| D1/D7 Retention | Users returning after 1/7 days | Long-term platform health |
| Diversity | Unique categories in feed | Prevents filter bubbles |
⚠️ Common Pitfall: Optimizing only for CTR leads to clickbait. Optimizing only for watch time leads to addictive content. The best systems optimize a weighted combination of objectives, for example score = w₁·P(click) + w₂·E[watch time] + w₃·P(like).
11. Trade-offs & Alternatives
Every design decision has a cost. Here are the major ones for this architecture.
Architectural Decisions
| Decision | Option A | Option B | We Chose | Why |
|---|---|---|---|---|
| Embedding Update | Real-time | Batch (Daily) | Hybrid | User embeddings update near real-time (to catch immediate interests), Video embeddings update daily (content changes slowly). |
| Database | SQL (Postgres) | NoSQL (Cassandra) | Both | Postgres for user metadata (ACID compliance), Cassandra/DynamoDB for interaction logs (massive write throughput). |
| Serving | Pre-compute (Cache) | On-the-fly | On-the-fly | Pre-computing fails for active users whose context changes fast. We only cache the "Head" (popular) queries. |
| Model Format | TensorFlow SavedModel | ONNX | ONNX | Framework-agnostic, works with Triton Inference Server, easier to optimize. |
| Vector DB | Self-hosted Faiss | Managed Pinecone | Faiss on K8s | Cost savings at scale. Pinecone gets expensive at 10M+ vectors. |
Model Architecture Trade-offs
| Approach | Latency | Accuracy | Cold Start | Interpretability |
|---|---|---|---|---|
| Matrix Factorization | Very Fast | Good | Poor | Medium |
| Two-Tower DNN | Fast | Very Good | Good | Low |
| Transformer (Sequential) | Slow | Excellent | Excellent | Very Low |
| Graph Neural Network | Medium | Very Good | Good | Low |
🎯 Interview Tip: When asked "why not use Transformers everywhere?", explain the latency constraint. Transformers are O(n²) in sequence length. For retrieval over 10M items, even a small Transformer per item is prohibitive. We use Transformers only in ranking (500 items) or for encoding text features offline.
Case Study: Netflix vs. TikTok
| Feature | Netflix | TikTok |
|---|---|---|
| Context | "Lean Back" (TV, 2-hour movies) | "Lean Forward" (Mobile, 15s clips) |
| Signal Density | Low (1 movie decision per night) | Extremely High (100 swipes per session) |
| Exploration | Low Risk (Users stick to safe choices) | High Risk (Must show new viral trends instantly) |
| Architecture | Heavy reliance on long-term history | Heavy reliance on immediate session context (RNNs/Transformers) |
Netflix focuses on accuracy because starting a bad 2-hour movie is painful. TikTok focuses on speed and adaptation because skipping a bad 15-second video is costless.
12. Conclusion
Designing a recommendation system at scale is about managing the funnel. You start with millions of items, filter them down with fast approximate algorithms (retrieval), and then carefully rank the survivors with heavy deep learning models.
Key Takeaways for Interviews
1. Always start with requirements. Clarify scale, latency, and cold start before drawing boxes.
2. Separate retrieval from ranking. This is the most important architectural decision. Retrieval prioritizes recall (fast, approximate). Ranking prioritizes precision (slow, accurate).
3. Don't forget cold start. New users and new items break collaborative filtering. Always have a fallback (popularity, content-based features, exploration).
4. Measure both offline AND online metrics. High offline NDCG doesn't guarantee high online watch time. Always A/B test.
5. Negative sampling matters. Without negatives, the model thinks everything is relevant. In-batch negatives are the industry standard.
The "secret sauce" isn't just the algorithm—it's the infrastructure that feeds it:
- Kafka for real-time data ingestion at 500GB/day.
- Vector Databases (Faiss) for sub-millisecond search over 10M embeddings.
- Feature Stores (Redis) to provide instant user context without querying raw logs.
- Model Serving (Triton) for low-latency GPU inference at 50K QPS.
Further Reading
To master the concepts powering these models:
- Model Evaluation: Read our guide on Cross-Validation vs. The "Lucky Split".
- Infrastructure: Check out AWS vs GCP vs Azure for Machine Learning.
- Metrics: Read Why 99% Accuracy Can Be a Disaster.
Practice Problems
-
Trending Now: Design a "Trending Now" section. Does it fit into the main funnel, or is it a separate cache?
-
Diversity vs. Relevance: How would you ensure users see content from at least 5 different categories? Where in the pipeline do you enforce this?
-
Creator Fairness: Small creators complain their videos never get recommended. How would you address this without hurting engagement metrics?
-
Real-Time Personalization: A user just watched 3 cooking videos in a row. How fast can you update their recommendations?
These questions test whether you understand not just the "what" but the "why" behind each architectural decision.