How to Design a Recommendation System That Actually Works

LDS Team
Let's Data Science

The difference between a successful platform and a ghost town often comes down to one algorithm: the recommendation system. Netflix estimates its recommendation engine saves $1 billion annually by reducing churn. TikTok built a global empire not on social graphs, but on an algorithmic feed that knows you better than you know yourself.

But building a system that scales to 100 million users while returning results in under 200 milliseconds is a massive engineering challenge. It requires balancing mathematical rigor with heavy infrastructure constraints.

This guide moves beyond simple collaborative filtering tutorials. We will design a production-grade recommendation system capable of serving video content at the scale of YouTube or Netflix. We will cover the "funnel" architecture, vector databases, two-tower neural networks, and the critical trade-offs that senior engineers face daily.

1. Clarifying Requirements

In a system design interview or a real-world scoping meeting, you never start coding immediately. You start by defining the boundaries.

Interviewer: "Design a recommendation system for a video streaming platform."

Candidate: "Let's narrow the scope. Are we designing the homepage feed (discovery) or the 'Up Next' sidebar (related content)?"

Interviewer: "Focus on the homepage feed. We want users to discover new content they'll love."

Candidate: "What is the scale? Are we a startup or a tech giant?"

Interviewer: "Assume we are at scale. 100 million daily active users (DAU) and 10 million videos."

Candidate: "What are the latency constraints? Does the feed update in real-time as I click things?"

Interviewer: "Latency must be under 200ms. The feed should refresh near real-time, but slightly delayed updates (minutes) are acceptable for user profiles."

Candidate: "How do we handle new users with no watch history?"

Interviewer: "Good question. We need a fallback strategy for cold start scenarios."

System Requirements Summary

Functional Requirements:

  • Generate Feed: Users see a personalized list of videos upon login.
  • Record Interactions: System tracks clicks, likes, watches, and "not interested" signals.
  • Handle Cold Start: System must handle new users and new videos gracefully.

Non-Functional Requirements (The Numbers):

  • DAU: 100 Million.
  • Content Library: 10 Million videos.
  • QPS (Queries Per Second): 100M DAU × 5 visits/user/day ≈ 500M requests/day, or roughly 6,000 QPS on average. Traffic is not evenly distributed, so design for a peak of ~50,000 QPS.
  • Latency: P99 latency < 200ms.
  • Availability: 99.99% (The feed must always load, even if recommendations are slightly stale).
  • Storage:
    • Metadata: 10M videos × 1KB ≈ 10GB (Small).
    • Interaction Logs: 100M users × 50 events/day × 100 bytes ≈ 500GB/day. This is the heavy lifter.

🎯 Interview Tip: Always calculate QPS and storage on the whiteboard. For 500GB/day, you immediately know you need a distributed log system like Kafka, not a direct database write.
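
The arithmetic behind these numbers is simple enough to script as a sanity check. A minimal back-of-envelope sketch; the visit count, event rate, record size, and peak factor are the assumptions stated above:

python
# Back-of-envelope estimates for the requirements above.
DAU = 100_000_000          # daily active users
VISITS_PER_USER = 5        # assumed visits per user per day
EVENTS_PER_USER = 50       # assumed interaction events per user per day
EVENT_SIZE_BYTES = 100     # assumed average log record size
PEAK_FACTOR = 8            # assumed ratio of peak to average traffic

requests_per_day = DAU * VISITS_PER_USER
avg_qps = requests_per_day / 86_400
peak_qps = avg_qps * PEAK_FACTOR
log_bytes_per_day = DAU * EVENTS_PER_USER * EVENT_SIZE_BYTES

print(f"Average QPS: {avg_qps:,.0f}")                          # ~5,800
print(f"Peak QPS:    {peak_qps:,.0f}")                          # ~46,000, round up to ~50,000
print(f"Log volume:  {log_bytes_per_day / 1e9:,.0f} GB/day")    # ~500 GB/day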

2. High-Level Architecture

At 50,000 QPS, you cannot simply query a database for "videos user X likes." You need a decoupled architecture that separates Online Serving (fast) from Offline Training (slow).

The Recommendation Service is the brain. It doesn't do the heavy lifting itself; it orchestrates the retrieval and ranking steps.

3. Data Model & Storage

Data is the fuel for recommendation engines. We need to store three types of data efficiently.

1. User & Video Metadata (Relational)

We need fast random access to profile data and video details. Database: PostgreSQL or Cassandra (if write-heavy).

  • User: user_id, age, location, language, device_type.
  • Video: video_id, title, tags, duration, uploader_id, upload_time.

2. User Interaction Logs (Time-Series / Stream)

Every click, scroll, and hover is a signal. Storage: Apache Kafka (ingestion) → Parquet on S3 (long-term).

json
{
  "event_id": "evt_98765",
  "user_id": "u_12345",
  "video_id": "v_555",
  "event_type": "watch_75_percent",
  "timestamp": 1672531200,
  "context": {"connection": "4g", "device": "iphone13"}
}
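
Events like this are produced asynchronously so the request path never blocks on a log write. A minimal sketch using the kafka-python client; the broker address and topic name are assumptions:

python
import json
import time
from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],                        # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # serialize events as JSON
)

event = {
    "event_id": "evt_98765",
    "user_id": "u_12345",
    "video_id": "v_555",
    "event_type": "watch_75_percent",
    "timestamp": int(time.time()),
    "context": {"connection": "4g", "device": "iphone13"},
}

# Fire-and-forget: the send is buffered and batched by the producer.
producer.send("user-interactions", value=event)  # assumed topic name
producer.flush()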

3. Derived Features (Key-Value Store)

During inference, we can't compute "average watch time last 30 days" on the fly. We pre-compute these features. Storage: Redis or Cassandra.

  • Key: u_12345
  • Value: {"avg_watch_time": 450s, "top_genre": "scifi", "recent_clicks": [...]}

⚠️ Common Pitfall: Beginners often query the raw interaction logs during the recommendation request. This kills latency. Always use a Feature Store (like Redis) with pre-aggregated stats for real-time inference.
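
A minimal sketch of the feature-store pattern using redis-py, with a key layout matching the example above; the host, key prefix, and TTL are assumptions:

python
import json
import redis  # assumes the redis-py package

r = redis.Redis(host="feature-store", port=6379)  # assumed host

def write_user_features(user_id: str, features: dict, ttl_s: int = 86_400) -> None:
    # Batch job: overwrite the pre-aggregated profile and expire stale entries after a day.
    r.set(f"features:{user_id}", json.dumps(features), ex=ttl_s)

def read_user_features(user_id: str) -> dict:
    # Serving path: one O(1) key lookup instead of scanning raw interaction logs.
    raw = r.get(f"features:{user_id}")
    return json.loads(raw) if raw else {}

write_user_features("u_12345", {"avg_watch_time": 450, "top_genre": "scifi"})
print(read_user_features("u_12345"))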

4. The Funnel Architecture (Serving Pipeline)

This is the most critical concept in recommendation system design. You cannot rank 10 million videos for every user request; running the full ranking model over the entire catalog is computationally infeasible within the latency budget. Instead, we use a multi-stage funnel.

Stage 1: Retrieval (Candidate Generation)

Goal: Quickly eliminate 99.99% of irrelevant videos.
Method: Collaborative Filtering, Matrix Factorization, or Two-Tower Embeddings.
Output: ~500 raw candidates.

Stage 2: Ranking (Scoring)

Goal: Precision. Order the 500 candidates by probability of engagement.
Method: Complex Deep Learning models (e.g., Wide & Deep, DLRM) using dense features.
Output: Sorted list of 500 videos with scores.

Stage 3: Re-Ranking (Business Logic)

Goal: Optimization and Policy. Logic:

  • Diversity: Don't show 10 cat videos in a row.
  • Freshness: Boost newly uploaded content.
  • Fairness: Ensure creator diversity.
  • Deduplication: Remove videos already watched.

Output: Final 10-20 videos sent to the user.
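
The Recommendation Service glues these three stages together with a thin orchestration layer. A minimal sketch under the assumptions of this section; retrieve_candidates, rank, and get_watch_history are hypothetical placeholders for the retrieval, ranking, and feature-lookup components described above:

python
def get_feed(user_id: str, limit: int = 10) -> list[dict]:
    # Stage 1: Retrieval - ANN search over video embeddings (~500 candidates).
    candidates = retrieve_candidates(user_id, k=500)

    # Stage 2: Ranking - the heavy model scores only the surviving candidates.
    scored = rank(user_id, candidates)                 # [(video, score), ...]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    # Stage 3: Re-ranking - business rules applied to the ordered list.
    watched = get_watch_history(user_id)
    feed, per_category = [], {}
    for video, score in scored:
        if video["video_id"] in watched:               # deduplication
            continue
        category = video["category"]
        if per_category.get(category, 0) >= 3:         # diversity cap per category
            continue
        per_category[category] = per_category.get(category, 0) + 1
        feed.append(video)
        if len(feed) == limit:
            break
    return feed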

5. Core Algorithm: The Two-Tower Model

Modern retrieval uses "Two-Tower" Neural Networks to create embeddings. An embedding is a vector (a list of numbers, e.g., [0.1, -0.5, 0.8]) that represents the "essence" of a user or video.

Architecture

The Math

The similarity score $S(u, v)$ is usually the dot product of the user vector $u$ and video vector $v$:

$$S(u, v) = \sum_{i=1}^{d} u_i \cdot v_i = u^T v$$

To turn this into a probability (e.g., probability of a click), we often use the Sigmoid function:

$$P(\text{click} \mid u, v) = \frac{1}{1 + e^{-(u^T v)}}$$

In Plain English: The "Two Towers" effectively map users and videos into the same geometric space. The model learns to place a user close to the videos they will like. The dot product is just a mathematical ruler measuring "how aligned are these two vectors?" If the user vector points North and the video vector points North, the score is high. If they point in opposite directions, the score is low.

🎯 Interview Tip: Why two separate towers? Because the Video Tower can be run offline. We compute vectors for all 10M videos once and store them. The User Tower runs online (real-time) to capture their latest mood, producing a query vector to search against the stored video vectors.
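
A minimal two-tower sketch in PyTorch. The feature dimensions, layer sizes, and ID counts are illustrative assumptions, not a production architecture:

python
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, num_ids: int, num_features: int, dim: int = 64):
        super().__init__()
        self.id_embedding = nn.Embedding(num_ids, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim + num_features, 128), nn.ReLU(),
            nn.Linear(128, dim),
        )

    def forward(self, ids: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.id_embedding(ids), features], dim=-1)
        return nn.functional.normalize(self.mlp(x), dim=-1)  # unit-length embeddings

user_tower = Tower(num_ids=10_000, num_features=8)    # toy sizes; production shards 100M+ IDs
video_tower = Tower(num_ids=10_000, num_features=16)

u = user_tower(torch.tensor([42]), torch.randn(1, 8))   # online: encode the user's current context
v = video_tower(torch.tensor([7]), torch.randn(1, 16))  # offline: pre-computed for the catalog
score = (u * v).sum(dim=-1)       # dot product S(u, v)
p_click = torch.sigmoid(score)    # sigmoid turns the score into a click probability
print(p_click)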

Training: Loss Function & Negative Sampling

The model learns from positive pairs (user clicked video) and negative pairs (user did NOT click). Without negatives, the model would think every video is relevant.

Loss Function: Binary Cross-Entropy

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\sigma(u_i^T v_i)) + (1 - y_i) \log(1 - \sigma(u_i^T v_i)) \right]$$

Where:

  • $y_i = 1$ for positive interactions (clicks, watches)
  • $y_i = 0$ for negative samples
  • $\sigma$ = sigmoid function
  • $u_i^T v_i$ = dot product of user and item embeddings

In Plain English: The loss function penalizes the model when it predicts "high relevance" for videos the user ignored, and "low relevance" for videos the user loved. Over millions of examples, the model learns to distinguish good recommendations from bad ones.

Negative Sampling Strategies

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Random Negatives | Sample random videos as negatives | Simple, fast | Too easy; model doesn't learn edge cases |
| Hard Negatives | Use videos the user almost clicked but didn't | Strong learning signal | Can destabilize training |
| In-Batch Negatives | Other users' positives become your negatives | Computationally efficient | Popularity bias (popular items over-sampled) |
| Mixed Strategy | Combine 80% random + 20% hard | Balanced learning | More complex to implement |

🎯 Interview Tip: When discussing negative sampling, mention that "in-batch negatives" is the industry standard at scale (used by Google, YouTube, Pinterest) because it's computationally free—you reuse the other positive examples in the same training batch as negatives.
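
A sketch of one training step with in-batch negatives, written here as a softmax over the batch, a common variant of the per-pair binary loss above. Batch size and embedding dimension are illustrative:

python
import torch
import torch.nn.functional as F

def in_batch_loss(U: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """U, V: [batch, dim] embeddings for a batch of (user, clicked video) pairs."""
    logits = U @ V.T                    # [batch, batch] all pairwise dot products
    labels = torch.arange(U.size(0))    # diagonal entries are the true positives
    # Row i: user i's clicked video is the positive; every other video in the batch
    # acts as a "free" negative, no extra sampling or forward passes required.
    return F.cross_entropy(logits, labels)

U = F.normalize(torch.randn(256, 64), dim=-1)
V = F.normalize(torch.randn(256, 64), dim=-1)
print(in_batch_loss(U, V))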

Training Pipeline Architecture

How do embeddings get updated as user behavior changes?

Key Decisions:

  • User embeddings: Update frequently (hourly or real-time) to capture current session intent.
  • Video embeddings: Update daily (content doesn't change as fast as user mood).
  • Full retraining: Weekly, to incorporate new videos and let old patterns decay.

6. Deep Dive: Vector Search (The Retrieval Engine)

Once we have 10 million video vectors, how do we find the 500 closest ones to our user? A brute-force scan (for video in all_videos: compute_score()) is too slow: $O(N)$ per query.

We use Approximate Nearest Neighbor (ANN) algorithms.

Technology Choices

| Technology | Latency | Scale | Best For |
|---|---|---|---|
| Faiss (Meta) | <1ms | Billions | High-performance, self-hosted clusters |
| ScaNN (Google) | <1ms | Billions | State-of-the-art accuracy/speed trade-off on CPUs |
| Pinecone/Milvus | ~10ms | 100M+ | Managed services if you don't want to maintain infrastructure |

How It Works (HNSW Index)

HNSW (Hierarchical Navigable Small World) builds a multi-layered graph. It's like a highway system for data.

  1. Start at the top layer (interstates) to get to the general neighborhood.
  2. Drop down to local roads to find the exact destination.
  3. This reduces search from $O(N)$ to $O(\log N)$.

📊 Real-World Example: Pinterest uses HNSW to search billions of images. When you click a photo, they convert it to a vector and find visually similar images in milliseconds using this graph traversal.
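
A minimal Faiss sketch of this retrieval step, using the IndexHNSWFlat index from the faiss-cpu package. The index parameters and vector counts are illustrative; because the vectors are normalized, L2 distance ranks candidates the same way the dot product does:

python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package

dim = 64
video_vectors = np.random.rand(10_000, dim).astype("float32")   # toy stand-in for 10M vectors
video_vectors /= np.linalg.norm(video_vectors, axis=1, keepdims=True)

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = graph connectivity (M)
index.hnsw.efSearch = 64               # search breadth vs. latency trade-off
index.add(video_vectors)               # offline: index the video tower's output

user_vector = np.random.rand(1, dim).astype("float32")           # online: user tower output
user_vector /= np.linalg.norm(user_vector)
distances, candidate_ids = index.search(user_vector, 500)        # top-500 approximate neighbors
print(candidate_ids[0][:10])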

7. Ranking Architecture (The Precision Layer)

After retrieval gives us ~500 candidates, we need precision. The ranking model is heavier because it only runs on 500 items, not 10 million.

Model: Deep Learning Recommendation Model (DLRM) or Wide & Deep.

  • Wide Part: Memorizes specific interactions (e.g., "User 123 clicked Video 456"). Good for exceptions.
  • Deep Part: Generalizes patterns (e.g., "Sci-Fi fans usually like Tech reviews"). Good for exploration.

Feature Crossing

We don't just feed raw data; we combine features.

  • User_Country × Video_Language (Does the user speak the video's language?)
  • Time_of_Day × Video_Category (Cartoons in the morning, Horror at night?)
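
A minimal sketch of one common way to produce crossed features: concatenate the two categorical values and hash them into a fixed number of buckets, which then index an embedding table (or a column in the wide part). The bucket count and feature strings are illustrative:

python
import hashlib

def cross(feature_a: str, feature_b: str, num_buckets: int = 100_000) -> int:
    # Deterministically hash the combined categorical values into a fixed bucket range.
    digest = hashlib.md5(f"{feature_a}__{feature_b}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

country_x_language = cross("user_country=BR", "video_language=pt")
time_x_category = cross("time_of_day=morning", "video_category=cartoons")
print(country_x_language, time_x_category)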

🔑 Key Insight: The ranking stage is where you optimize for business objectives. You might have one model predicting "Probability of Click" (CTR) and another predicting "Probability of Completion" (CVR).

$$\text{Final Score} = w_1 \cdot P(\text{Click}) + w_2 \cdot P(\text{WatchTime})$$

If you only optimize for clicks, you get clickbait. If you optimize for watch time, you get quality.

For a deeper dive into the statistical metrics behind these predictions, check out our guide on The Bias-Variance Tradeoff.

8. API Design

The API needs to be simple but informative for the client.

http
GET /v1/recommendations/feed?user_id=u_12345&limit=10

Response:
{
  "items": [
    {
      "video_id": "v_555",
      "title": "System Design Interview Guide",
      "thumbnail_url": "...",
      "score": 0.98,
      "source": "retrieval_algo_v2",
      "tracking_token": "token_abc123" 
    },
    ...
  ],
  "next_page_token": "page_2_xyz",
  "latency_ms": 145
}

🎯 Interview Tip: Always include a tracking_token in the response. When the user clicks the video, the client sends this token back. This allows the backend to join the "Click" event with the exact model version and features that generated the recommendation, which is crucial for debugging and training.
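
One possible implementation of that join, reusing the same key-value store pattern as the feature store in Section 3. The host, token format, and retention window are assumptions:

python
import json
import uuid
import redis  # assumes the redis-py package

r = redis.Redis(host="feature-store", port=6379)   # assumed host

def issue_tracking_token(user_id: str, video_id: str, model_version: str, features: dict) -> str:
    # Serving time: remember exactly which model and features produced this impression.
    token = f"token_{uuid.uuid4().hex[:12]}"
    payload = {"user_id": user_id, "video_id": video_id,
               "model_version": model_version, "features": features}
    r.set(token, json.dumps(payload), ex=7 * 86_400)   # keep for 7 days
    return token

def on_click(token: str) -> dict:
    # Click time: join the event back to its impression context for debugging and labels.
    raw = r.get(token)
    return json.loads(raw) if raw else {}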

9. Cold Start Handling

The cold start problem is one of the most common interview questions for recommendation systems. It occurs when we have no interaction history for a user or an item.

New User Strategies

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Global Popularity | Show trending/most-watched videos | Simple, always works | No personalization |
| Demographic Similarity | Find similar users by age/location | Some personalization | Privacy concerns |
| Onboarding Quiz | Ask explicit preferences | High-quality signal | User friction |
| Exploration | Show diverse content, learn fast | Builds profile quickly | Initial experience may be poor |

New Item Strategies

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Content Features | Use title, tags, thumbnail embeddings | Immediate availability | Ignores taste |
| Creator Transfer | New video inherits creator's audience | Leverages existing data | Unfair to new creators |
| Freshness Boost | Temporarily boost new content score | Ensures visibility | May show low-quality content |
| Exploration Budget | Reserve 5-10% of impressions for new items | Gathers signal fast | Slight quality hit |

🎯 Interview Tip: When asked about cold start, don't just say "use popularity." Show you understand the trade-off: popularity works but creates a "rich get richer" problem. Mention exploration/exploitation (Multi-Armed Bandits, Thompson Sampling) to show depth.
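
One way to act on that exploration/exploitation framing: a minimal Beta-Bernoulli Thompson Sampling sketch for allocating the reserved exploration slots among new videos. The video IDs and pool size are illustrative assumptions:

python
import random

class ThompsonSampler:
    """Thompson Sampling over a small pool of cold-start videos."""
    def __init__(self, video_ids):
        self.stats = {v: {"clicks": 0, "impressions": 0} for v in video_ids}

    def pick(self) -> str:
        # Sample a plausible CTR for each video from Beta(clicks + 1, misses + 1),
        # then show the video with the highest sampled CTR. Uncertain videos get
        # explored; consistently ignored ones fade out on their own.
        def sampled_ctr(s):
            return random.betavariate(s["clicks"] + 1,
                                      s["impressions"] - s["clicks"] + 1)
        return max(self.stats, key=lambda v: sampled_ctr(self.stats[v]))

    def update(self, video_id: str, clicked: bool) -> None:
        self.stats[video_id]["impressions"] += 1
        self.stats[video_id]["clicks"] += int(clicked)

sampler = ThompsonSampler(["v_new_1", "v_new_2", "v_new_3"])
slot = sampler.pick()              # fill the reserved 5-10% exploration slot
sampler.update(slot, clicked=True)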

10. Evaluation Metrics

You cannot improve what you cannot measure. Recommendation systems require BOTH offline and online metrics.

Offline Metrics (Before Deployment)

These are computed on historical data before pushing a model to production.

| Metric | Formula | What It Measures | Target |
|---|---|---|---|
| Precision@K | (Relevant in Top K) / K | Of the K shown, how many were relevant? | > 0.3 |
| Recall@K | (Relevant in Top K) / Total Relevant | Of ALL relevant items, how many did we find? | > 0.5 |
| NDCG@K | DCG / IDCG | Ranking quality (rewards good items at top positions) | > 0.7 |
| AUC | Area Under ROC | Click prediction accuracy | > 0.8 |
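
A minimal sketch of how the first three metrics are computed for a single user, where the relevant set is the held-out items the user actually engaged with (the ranked list and relevance set below are toy data):

python
import math

def precision_recall_at_k(ranked: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k, hits / len(relevant)

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    # DCG rewards relevant items near the top via log-discounted positions; IDCG is the
    # best achievable DCG, so the ratio is 1.0 for a perfect ranking.
    dcg = sum(1 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

ranked = ["v_1", "v_9", "v_3", "v_7", "v_2"]
relevant = {"v_3", "v_2", "v_8"}
print(precision_recall_at_k(ranked, relevant, k=5))   # (0.4, 0.666...)
print(ndcg_at_k(ranked, relevant, k=5))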

Online Metrics (A/B Testing in Production)

These are the real business metrics measured after deployment.

| Metric | What It Measures | Why It Matters |
|---|---|---|
| CTR | Click-through rate | Immediate engagement signal |
| Watch Time | Total minutes watched | Quality of recommendations |
| Completion Rate | % of videos watched to end | Content-match quality |
| D1/D7 Retention | Users returning after 1/7 days | Long-term platform health |
| Diversity | Unique categories in feed | Prevents filter bubbles |

⚠️ Common Pitfall: Optimizing only for CTR leads to clickbait. Optimizing only for watch time leads to addictive content. The best systems use a weighted combination:

$$\text{Score} = w_1 \cdot P(\text{Click}) + w_2 \cdot P(\text{Complete}) + w_3 \cdot P(\text{Return})$$

11. Trade-offs & Alternatives

Every design decision has a cost. Here are the major ones for this architecture.

Architectural Decisions

| Decision | Option A | Option B | We Chose | Why |
|---|---|---|---|---|
| Embedding Update | Real-time | Batch (Daily) | Hybrid | User embeddings update near real-time (to catch immediate interests); video embeddings update daily (content changes slowly). |
| Database | SQL (Postgres) | NoSQL (Cassandra) | Both | Postgres for user metadata (ACID compliance); Cassandra/DynamoDB for interaction logs (massive write throughput). |
| Serving | Pre-compute (Cache) | On-the-fly | On-the-fly | Pre-computing fails for active users whose context changes fast. We only cache the "Head" (popular) queries. |
| Model Format | TensorFlow SavedModel | ONNX | ONNX | Framework-agnostic, works with Triton Inference Server, easier to optimize. |
| Vector DB | Self-hosted Faiss | Managed Pinecone | Faiss on K8s | Cost savings at scale. Pinecone gets expensive at 10M+ vectors. |

Model Architecture Trade-offs

| Approach | Latency | Accuracy | Cold Start | Interpretability |
|---|---|---|---|---|
| Matrix Factorization | Very Fast | Good | Poor | Medium |
| Two-Tower DNN | Fast | Very Good | Good | Low |
| Transformer (Sequential) | Slow | Excellent | Excellent | Very Low |
| Graph Neural Network | Medium | Very Good | Good | Low |

🎯 Interview Tip: When asked "why not use Transformers everywhere?", explain the latency constraint. Transformers are $O(n^2)$ in sequence length. For retrieval over 10M items, even a small Transformer per item is prohibitive. We use Transformers only in ranking (500 items) or for encoding text features offline.

Case Study: Netflix vs. TikTok

| Feature | Netflix | TikTok |
|---|---|---|
| Context | "Lean Back" (TV, 2-hour movies) | "Lean Forward" (Mobile, 15s clips) |
| Signal Density | Low (1 movie decision per night) | Extremely High (100 swipes per session) |
| Exploration | Low Risk (Users stick to safe choices) | High Risk (Must show new viral trends instantly) |
| Architecture | Heavy reliance on long-term history | Heavy reliance on immediate session context (RNNs/Transformers) |

Netflix focuses on accuracy because starting a bad 2-hour movie is painful. TikTok focuses on speed and adaptation because skipping a bad 15-second video is costless.

12. Conclusion

Designing a recommendation system at scale is about managing the funnel. You start with millions of items, filter them down with fast approximate algorithms (retrieval), and then carefully rank the survivors with heavy deep learning models.

Key Takeaways for Interviews

  1. Always start with requirements. Clarify scale, latency, and cold start before drawing boxes.

  2. Separate retrieval from ranking. This is the most important architectural decision. Retrieval prioritizes recall (fast, approximate). Ranking prioritizes precision (slow, accurate).

  3. Don't forget cold start. New users and new items break collaborative filtering. Always have a fallback (popularity, content-based features, exploration).

  4. Measure both offline AND online metrics. High offline NDCG doesn't guarantee high online watch time. Always A/B test.

  5. Negative sampling matters. Without negatives, the model thinks everything is relevant. In-batch negatives are the industry standard.

The "secret sauce" isn't just the algorithm—it's the infrastructure that feeds it:

  • Kafka for real-time data ingestion at 500GB/day.
  • Vector Databases (Faiss) for sub-millisecond search over 10M embeddings.
  • Feature Stores (Redis) to provide instant user context without querying raw logs.
  • Model Serving (Triton) for low-latency GPU inference at 50K QPS.

Practice Problems

  1. Trending Now: Design a "Trending Now" section. Does it fit into the main funnel, or is it a separate cache?

  2. Diversity vs. Relevance: How would you ensure users see content from at least 5 different categories? Where in the pipeline do you enforce this?

  3. Creator Fairness: Small creators complain their videos never get recommended. How would you address this without hurting engagement metrics?

  4. Real-Time Personalization: A user just watched 3 cooking videos in a row. How fast can you update their recommendations?

These questions test whether you understand not just the "what" but the "why" behind each architectural decision.