Dropbox Uses LLMs To Improve Search

Dropbox engineers describe using large language models to scale human relevance labeling for Dash search. They calibrate LLM evaluators against a small human-labeled set, then use those evaluators to produce hundreds of thousands to millions of labels, amplifying human effort roughly 100×. They report that the method improves retrieval ranking, the bottleneck in retrieval-augmented generation (RAG), by combining automated LLM judgments with human oversight and hardest-mistake analysis.
Key Points
- Amplifies human labeling roughly 100× by letting LLMs generate hundreds of thousands or millions of labels
- Improves RAG output because retrieval ranking quality directly impacts final generated answers
- Enables scalable training data for ranking models, requiring human-calibrated evaluation and hardest-mistake analysis
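
To make "human-calibrated evaluation" and "hardest-mistake analysis" concrete, here is a minimal sketch of how an LLM judge could be compared against a small human-labeled set. It is not Dropbox's implementation. The 0 to 4 relevance scale, the `Judgment` record, and the `calibration_report` function are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One query-document pair graded by both a human and an LLM judge."""
    query: str
    doc_id: str
    human: int  # human relevance grade (assumed scale: 0 = irrelevant .. 4 = perfect)
    llm: int    # LLM judge grade on the same assumed scale


def calibration_report(judgments, top_k=3):
    """Summarize LLM-vs-human agreement and surface the largest disagreements."""
    if not judgments:
        raise ValueError("need at least one judgment")
    n = len(judgments)
    # Fraction of pairs where the LLM grade matches the human grade exactly.
    exact_agreement = sum(j.human == j.llm for j in judgments) / n
    # Average absolute distance between the two grades.
    mean_abs_error = sum(abs(j.human - j.llm) for j in judgments) / n
    # "Hardest mistakes": the pairs with the largest gap, excluding exact matches.
    ranked = sorted(judgments, key=lambda j: abs(j.human - j.llm), reverse=True)
    hardest = [j for j in ranked[:top_k] if j.human != j.llm]
    return {
        "exact_agreement": exact_agreement,
        "mean_abs_error": mean_abs_error,
        "hardest": hardest,
    }


if __name__ == "__main__":
    # Made-up sample data, purely illustrative.
    sample = [
        Judgment("q1", "d1", human=2, llm=2),
        Judgment("q1", "d2", human=0, llm=3),
        Judgment("q2", "d3", human=4, llm=4),
        Judgment("q2", "d4", human=1, llm=2),
    ]
    report = calibration_report(sample)
    print(report["exact_agreement"], report["mean_abs_error"])
    for j in report["hardest"]:
        print(j.query, j.doc_id, "human:", j.human, "llm:", j.llm)
```

In a workflow like the one summarized above, the pairs listed under `hardest` would be the ones a human reviews first, and the judge's prompt or rubric would be adjusted until agreement on the human-labeled set is acceptable before the judge is used to label at scale.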
Scoring Rationale
Strong practical impact from scalable, human-calibrated LLM labeling; slightly limited by incremental novelty over existing RAG practices.
