Researchers Analyze Noise In LLM Evaluations
In work dated December 24, 2025, researchers (Sida I. Wang) define and measure three types of noise in LLM evaluations: prediction noise, data noise, and their combination, total noise. They introduce an all-pairs paired method that draws on millions of question-level predictions, show that each evaluation has a characteristic total noise level, and report that averaging predictions reduces prediction noise and substantially increases statistical power.
Key Points
- Define and measure three noise types across LLM evals: prediction noise, data noise, and total noise
- Reveal that each evaluation exhibits a consistent total noise level across model pairs and settings
- Show that paired prediction noise often exceeds data noise, so averaging predictions boosts statistical power
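To make the distinction between the noise types concrete, here is a minimal sketch based on the law of total variance. It is an illustration under stated assumptions, not the paper's actual method or code. The function name `paired_noise`, the per-question correctness probabilities `p_a` and `p_b`, and the parameter `k` (number of independent predictions averaged per question) are all hypothetical. The sketch also assumes predictions are independent across models and samples, and that questions are resampled with replacement.

```python
from statistics import fmean


def paired_noise(p_a, p_b, k=1):
    """Variance components of the mean score difference between two models.

    p_a, p_b: hypothetical per-question probabilities that model A / model B
              answers each question correctly (same questions, same order).
    k:        number of independent predictions averaged per question.
    Returns (prediction_var, data_var, total_var).
    """
    n = len(p_a)
    diffs = [a - b for a, b in zip(p_a, p_b)]
    mean_diff = fmean(diffs)

    # Prediction noise: questions held fixed, predictions resampled.
    # Each prediction is a Bernoulli draw; averaging k draws divides its variance by k.
    prediction_var = sum(
        a * (1 - a) + b * (1 - b) for a, b in zip(p_a, p_b)
    ) / (k * n * n)

    # Data noise: questions resampled, expected per-question differences used.
    data_var = sum((d - mean_diff) ** 2 for d in diffs) / (n * n)

    # Law of total variance: the two components add up to the total.
    return prediction_var, data_var, prediction_var + data_var


# Made-up probabilities for two similar models on four questions.
p_a = [0.9, 0.6, 0.5, 0.2]
p_b = [0.8, 0.5, 0.5, 0.1]

pred1, data1, total1 = paired_noise(p_a, p_b, k=1)
pred8, data8, total8 = paired_noise(p_a, p_b, k=8)
print(f"k=1: prediction={pred1:.5f} data={data1:.5f} total={total1:.5f}")
print(f"k=8: prediction={pred8:.5f} data={data8:.5f} total={total8:.5f}")
```

In this toy setup the two models have similar per-question probabilities, so the paired data noise is small and prediction noise dominates. Averaging `k` predictions shrinks only the prediction term, which is why it can tighten the comparison without adding questions.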
Scoring Rationale
Strong methodological advance with practical evaluation guidance; limited by being a single preprint without peer review.
