Nugget-Based RAG Systems Expose Evaluation Circularity

Dietz et al. (Jan 19, 2026) show that nugget-based retrieval-augmented generation (RAG) systems can produce inflated evaluation results when they are optimized against LLM judges. In experiments comparing Ginger and Crucible to GPT-Researcher, a deliberately modified version of Crucible achieved near-perfect scores when the judge prompts or gold nuggets leaked or became predictable. The authors call for blind evaluation settings and methodological diversity to prevent metric overfitting.
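
To make the circularity concrete, here is a minimal toy sketch of nugget-style scoring. It is not the authors' evaluation code: the function `nugget_recall`, the substring-matching "judge", and the example nuggets and responses are all hypothetical, chosen only to show why a system that knows the gold nuggets can score perfectly without being a better system.

```python
def nugget_recall(response: str, nuggets: list[str]) -> float:
    """Toy stand-in for an LLM judge: fraction of gold nuggets found verbatim."""
    if not nuggets:
        return 0.0
    text = response.lower()
    hits = sum(1 for nugget in nuggets if nugget.lower() in text)
    return hits / len(nuggets)


# Hypothetical gold nuggets for one query (illustrative only).
gold_nuggets = [
    "blind evaluation",
    "metric overfitting",
    "llm judges",
    "gold nuggets",
]

# A system that never saw the nuggets covers only some of them.
honest_response = (
    "The paper argues that LLM judges can be gamed and recommends blind evaluation."
)

# A system with leaked nuggets can simply echo them back.
leaked_response = " ".join(gold_nuggets)

honest_score = nugget_recall(honest_response, gold_nuggets)
leaked_score = nugget_recall(leaked_response, gold_nuggets)

print(f"honest: {honest_score:.2f}, leaked: {leaked_score:.2f}")
```

In this sketch the leaked response reaches the maximum score purely by echoing the nuggets, which is the kind of inflation a blind evaluation setting is meant to rule out.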

Scoring Rationale

High novelty and broad impact drive the score; it is limited by reliance on a single arXiv preprint that has not been peer reviewed.


