Summarization Alters LLM Relevance Judgment Reliability
Samaneh Mohtadi (submitted Dec. 5, 2025) investigates how text summarization affects LLM-based relevance judgments for information retrieval (IR). Using state-of-the-art LLMs across multiple TREC collections, the study compares judgments made on full documents with judgments made on LLM-generated summaries of varying lengths, measuring agreement with human labels and effects on retrieval evaluation. It finds that summary-based judgments preserve system-ranking stability but introduce systematic label shifts and model/dataset-dependent biases.
Key Points
- Demonstrates that LLM judgments from summaries match system-ranking stability of full-document judgments
- Identifies systematic label distribution shifts and model/dataset-dependent biases introduced by summarization
- Warns practitioners to validate summary length and model choice to avoid misleading IR evaluation results
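The three quantities named in the points above (agreement with human labels, system-ranking stability, and label distribution shift) can be illustrated with a minimal Python sketch. This is not the study's code: the brief does not say which statistics the paper uses, so Cohen's kappa and Kendall's tau are assumed here as common choices in IR evaluation, and all function names, labels, and scores are hypothetical toy values.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def kendall_tau(x, y):
    """Kendall's tau-a over paired system scores (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


def label_shift(reference, other):
    """Change in each label's relative frequency (other minus reference)."""
    count_ref, count_oth = Counter(reference), Counter(other)
    labels = sorted(set(count_ref) | set(count_oth))
    return {k: count_oth[k] / len(other) - count_ref[k] / len(reference)
            for k in labels}


# Hypothetical graded relevance labels (0-2) for the same query-document pairs.
human = [0, 1, 2, 0, 1, 0, 2, 1]
full_doc = [0, 1, 2, 0, 1, 1, 2, 1]   # LLM judging full documents
summary = [0, 1, 1, 0, 1, 1, 2, 1]    # LLM judging summaries

# Hypothetical effectiveness scores for four systems under each label set.
scores_full = [0.41, 0.35, 0.52, 0.28]
scores_summary = [0.44, 0.39, 0.50, 0.30]

print("kappa, full vs human:   ", cohen_kappa(full_doc, human))
print("kappa, summary vs human:", cohen_kappa(summary, human))
print("tau, system rankings:   ", kendall_tau(scores_full, scores_summary))
print("label shift, summary vs full:", label_shift(full_doc, summary))
```

In this toy data the system ranking is identical under both label sets while the label distribution still moves toward the middle grade, which is the kind of pattern the key points describe: ranking stability alone does not rule out a label shift.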
Scoring Rationale
Methodological insight with direct evaluation implications; limited novelty and single preprint source constrain broader impact.
