LLM Judges Improve PDF Table Extraction Evaluation

The paper presents a benchmarking framework built on synthetically generated PDFs with precise LaTeX ground truth, with tables sourced from arXiv to ensure realistic complexity. It introduces an LLM-as-a-judge semantic evaluation integrated into a matching pipeline and shows that LLM-based scores correlate with human judgments at Pearson r=0.93, versus r=0.68 for TEDS and r=0.70 for GriTS. An evaluation of 21 PDF parsers across 100 documents (451 tables) reveals major performance gaps and provides practical guidance for parser selection.
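
For readers unfamiliar with how a metric is validated against human judgment, the Pearson correlation quoted above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `pearson_r` and the score lists are hypothetical placeholders, and none of the numbers come from the benchmark.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-table scores (illustrative only, not data from the paper).
human_scores = [0.9, 0.4, 0.7, 0.2, 1.0]
metric_scores = [0.8, 0.5, 0.6, 0.3, 0.9]

print(f"Pearson r = {pearson_r(human_scores, metric_scores):.2f}")
```

A value near 1 means the automatic metric ranks and spaces tables much as human raters do, which is the sense in which r=0.93 is compared with r=0.68 and r=0.70. On Python 3.10 or later, the standard library's `statistics.correlation` computes the same quantity.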

Scoring Rationale

The paper makes a strong methodological contribution and offers an actionable benchmark supported by human validation, but it is a single-source arXiv preprint that has not yet been confirmed through peer review.

