PolyBench Evaluates LLM Forecasting and Trading Performance

PolyBench is a new multimodal benchmark that evaluates Large Language Models on timestamped prediction-market states paired with synchronized order-book and news data. The dataset records 38,666 binary markets across 4,997 events with snapshot-aligned Central Limit Order Book (CLOB) states and a live news stream. The authors ran seven state-of-the-art LLMs, producing 36,165 predictions between February 6 and 12, 2026, and assessed performance using directional accuracy, a new Confidence-Weighted Return (CWR) metric, Annualized Percentage Yield (APY), and Sharpe ratio under simulated order-book execution. Only two models, MiMo-V2-Flash (17.6% CWR) and Gemini-3-Flash (6.2% CWR), produced positive returns; the others lost money despite high reported confidence. PolyBench provides a contamination-resistant, financially grounded evaluation that exposes gaps in probabilistic calibration and decision-making for LLMs in real-world market settings.
What happened
PolyBench, a new multimodal benchmark, couples timestamped snapshots from Polymarket with Central Limit Order Book (CLOB) states and a synchronized news stream to evaluate LLMs on real-time forecasting and trading. The dataset contains 38,666 binary markets spanning 4,997 events, and the authors generated 36,165 predictions from seven state-of-the-art LLMs over market states collected between February 6 and 12, 2026. Performance is measured with directional accuracy and with financial metrics computed under simulated order-book execution: the new Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio. Only two models finished with positive returns, MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR; the remaining five produced net losses despite often high expressed confidence.
Technical details
PolyBench enforces timestamp-locked, point-in-time evaluation by recording market cross-sections concurrently with CLOB depth and a real-time news feed. Practitioners should note:
- Evaluation metrics: directional accuracy, CWR, APY, Sharpe ratio
- Execution model: simulated order-book fills using recorded CLOB states to estimate slippage and realistic PnL
- Data controls: contamination-resistant collection and timestamping to prevent lookahead bias
- Scale: 38,666 binary markets, 4,997 events, 36,165 model predictions
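The execution-model item above can be illustrated with a minimal sketch of how a recorded ask ladder turns a budget into a fill price. This is not PolyBench's simulator: the names `Fill`, `simulate_market_buy`, and `settle_binary` are hypothetical, the ask levels are made-up numbers, and fees are ignored. The sketch only assumes that a YES share in a binary market pays 1.0 if the market resolves YES and 0 otherwise.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Fill:
    shares: float     # shares actually bought
    cost: float       # total cash spent
    avg_price: float  # volume-weighted average fill price


def simulate_market_buy(asks: List[Tuple[float, float]], budget: float) -> Fill:
    """Walk (price, size) ask levels, cheapest first, spending up to `budget`."""
    shares = 0.0
    cost = 0.0
    for price, size in sorted(asks):
        remaining = budget - cost
        if remaining <= 1e-12:  # budget exhausted (tolerate float rounding)
            break
        take = min(size, remaining / price)  # cap by level size and by cash left
        shares += take
        cost += take * price
    avg_price = cost / shares if shares > 0 else 0.0
    return Fill(shares, cost, avg_price)


def settle_binary(fill: Fill, outcome_yes: bool) -> float:
    """PnL of a YES position: each share pays 1.0 on YES, 0.0 on NO."""
    payout = fill.shares if outcome_yes else 0.0
    return payout - fill.cost


# Hypothetical ask ladder for the YES side: (price, shares available).
asks = [(0.50, 100.0), (0.52, 100.0), (0.55, 200.0)]
fill = simulate_market_buy(asks, budget=76.0)
print(fill)
print(settle_binary(fill, outcome_yes=True), settle_binary(fill, outcome_yes=False))
```

Because the order consumes the whole 0.50 level and part of the 0.52 level, the average fill price ends up above the best ask. That gap is the slippage a top-of-book evaluation would miss.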
Context and significance
PolyBench moves beyond static text or synthetic tasks by combining natural-language signals with market microstructure, forcing models to reconcile qualitative news with quantitative order-book dynamics under strict temporal constraints. This exposes weaknesses in calibration and probabilistic reasoning that are invisible in standard NLP benchmarks. For model builders, the paper highlights the difference between fluent confidence and economically useful probability estimates; many models overstate certainty and lose money when execution costs and market impact are included. PolyBench also establishes a reproducible, finance-grounded evaluation scaffold useful for research into uncertainty-aware output, temperature and calibration tuning, and integration of market-aware components.
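The gap between confidence and useful probabilities can be made concrete with a small sketch. It uses the Brier score, a standard calibration measure that is not among PolyBench's reported metrics, and the forecasts and outcomes are invented for illustration. The function names are hypothetical.

```python
from typing import Sequence


def directional_accuracy(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Fraction of forecasts on the correct side of 0.5."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Invented example: both forecasters say YES on all four markets; three resolve YES.
outcomes = [1, 1, 1, 0]
overconfident = [0.95, 0.95, 0.95, 0.95]
modest = [0.70, 0.70, 0.70, 0.70]

for name, probs in [("overconfident", overconfident), ("modest", modest)]:
    print(name, directional_accuracy(probs, outcomes), brier_score(probs, outcomes))
```

Both forecasters have the same directional accuracy, but the overconfident one scores worse on the Brier score because its single miss was made at 95% certainty. A trader sizing positions by stated confidence would feel that miss in PnL as well.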
What to watch
Watch for adoption of PolyBench by model developers and leaderboard maintainers, replication of the results across more model families, and follow-up work that adds risk-aware decision layers or calibration techniques to turn fluent predictions into profitable, execution-aware actions.
Scoring Rationale
PolyBench introduces a novel, realistic evaluation that matters for model calibration and decision-making research, but it is a single benchmark paper rather than a platform-changing release. The April 3, 2026 submission date reduces immediacy, so the score reflects notable technical value with moderate near-term impact.

