ProFact applies agentic RL to fact verification

June 12, 2026 | By LDS Team

Relevance Score: 6.2

Researchers at Sun Yat-sen University have introduced ProFact, an agentic reinforcement learning framework that trains a single policy to handle all four stages of automated fact verification: claim decomposition, evidence retrieval, answer generation, and verdict prediction. The paper was posted to arXiv on June 11, 2026 (arXiv:2606.13262). Rather than tuning those stages separately, ProFact treats the full pipeline as one trajectory, organized as a three-stage Question-Search-Verdict rollout, and applies process-aware rewards so that early decisions can adapt to what the final verdict actually needs. On the AVeriTeC benchmark, the authors report that it outperforms established baselines, including HerO and InFact, on both accuracy and inference efficiency. For practitioners building fact-checking or retrieval-augmented pipelines, the result argues for training multi-stage systems end to end rather than stage by stage.

The paper is a data point in a broader shift: optimizing the whole pipeline end to end with reinforcement learning rather than tuning each stage in isolation. The core practical claim is that a single policy trained with stage-level rewards can out-coordinate systems in which each module (decomposition, retrieval, answer generation, verdict) is trained separately. That is exactly the failure mode teams hit when a well-tuned retriever surfaces evidence that a downstream verdict model cannot use well.

What happened

Researchers Rongxin Yang, Shenghong He, Siyuan Zhu, and Chao Yu (Sun Yat-sen University) posted "From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification" to arXiv on June 11, 2026 (arXiv:2606.13262). The paper introduces ProFact, which frames fact verification as a three-stage Question-Search-Verdict trajectory and trains one policy, via reinforcement learning, to generate verification questions, retrieve and use evidence, and predict a final veracity label, rather than training separate models for each stage.
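To make the trajectory framing concrete, here is a minimal sketch of a Question-Search-Verdict rollout. It is not the authors' code: the `Trajectory` class, the `rollout` function, and the three stub callables are hypothetical names introduced for illustration, and the verdict label is a placeholder. In ProFact the three roles would be played by one trained policy, whereas the stubs below only show how the stages chain together.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """One full verification episode (hypothetical container)."""
    claim: str
    questions: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    verdict: str = ""


def rollout(
    claim: str,
    ask: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    judge: Callable[[str, List[str], List[str]], str],
) -> Trajectory:
    """Run the Question, Search, and Verdict stages in order."""
    traj = Trajectory(claim=claim)
    # Stage 1: decompose the claim into verification questions.
    traj.questions = ask(claim)
    # Stage 2: gather evidence for each question.
    for question in traj.questions:
        traj.evidence.extend(search(question))
    # Stage 3: predict a veracity label from everything collected so far.
    traj.verdict = judge(claim, traj.questions, traj.evidence)
    return traj


# Stub stages standing in for a single policy prompted three ways (assumption).
def stub_ask(claim: str) -> List[str]:
    return [f"What evidence exists for: {claim}?"]


def stub_search(question: str) -> List[str]:
    return [f"retrieved passage for '{question}'"]


def stub_judge(claim: str, questions: List[str], evidence: List[str]) -> str:
    # Placeholder label; a trained policy would condition on the evidence.
    return "Supported" if evidence else "Not Enough Evidence"


example = rollout("The Eiffel Tower is in Paris", stub_ask, stub_search, stub_judge)
```

Because the whole episode lives in one object, a reward can be computed over the questions, the evidence, and the verdict together, which is what makes joint training possible.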

Technical context

ProFact optimizes the policy with GRPO (Group Relative Policy Optimization, the method introduced with DeepSeekMath) over full verification trajectories. Because the final veracity label alone gives a sparse, delayed signal, the authors add a process-aware reward that also scores intermediate steps. It uses METEOR-based matching against gold-standard questions and evidence, so good decomposition and retrieval earn credit along the way rather than only a correct final verdict. Evidence retrieval uses Qwen3-Embedding-0.6B to encode documents into a searchable index. Per the abstract, evaluation on the AVeriTeC fact-verification benchmark shows ProFact consistently outperforming strong baselines, including prior structured-workflow systems such as InFact, HerO, and DebateCV, in both verification accuracy and inference efficiency.
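The reward idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `unigram_f1` is a simplified stand-in for METEOR (no stemming, synonym matching, or fragmentation penalty), the stage weights `w_q`, `w_e`, and `w_v` are invented for the example, and `group_relative_advantages` shows only the group-normalization step commonly associated with GRPO, not the full clipped policy-gradient objective.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1; a crude stand-in for METEOR (assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def process_reward(
    pred_question: str, gold_question: str,
    pred_evidence: str, gold_evidence: str,
    pred_label: str, gold_label: str,
    w_q: float = 0.25, w_e: float = 0.25, w_v: float = 0.5,  # hypothetical weights
) -> float:
    """Blend intermediate-stage scores with the final verdict signal."""
    question_score = unigram_f1(pred_question, gold_question)
    evidence_score = unigram_f1(pred_evidence, gold_evidence)
    verdict_score = 1.0 if pred_label == gold_label else 0.0
    return w_q * question_score + w_e * evidence_score + w_v * verdict_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact variant is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts for the same claim: one matches the gold data, one does not.
good = process_reward("who built it", "who built it",
                      "built by eiffel", "built by eiffel",
                      "Supported", "Supported")
bad = process_reward("unrelated", "who built it",
                     "nothing useful", "built by eiffel",
                     "Refuted", "Supported")
advantages = group_relative_advantages([good, bad])
```

With verdict-only rewards, two rollouts that both reach the wrong label would be indistinguishable. The blended score still separates the one that asked better questions or retrieved better evidence.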

For practitioners

The credit-assignment problem described here, where a retrieval strategy tuned in isolation may not surface the evidence a verdict model actually needs, is common to any multi-stage LLM pipeline, not just fact-checking: RAG systems, multi-hop QA, and agentic research tools face the same coordination gap. ProFact's stage-level reward design is a concrete recipe for testing whether joint optimization pays off before betting a system's evidence-to-verdict pipeline on independently trained modules.

What to watch

As with most single-paper arXiv results, the outperformance claim comes from the authors' own benchmark run; independent replication and a code release are not yet confirmed. Watch for acceptance at a peer-reviewed venue, a code release, and follow-up comparisons against other 2026 process-reward methods for multi-stage LLM pipelines.

Key Points

1. ProFact trains one reinforcement-learning policy to jointly handle claim decomposition, evidence retrieval, and verdict prediction instead of separate stage-specific models.
2. A process-aware, METEOR-scored reward gives intermediate feedback at each stage, addressing the sparse-signal problem of learning only from final verdict accuracy.
3. On the AVeriTeC benchmark, the authors report gains in both accuracy and inference efficiency over prior structured-workflow baselines, favoring end-to-end pipeline training.

Scoring Rationale

Solid, well-scoped research contribution addressing a real credit-assignment problem in multi-stage LLM pipelines, with a clear method (process-aware rewards, GRPO) and a named benchmark (AVeriTeC). Held at 6.2 rather than pushed higher because it is a single, not-yet-independently-verified preprint reporting its own benchmark numbers with no external replication or code release yet.

Sources

Public references used for this report.

arxiv.org: "From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification" (arXiv:2606.13262)
