Research: on-policy RL, LLM alignment, sparse rewards
SAGE Improves GRPO Under Sparse Rewards
Relevance Score: 8.1
Dong et al. (Feb 3, 2026) propose SAGE, an on-policy RL framework that injects privileged compact hints during Group Relative Policy Optimization (GRPO) training to increase within-group outcome diversity and prevent advantage collapse under sparse terminal verifier rewards. Evaluated across six benchmarks with three LLMs, SAGE reports average improvements of +2.0 (Llama-3.2-3B), +1.2 (Qwen2.5-7B), and +1.3 (Qwen3-4B); code is released.
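
The mechanism the summary describes can be illustrated concretely. In GRPO, each rollout's advantage is its reward normalized against the group's mean and standard deviation, so if a sparse terminal verifier gives the same reward to every rollout in a group (e.g., all failures on a hard prompt), every advantage is zero and the gradient for that group vanishes. Below is a minimal sketch of that collapse and of hint injection restoring diversity; the helper names (`group_advantages`, `with_hint`) are illustrative assumptions, not the paper's actual API, and the paper's real hint mechanism may differ from this simple prompt prefix.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: normalize each rollout's
    reward against its group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Advantage collapse: under a sparse terminal verifier, a hard prompt
# often produces an all-zero reward group, so every advantage is 0 and
# the policy gradient for that group carries no learning signal.
print(group_advantages([0, 0, 0, 0]))   # -> [0. 0. 0. 0.]

def with_hint(prompt, hint):
    """Hypothetical hint injection (illustrative only): prepend a
    compact privileged hint so some rollouts can succeed."""
    return f"{hint}\n{prompt}"

# If hints let two of four rollouts pass the verifier, the group has
# mixed outcomes and the advantages become nonzero again:
print(group_advantages([1, 0, 1, 0]))   # -> [ 1. -1.  1. -1.]
```

The second call shows why within-group outcome diversity matters: a mixed reward group yields nonzero advantages, so the policy update can push toward the successful rollouts.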

