MASPO Improves RL Reward Optimization for LLMs

Jiaye Lin (arXiv v1, Feb 19, 2026) proposes Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework for reinforcement learning with verifiable rewards (RLVR) that addresses three limitations of methods such as GRPO. MASPO combines three components: differentiable soft Gaussian gating, a mass-adaptive limiter, and an asymmetric risk controller. The paper reports that MASPO significantly outperforms strong baselines, and code is available.
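
To make the contrast with GRPO concrete, the sketch below compares the hard-clipped surrogate term used by GRPO-style methods with one possible Gaussian soft gate on the probability ratio. It is only an illustration under stated assumptions: the function names, the gate formula `exp(-(ratio - 1)^2 / (2 * sigma^2))`, and the default `sigma` are hypothetical and are not taken from the MASPO paper. The mass-adaptive limiter and asymmetric risk controller are not modeled here.

```python
import math

def group_advantages(rewards):
    """Group-relative advantages in the GRPO style: standardize the
    rewards of all responses sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]

def hard_clip_term(ratio, adv, eps=0.2):
    """PPO/GRPO-style clipped surrogate for one token.
    Outside [1 - eps, 1 + eps] the term can become flat in the ratio."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)

def soft_gate_term(ratio, adv, sigma=0.2):
    """Hypothetical soft alternative (an assumption, not the paper's formula):
    a Gaussian weight centered at ratio = 1 that decays smoothly instead of
    cutting off, so the term stays differentiable everywhere."""
    gate = math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))
    return gate * ratio * adv

if __name__ == "__main__":
    advs = group_advantages([1.0, 0.0, 1.0, 0.0])  # verifiable 0/1 rewards
    for ratio in (0.8, 1.0, 1.2, 1.5):
        print(ratio, hard_clip_term(ratio, advs[0]), soft_gate_term(ratio, advs[0]))
```

In this toy setup the soft gate leaves an on-policy token (ratio of 1) untouched and shrinks the contribution of far off-policy tokens gradually, whereas the hard clip switches abruptly at the trust-region boundary.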

Scoring Rationale
Strong practical contributions and usable code, but the evaluation is limited and the work is a single arXiv preprint that has not been peer reviewed.

