Authors Extend Policy Gradients to Non-Markovian RL

Avik Kar and six coauthors propose a reward-centric formulation of reinforcement learning in non-Markovian decision processes (NMDPs), introducing the Agent State-Markov (ASM) policy class, a corresponding policy gradient theorem, and the ASMPG algorithm with finite-time and almost sure convergence guarantees (arXiv:2605.10816, submitted 11 May 2026).
What happened
According to the submission, the paper defines the Agent State-Markov (ASM) policy class, in which an internal agent state is recursively updated and a control policy maps that state to actions. It derives a policy gradient theorem for ASM policies in both episodic and infinite-horizon discounted NMDPs and introduces the ASMPG algorithm, which exploits the agent state recursion for optimization. The authors report finite-time and almost sure convergence guarantees and present empirical comparisons in which ASMPG outperforms baselines that learn state representations through predictive objectives.
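To make the ASM structure concrete, the sketch below shows what such a policy class can look like: a parameterized recursive update z_{t+1} = phi(z_t, a_t, o_t) paired with a softmax control head pi(a | z). This is a minimal illustration under stated assumptions, not the authors' implementation; the class name `ASMPolicy`, the tanh update rule, and all weight shapes are assumptions made for the example.

```python
import numpy as np

# Minimal sketch of an Agent State-Markov (ASM) policy: a recursive
# agent-state update plus a softmax control policy over that state.
# The tanh update and all parameter shapes are illustrative
# assumptions, not the paper's design.

class ASMPolicy:
    def __init__(self, state_dim, obs_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        # Parameters of the recursive agent-state update.
        self.W_z = rng.normal(scale=0.1, size=(state_dim, state_dim))
        self.W_o = rng.normal(scale=0.1, size=(state_dim, obs_dim))
        self.W_a = rng.normal(scale=0.1, size=(state_dim, n_actions))
        # Parameters of the control policy pi(a | z).
        self.W_pi = rng.normal(scale=0.1, size=(n_actions, state_dim))
        self.n_actions = n_actions
        self.state_dim = state_dim

    def init_state(self):
        return np.zeros(self.state_dim)

    def update_state(self, z, action, obs):
        # Recursive update: the agent state summarizes the
        # action-observation history seen so far.
        a_onehot = np.eye(self.n_actions)[action]
        return np.tanh(self.W_z @ z + self.W_o @ obs + self.W_a @ a_onehot)

    def action_probs(self, z):
        logits = self.W_pi @ z
        logits -= logits.max()  # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    def act(self, z, rng):
        return rng.choice(self.n_actions, p=self.action_probs(z))

# Example interaction step: the policy acts from its internal state,
# then folds the new action and observation back into that state.
rng = np.random.default_rng(0)
policy = ASMPolicy(state_dim=8, obs_dim=4, n_actions=3)
z = policy.init_state()
a = policy.act(z, rng)
z = policy.update_state(z, a, np.zeros(4))
```

The key design point the ASM class captures is that the policy never conditions on the raw history directly; everything it knows is compressed into z, so optimizing the update parameters and the control head jointly is what determines which history features get remembered.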
Technical details
Editorial analysis: The paper handles non-Markovian dynamics by giving the agent an internal state that is updated recursively, and by optimizing the parameters of that state update jointly with the control policy. The core theoretical result is a gradient expression that generalizes the classical policy gradient theorem to ASM policies; the submission uses that expression to construct the ASMPG optimizer and to derive convergence bounds. The reported empirical tasks are described as non-Markovian benchmarks in which reward depends on the interaction history, and the authors claim performance gains over representation-learning baselines.
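For intuition, the classical theorem gives a gradient of the form ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a_t | s_t) · G_t]; for ASM policies the conditioning state becomes the recursively computed agent state z_t, which itself depends on the update parameters, so the full gradient also flows through the recursion. The sketch below, reusing the hypothetical ASMPolicy class from the previous example, estimates a REINFORCE-style score-function gradient on a toy history-dependent task. For brevity it differentiates only the policy head W_pi; the paper's theorem also covers the state-update parameters, which a full implementation would handle (e.g., via automatic differentiation). The task, horizon, and hyperparameters are assumptions for illustration.

```python
import numpy as np

# REINFORCE-style estimate of the policy-head component of the ASM
# gradient on a toy non-Markovian task: a cue is observed only at
# t = 0 and reward is paid only if the final action matches it.

def rollout(policy, rng, horizon=5):
    cue = rng.integers(2)                 # hidden, history-dependent target
    obs = np.zeros(2); obs[cue] = 1.0     # cue shown only at the first step
    z = policy.init_state()
    action = 0                            # dummy "previous action" at t = 0
    grads_logpi = []                      # per-step score functions
    for t in range(horizon):
        z = policy.update_state(z, action, obs)
        p = policy.action_probs(z)
        action = rng.choice(policy.n_actions, p=p)
        onehot = np.eye(policy.n_actions)[action]
        # d log pi(a|z) / d W_pi for a softmax head: (onehot - p) z^T.
        grads_logpi.append(np.outer(onehot - p, z))
        obs = np.zeros(2)                 # cue disappears after t = 0
    reward = 1.0 if action == cue else 0.0
    return reward, grads_logpi

rng = np.random.default_rng(1)
policy = ASMPolicy(state_dim=8, obs_dim=2, n_actions=2)
lr, batch = 0.5, 64
for step in range(200):
    grad = np.zeros_like(policy.W_pi)
    avg_r = 0.0
    for _ in range(batch):
        r, glp = rollout(policy, rng)
        grad += r * sum(glp) / batch      # score-function estimator
        avg_r += r / batch
    policy.W_pi += lr * grad              # gradient ascent on return
    if step % 50 == 0:
        print(step, avg_r)
```

The cue-matching task is deliberately non-Markovian: the final reward depends on an observation seen only at t = 0, so a memoryless policy cannot exceed chance, and any gain above 0.5 comes from information retained in the recursive agent state.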
Context and significance
Industry context: Extending policy gradient theory to NMDPs addresses a long-standing gap between models that assume Markovian environments and real-world problems with long-range dependencies. For practitioners, a reward-centric approach to learning internal state dynamics shifts evaluation from predictive loss to end-to-end reward optimization, which can change how representation learning is benchmarked in sequential decision tasks.
What to watch
Editorial analysis: Follow peer review, code release, and reproducibility of the reported results on standard non-Markovian benchmarks. Observers should also watch for extensions that scale the approach to large function approximators and continuous control.
Scoring Rationale
This arXiv paper provides a notable theoretical extension of policy gradients to NMDPs with convergence guarantees and empirical claims, making it relevant to researchers and practitioners. It is a mid-tier research advance rather than a paradigm shift, and its one-day-old, not-yet-reviewed status trims the score marginally.