Summary of “On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes”, by Navdeep Kumar et al.


On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes

by Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Y. Levy, R. Srikant, Shie Mannor

First submitted to arXiv on: 11 Mar 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Systems and Control (eess.SY)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper presents a finite-time global convergence analysis of policy gradient in infinite-horizon average-reward Markov decision processes (MDPs), focusing on ergodic tabular MDPs with finite state and action spaces. The analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of O(1/T), which translates to O(log(T)) regret, where T is the number of iterations. Unlike prior finite-time work on discounted reward MDPs, whose bounds grow in proportion to the fifth power of the effective horizon, this work provides performance guarantees directly for the average-reward setting. The primary contribution is proving that the policy gradient algorithm converges for average-reward MDPs and obtaining finite-time performance guarantees. In addition, the paper reexamines and improves existing performance bounds for discounted reward MDPs. Simulations are presented to empirically evaluate the performance of the average-reward policy gradient algorithm.
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper is about a new way to understand how a computer program called policy gradient learns to make decisions in complex situations like games or transportation systems. The researchers wanted to know whether this program keeps getting better over time and how quickly it improves. They proved that it does eventually find the best way to act, and they worked out how fast it gets there. This matters because many real-life systems use similar ideas.
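To make the setting concrete, here is a minimal sketch of the kind of algorithm the paper studies: tabular policy gradient with a softmax policy, maximizing the long-run average reward. The 2-state, 2-action ergodic MDP below is a made-up example (not from the paper), and the gradient is computed by finite differences rather than the paper's exact update; it only illustrates the average-reward objective improving under gradient ascent.

```python
import numpy as np

# Hypothetical tiny ergodic MDP (illustrative only, not from the paper).
# P[s, a, s'] = transition probability; every entry is positive, so the
# chain is ergodic under any policy.
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # from state 0: action 0 tends to stay, action 1 tends to move
    [[0.1, 0.9], [0.9, 0.1]],   # from state 1: action 0 tends to stay, action 1 tends to move
])
R = np.array([
    [0.0, 0.0],                 # no reward in state 0
    [1.0, 1.0],                 # reward 1 in state 1, regardless of action
])

def average_reward(theta):
    """Exact long-run average reward of the softmax policy with logits theta."""
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    P_pi = np.einsum('sa,sat->st', pi, P)          # state-to-state kernel under pi
    # Stationary distribution = left Perron eigenvector of P_pi.
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])
    mu = mu / mu.sum()
    return float(mu @ (pi * R).sum(axis=1))

# Plain gradient ascent on the logits, using central finite differences
# (the paper analyzes the exact policy gradient update instead).
theta = np.zeros((2, 2))
lr, eps = 0.5, 1e-5
for _ in range(300):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        grad[idx] = (average_reward(theta + d) - average_reward(theta - d)) / (2 * eps)
    theta += lr * grad
```

The uniform policy earns an average reward of 0.5 on this example, while the best policy (move to state 1, then stay) earns 0.9; running the loop drives the objective from the former toward the latter, mirroring the global convergence behavior the paper proves for the exact algorithm.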

Keywords

* Artificial intelligence