Summary of Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions, by Kai Xu et al.
Improving Reward-Conditioned Policies for Multi-Armed Bandits using Normalized Weight Functions
by Kai Xu, Farid Tajaddodianfar, Ben Allison
First submitted to arXiv on: 16 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper at different levels of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Read whichever version suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract, available on its arXiv page. |
Medium | GrooveSquid.com (original content) | The recently introduced Reward-Conditioned Policies (RCPs) offer an attractive alternative in reinforcement learning: by leveraging supervised learning for policy updates, they are simpler to train than policy-gradient methods. However, they struggle to keep pace with classic approaches such as Upper Confidence Bound and Thompson Sampling on Multi-Armed Bandit (MAB) problems. To overcome this limitation, the authors propose Generalized Marginalization, which constructs the final policy as a weighted sum (or integral) of reward-conditioned policies, using a weight function that is normalized to sum (or integrate) to 1 but may assign negative weights to policies conditioned on low rewards, making the resulting policy more distinct from low-reward behavior (a toy sketch of this weighting idea follows the table). The authors explore strategies for applying Generalized Marginalization to MABs with discrete actions and demonstrate its effectiveness in simulations, achieving superior performance on challenging MABs with large action spaces and sparse reward signals. |
Low | GrooveSquid.com (original content) | RCPs are a new way to learn in reinforcement learning. They’re simpler than some other methods because they use supervised learning. However, on certain problems called Multi-Armed Bandits, RCPs aren’t as good as older approaches like Upper Confidence Bound and Thompson Sampling. To make RCPs better, the authors came up with a technique called Generalized Marginalization, which mixes policies trained on different rewards and can even subtract the ones trained on low rewards. They tested the idea on some tricky bandit problems and found that it makes RCPs work as well as, or better than, the older methods. |
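
To make the weighting idea above concrete, here is a minimal, illustrative Python sketch. It is not taken from the paper: the tabular setup, the function name `marginalized_policy`, the example weights, and the clip-and-renormalize projection back onto the probability simplex are all assumptions made for illustration. It only shows how a policy could be formed as a normalized weighted sum of reward-conditioned policies, with a negative weight on the policy conditioned on low reward.

```python
import numpy as np

def marginalized_policy(cond_policies, weights):
    """Combine reward-conditioned policies pi(a | r) into one policy via a
    normalized weighted sum: pi(a) = sum_r w(r) * pi(a | r).

    cond_policies: array (n_reward_levels, n_actions); each row is a
        distribution over actions conditioned on one reward level.
    weights: array (n_reward_levels,) summing to 1; entries for low-reward
        levels may be negative, pushing the combined policy away from
        actions favored under low rewards.
    """
    assert np.isclose(weights.sum(), 1.0), "weights must be normalized"
    combined = weights @ cond_policies  # may contain negative entries
    # Illustrative projection back to a valid distribution: clip negatives
    # to zero and renormalize (the paper's exact scheme may differ).
    combined = np.clip(combined, 0.0, None)
    return combined / combined.sum()

# Toy 3-armed bandit with hypothetical reward-conditioned policies.
pi_low = np.array([0.6, 0.3, 0.1])    # policy conditioned on a low reward
pi_high = np.array([0.1, 0.3, 0.6])   # policy conditioned on a high reward
stacked = np.stack([pi_low, pi_high])

# Plain marginalization: non-negative weights give a blend of both policies.
print(marginalized_policy(stacked, np.array([0.5, 0.5])))   # [0.35 0.3  0.35]

# Generalized marginalization: a negative weight on the low-reward policy
# makes the result more distinct from it.
print(marginalized_policy(stacked, np.array([-0.5, 1.5])))  # ~[0.   0.26 0.74]
```

In this toy example, the negative weight shifts probability mass toward the arm favored under high reward, which is the intuition behind the summary’s claim that the resulting policies become more distinct from low-reward behavior.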
Keywords
» Artificial intelligence » Reinforcement learning » Supervised learning