Summary of ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback, by Ju-Seung Byun et al.
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
by Ju-Seung Byun, Jiyun Chun, Jihyung Kil, Andrew Perrault
First submitted to arXiv on: 25 Jun 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Large Multimodal Models (LMMs) excel at comprehending human instructions and achieve remarkable results across various tasks. Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF) refine such models by aligning them with specific preferences, primarily using ranking-based feedback for entire generations. The proposed two-stage algorithm, ARES, alternates Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, it requests sentence-level feedback from the Teacher, which scores how much each sentence of the Chain-of-Thought (CoT) contributes to solving the problem, providing granular rewards. Second, it asks the Teacher for correction feedback after the RL stage and stabilizes the fine-tuned model through SFT (a code sketch of this loop follows the table). Experiments on ScienceQA and A-OKVQA demonstrate ARES's effectiveness: it achieves a 70% win rate against baseline models as judged by GPT-4o and increases average inference answer accuracy by 2.5%. |
Low | GrooveSquid.com (original content) | Large Multimodal Models (LMMs) are super smart at understanding human instructions and do really well on lots of tasks. This paper helps LMMs get even better by asking for feedback from people or other AI models. The new way works in two steps: first, it asks the Teacher to rate how helpful each sentence is in solving a problem, then it asks the Teacher to fix any mistakes and trains the model on those corrections. The team tested this idea and found that the improved models beat the baselines about 70% of the time and answered questions a bit more accurately. |
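
The medium-difficulty summary describes ARES as a loop that alternates sentence-level RL with correction-driven SFT. The Python sketch below restates only that control flow under stated assumptions: it is not the authors' code, and every function name (generate_cot, score_sentences, rl_update, correct_cot, sft_update) is a hypothetical placeholder for the trainee model, the Teacher, and the two update steps.

```python
"""Illustrative sketch of the ARES loop described in the summary above.

NOT the authors' implementation: all callables below are hypothetical
placeholders supplied by the caller (trainee LMM, Teacher model, optimizers).
"""

from typing import Callable, List


def ares_round(
    prompt: str,
    generate_cot: Callable[[str], List[str]],             # trainee: prompt -> CoT sentences
    score_sentences: Callable[[List[str]], List[float]],  # Teacher: per-sentence rewards
    rl_update: Callable[[List[str], List[float]], None],  # RL step using granular rewards
    correct_cot: Callable[[List[str]], List[str]],        # Teacher: corrected reasoning
    sft_update: Callable[[str, List[str]], None],         # SFT step on the corrected CoT
) -> None:
    """One ARES iteration: RL on sentence-level rewards, then stabilizing SFT."""
    # Stage 1 (RL): the Teacher scores how much each Chain-of-Thought sentence
    # contributes to solving the problem, giving a granular reward signal.
    sentences = generate_cot(prompt)
    rewards = score_sentences(sentences)
    rl_update(sentences, rewards)

    # Stage 2 (SFT): after the RL step, the Teacher corrects the reasoning and
    # the model is fine-tuned on the corrected chain to stabilize training.
    corrected = correct_cot(generate_cot(prompt))
    sft_update(prompt, corrected)
```

The sketch only fixes the order of the calls (sentence-level rewards, RL update, correction feedback, SFT update) as described in the summary; how the RL and SFT updates themselves are implemented, and which model plays the Teacher, are left to the paper.
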
Keywords
» Artificial intelligence » Fine tuning » Gpt » Inference » Reinforcement learning » Reinforcement learning from human feedback » Rlhf » Supervised