
Summary of ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback, by Ju-Seung Byun et al.


ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback

by Ju-Seung Byun, Jiyun Chun, Jihyung Kil, Andrew Perrault

First submitted to arXiv on: 25 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large Multimodal Models (LMMs) excel at comprehending human instructions and achieve remarkable results across a wide range of tasks. Reinforcement Learning from Human Feedback (RLHF) and from AI Feedback (RLAIF) further refine these models by aligning them with specific preferences, but they primarily rely on ranking-based feedback over entire generations. The proposed two-stage algorithm, ARES, Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT). First, it requests sentence-level feedback from a Teacher model, which scores how much each sentence in a Chain-of-Thought (CoT) contributes to solving the problem, providing granular rewards for RL. Second, it asks the Teacher for correction feedback after the RL stage and uses it to stabilize the fine-tuned model through SFT. Experiments on ScienceQA and A-OKVQA demonstrate ARES's effectiveness: it achieves a 70% win rate against baseline models as judged by GPT-4o and increases average inference answer accuracy by 2.5%.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large Multimodal Models (LMMs) are super smart at understanding human instructions and do really well on lots of tasks. This paper helps LMMs get even better by asking a stronger AI model, called the Teacher, for feedback. The new method works in two steps: first, it asks the Teacher to rate how helpful each sentence is in solving a problem and uses those ratings as rewards for reinforcement learning; then it asks the Teacher to fix any wrong reasoning and trains on those corrections to keep the model stable. The team tested this idea on two question-answering datasets and found it works really well, winning against baseline models about 70% of the time!
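
The alternating procedure described in the summaries above can be sketched in code. The Python sketch below is illustrative only: the function names (generate_cot, teacher_score_sentences, teacher_correct_rationale, rl_update, sft_update) and all their bodies are placeholder assumptions, not the authors' implementation; the sketch only shows how sentence-level Teacher rewards drive an RL stage that alternates with an SFT stage on Teacher-corrected rationales.

```python
# Minimal sketch of an ARES-style alternating loop (hypothetical placeholders throughout).
import random
from typing import List


def generate_cot(model, question: str) -> List[str]:
    """Placeholder: the policy model produces a chain-of-thought as a list of sentences."""
    return [f"step {i} for: {question}" for i in range(3)]


def teacher_score_sentences(cot: List[str]) -> List[float]:
    """Placeholder Teacher call: score how much each sentence contributes to solving
    the problem, yielding sentence-level (granular) rewards for the RL stage."""
    return [random.uniform(0.0, 1.0) for _ in cot]


def teacher_correct_rationale(cot: List[str]) -> List[str]:
    """Placeholder Teacher call: return a corrected rationale used as an SFT target."""
    return [s.replace("step", "corrected step") for s in cot]


def rl_update(model, cot: List[str], rewards: List[float]) -> None:
    """Placeholder RL step (e.g., a policy-gradient update weighted by sentence rewards)."""
    pass


def sft_update(model, question: str, target_cot: List[str]) -> None:
    """Placeholder supervised fine-tuning step on the Teacher-corrected rationale."""
    pass


def ares_round(model, questions: List[str]) -> None:
    # Stage 1: RL with sentence-level rewards from the Teacher.
    for q in questions:
        cot = generate_cot(model, q)
        rewards = teacher_score_sentences(cot)
        rl_update(model, cot, rewards)

    # Stage 2: SFT on Teacher-corrected rationales to stabilize the RL-tuned model.
    for q in questions:
        cot = generate_cot(model, q)
        corrected = teacher_correct_rationale(cot)
        sft_update(model, q, corrected)


if __name__ == "__main__":
    # The two stages alternate across training rounds.
    dummy_model = object()
    for _ in range(2):
        ares_round(dummy_model, ["What causes seasons on Earth?"])
```

Per the summary above, the second stage (SFT on Teacher corrections) is what stabilizes the RL-fine-tuned model; the sketch mirrors only that control flow, not the actual training objectives or datasets.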

Keywords

» Artificial intelligence  » Fine tuning  » Gpt  » Inference  » Reinforcement learning  » Reinforcement learning from human feedback  » Rlhf  » Supervised