Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

by Chao Wen, Jacqueline Staub, Adish Singla

First submitted to arXiv on: 17 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract here.
Medium Difficulty Summary (original content by GrooveSquid.com)

This paper explores how well large language and multimodal models perform when tasked with combining specific skills such as general-purpose programming, natural language understanding, math word problem-solving, and visual question answering. The researchers curate a novel program synthesis benchmark based on the XLogoOnline visual programming environment, comprising 85 real-world tasks that require skills such as spatial planning, basic programming, and logical reasoning. Current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving success rates of only 20% and 2.35%, respectively. To improve performance, the authors develop a fine-tuning pipeline that leverages a large-scale synthetic training dataset of over 80,000 tasks. They also showcase how emulator-driven feedback can be used to design a curriculum over the training data distribution. The results demonstrate that a fine-tuned Llama3-8B outperforms both GPT-4V and Llama3-70B, and the paper provides an in-depth analysis of the models' expertise across different skill dimensions.
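To make the "emulator-driven feedback" idea more concrete, here is a minimal Python sketch of one plausible reading: candidate programs are executed in a task emulator, per-skill success rates are measured, and the next round of synthetic training data is sampled more heavily from the skills the model currently fails. Every name here (run_emulator, the skill tags, the task fields) is an illustrative assumption, not the paper's actual pipeline or API.

```python
# Illustrative sketch only: a failure-driven curriculum over synthetic
# tasks, scored by an emulator. All names and the task format are
# hypothetical, not the authors' implementation.
import random
from collections import defaultdict

SKILLS = ["basic_actions", "loops", "spatial_planning", "logical_reasoning"]

def run_emulator(program: str, task: dict) -> bool:
    """Stand-in for a real XLogoOnline-style emulator, which would execute
    `program` on the task's grid and check the goal condition. Here we
    flip a biased coin so the sketch runs end to end."""
    return random.random() > task["difficulty"]

def per_skill_success(generate, tasks):
    """Run model outputs through the emulator and record success per skill."""
    wins, totals = defaultdict(int), defaultdict(int)
    for task in tasks:
        totals[task["skill"]] += 1
        if run_emulator(generate(task), task):
            wins[task["skill"]] += 1
    return {s: wins[s] / totals[s] for s in totals}

def curriculum_weights(success_rates):
    """Upweight skills the model currently fails, so the next fine-tuning
    round samples more training tasks of those kinds."""
    raw = {s: 1.0 - r for s, r in success_rates.items()}
    total = sum(raw.values()) or 1.0
    return {s: w / total for s, w in raw.items()}

if __name__ == "__main__":
    synthetic_tasks = [
        {"skill": random.choice(SKILLS), "difficulty": random.random()}
        for _ in range(2000)
    ]
    # A dummy "model" that always proposes the same Logo-style program.
    dummy_model = lambda task: "repeat 4 [ forward 100 right 90 ]"
    rates = per_skill_success(dummy_model, synthetic_tasks)
    print("success rates:", rates)
    print("next-round sampling weights:", curriculum_weights(rates))
```

The point the sketch tries to capture is that an emulator provides cheap, automatic correctness labels, so the training data distribution can be reshaped toward weak skills without any human grading.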
Low Difficulty Summary (original content by GrooveSquid.com)

Program synthesis is like solving puzzles! Researchers are trying to figure out how well big language models can do this. They made a special test with 85 tasks that need different skills, like planning and programming. Current top models didn't do well, getting only 20% or 2.35% of the tasks correct. To help them improve, the authors created a way to train the models on many examples and gave them feedback from a program emulator. This made one model, Llama3-8B, much better at solving these puzzles.

Keywords

» Artificial intelligence  » Fine-tuning  » GPT  » Language understanding  » Question answering