Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

by Chao Wen, Jacqueline Staub, Adish Singla

First submitted to arXiv on: 17 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract here.
Medium Difficulty Summary (original content by GrooveSquid.com)

This paper explores how well large language and multimodal models perform when tasked with combining specific skills such as general-purpose programming, natural language understanding, math word problem-solving, and visual question answering. The researchers curate a novel program synthesis benchmark based on the XLogoOnline visual programming environment, comprising 85 real-world tasks that require skills such as spatial planning, basic programming, and logical reasoning. Current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving success rates of only 20% and 2.35%, respectively. To improve performance, the authors develop a fine-tuning pipeline that leverages a large-scale synthetic training dataset of over 80,000 tasks. They also showcase how emulator-driven feedback can be used to design a curriculum over the training data distribution. The results demonstrate that a fine-tuned Llama3-8B outperforms both GPT-4V and Llama3-70B, and the paper provides an in-depth analysis of the models' expertise across different skill dimensions.
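To make the "emulator-driven feedback" idea more concrete, here is a minimal Python sketch of one plausible reading: candidate programs are executed in a task emulator, per-skill success rates are measured, and the next round of synthetic training data is sampled more heavily from the skills the model currently fails. Every name here (run_emulator, the skill tags, the task fields) is an illustrative assumption, not the paper's actual pipeline or API.

```python
# Illustrative sketch only: a failure-driven curriculum over synthetic
# tasks, scored by an emulator. All names and the task format are
# hypothetical, not the authors' implementation.
import random
from collections import defaultdict

SKILLS = ["basic_actions", "loops", "spatial_planning", "logical_reasoning"]

def run_emulator(program: str, task: dict) -> bool:
    """Stand-in for a real XLogoOnline-style emulator, which would execute
    `program` on the task's grid and check the goal condition. Here we
    flip a biased coin so the sketch runs end to end."""
    return random.random() > task["difficulty"]

def per_skill_success(generate, tasks):
    """Run model outputs through the emulator and record success per skill."""
    wins, totals = defaultdict(int), defaultdict(int)
    for task in tasks:
        totals[task["skill"]] += 1
        if run_emulator(generate(task), task):
            wins[task["skill"]] += 1
    return {s: wins[s] / totals[s] for s in totals}

def curriculum_weights(success_rates):
    """Upweight skills the model currently fails, so the next fine-tuning
    round samples more training tasks of those kinds."""
    raw = {s: 1.0 - r for s, r in success_rates.items()}
    total = sum(raw.values()) or 1.0
    return {s: w / total for s, w in raw.items()}

if __name__ == "__main__":
    synthetic_tasks = [
        {"skill": random.choice(SKILLS), "difficulty": random.random()}
        for _ in range(2000)
    ]
    # A dummy "model" that always proposes the same Logo-style program.
    dummy_model = lambda task: "repeat 4 [ forward 100 right 90 ]"
    rates = per_skill_success(dummy_model, synthetic_tasks)
    print("success rates:", rates)
    print("next-round sampling weights:", curriculum_weights(rates))
```

The point the sketch tries to capture is that an emulator provides cheap, automatic correctness labels, so the training data distribution can be reshaped toward weak skills without any human grading.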
Low Difficulty Summary (original content by GrooveSquid.com)

Program synthesis is like solving puzzles! Researchers are trying to figure out how well big language models can do this. They made a special test with 85 tasks that need different skills, like planning and programming. Current top models didn't do well, getting only 20% or 2.35% of the tasks correct. To help them improve, the authors created a way to train the models on many examples and gave them feedback from a program emulator. This made one model, Llama3-8B, much better at solving these puzzles.

Keywords

» Artificial intelligence  » Fine-tuning  » GPT  » Language understanding  » Question answering