Summary of SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?, by John Yang et al.


SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

by John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press

First submitted to arXiv on: 4 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)

Autonomous software engineering systems, capable of fixing bugs and developing features, are typically evaluated with the SWE-bench framework. However, SWE-bench covers only Python repositories and presents problem statements as text, without visual elements such as images. This limitation motivates asking how existing systems perform in unrepresented software engineering domains (e.g., front-end, game development, DevOps) that use different programming languages and paradigms. To address this gap, the authors propose SWE-bench Multimodal (SWE-bench M), a framework that evaluates systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Each task instance contains at least one image in its problem statement or unit tests. The authors' analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem solving and cross-language generalization. Furthermore, they demonstrate that SWE-agent's flexible, language-agnostic design enables it to substantially outperform alternatives on SWE-bench M, resolving 12% of task instances compared to 6% for the next best system.
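
To make the benchmark's setup concrete, here is a minimal TypeScript sketch (chosen to match the benchmark's JavaScript domain) of one way a task instance could be represented and screened for the requirement that every instance include at least one image. All names in it (TaskInstance, problemStatementImages, and so on) are illustrative assumptions, not the dataset's actual schema.

    // Minimal sketch of a SWE-bench M task instance. Every field name here is
    // an assumption for illustration; the real dataset schema may differ.
    interface TaskInstance {
      instanceId: string;               // unique task identifier (format assumed)
      repo: string;                     // the JavaScript library the issue comes from
      problemStatement: string;         // issue text describing the bug
      problemStatementImages: string[]; // URLs of images embedded in the issue
      testImages: string[];             // image assets referenced by the unit tests
    }

    // SWE-bench M requires each task to be visual: at least one image in the
    // problem statement or in the unit tests.
    function isVisualTask(task: TaskInstance): boolean {
      return task.problemStatementImages.length > 0 || task.testImages.length > 0;
    }

    // Toy instance, fabricated purely for illustration.
    const example: TaskInstance = {
      instanceId: "example__diagram-lib-42",
      repo: "example/diagram-lib",
      problemStatement: "Arrowheads render upside down (see screenshot).",
      problemStatementImages: ["https://example.com/screenshot.png"],
      testImages: [],
    };

    console.log(isVisualTask(example)); // -> true

A predicate like isVisualTask simply mirrors the selection criterion described above; the benchmark's actual tooling and data format may differ.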

Low Difficulty Summary (original content by GrooveSquid.com)

Scientists are working on computer systems that can help write code and fix bugs. These systems are usually tested with a framework called SWE-bench. However, that framework uses only the Python programming language and doesn't include images or other visual elements in its problem statements. This limitation motivates researchers to explore how these systems perform in areas of software engineering that use different languages and styles. To address this, the researchers propose a new framework called SWE-bench Multimodal (SWE-bench M), which evaluates systems on their ability to fix bugs in visual, user-facing JavaScript software. The new framework includes 617 task instances collected from 17 JavaScript libraries used for purposes such as designing web interfaces and creating diagrams. Each task instance contains at least one image in its problem description or tests. The researchers found that the top-performing systems struggled with SWE-bench M, revealing limits in solving visual problems and in generalizing across programming languages.

Keywords

» Artificial intelligence  » Generalization  » Syntax