Summary of WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?, by Alexandre Drouin et al.


WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

by Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, Alexandre Lacoste

First submitted to arXiv on: 12 Mar 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed WorkArena benchmark assesses large language model-based agents’ ability to perform tasks that knowledge workers typically encounter when using enterprise software systems. The study introduces BrowserGym, an environment for designing and evaluating such agents, featuring a rich set of actions and multimodal observations. While current agents show promise on WorkArena, there remains a significant gap towards achieving full task automation. Notably, the analysis reveals a performance disparity between open- and closed-source LLMs, highlighting a critical area for future exploration.
Low Difficulty Summary (written by GrooveSquid.com, original content)
We’re studying how well computer programs can interact with software using web browsers. We want to know if these programs can help people do their jobs more efficiently. To test this, we created a special test called WorkArena that has 33 tasks based on real-life work scenarios. We also made BrowserGym, which is like a big playground for these computer programs to practice and learn from. Our results show that these programs are good at some things, but still need improvement to do all the tasks perfectly. What’s interesting is that different types of language models perform differently, so we should focus on making them better.
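The "playground" described above follows the familiar observe-act loop of gym-style environments: the environment hands the agent an observation of the current web page, the agent picks an action (e.g., a click), and the environment returns the next observation plus a task-completion signal. The sketch below illustrates that loop with a mock environment; all class and field names here are illustrative assumptions, not the actual BrowserGym API.

```python
# Minimal sketch of a gym-style observe-act loop for a web agent.
# MockWebEnv stands in for a real browser environment; names are hypothetical.

class MockWebEnv:
    """Returns page observations and accepts high-level actions.
    The task 'completes' after a fixed number of steps, yielding a
    sparse task-completion reward, as is typical for such benchmarks."""

    def __init__(self, steps_to_goal=3):
        self.steps_to_goal = steps_to_goal
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        return {"dom": "<button id='submit'>Submit</button>",
                "goal": "submit the form"}

    def step(self, action):
        self.steps_taken += 1
        done = self.steps_taken >= self.steps_to_goal
        reward = 1.0 if done else 0.0  # reward only on task completion
        obs = {"dom": "<p>page after action</p>", "goal": "submit the form"}
        return obs, reward, done


def trivial_agent(obs):
    # A real agent would prompt an LLM with the observation (DOM,
    # screenshot, goal); this stand-in always clicks one element.
    return {"type": "click", "selector": "#submit"}


def run_episode(env, agent, max_steps=10):
    """Run one observe-act episode and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


if __name__ == "__main__":
    print(run_episode(MockWebEnv(), trivial_agent))  # 1.0 on completion
```

The sparse reward (success only at task completion) is what makes these benchmarks hard to score partially: an agent either finishes the knowledge-work task or it does not.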

Keywords

  • Artificial intelligence
  • Large language model