WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

by Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, Bertie Vidgen

First submitted to arXiv on: 1 May 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each of the summaries below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper introduces WorkBench, a benchmark dataset for evaluating agents’ ability to execute tasks in a workplace setting. The dataset comprises a sandbox environment with five databases, 26 tools, and 690 tasks representing common business activities such as sending emails and scheduling meetings. The tasks require planning, tool selection, and often multiple actions, making them challenging to complete successfully. Because the correct outcome for each task is unique and unambiguous, evaluation can be robust and fully automated; the authors call this key contribution outcome-centric evaluation. Evaluating five existing ReAct agents on WorkBench, they find task success rates ranging from just 3% (Llama2-70B) to 43% for the best performer (GPT-4). They also find that agents’ errors can result in incorrect actions, such as sending an email to the wrong person. WorkBench reveals weaknesses in agents’ ability to undertake common business activities, raising questions about their use in high-stakes workplace settings.
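
The outcome-centric evaluation described above lends itself to a short illustration. The Python sketch below shows the general idea under stated assumptions: the agent runs against a fresh copy of the sandbox databases, and success is judged solely by whether the final database state equals the one unique correct outcome. All names here (evaluate_task, toy_agent, the email/calendar schema) are illustrative assumptions, not WorkBench’s actual API or data model.

```python
import copy
from typing import Callable

def evaluate_task(task: dict, agent: Callable[[str, dict], None],
                  databases: dict) -> bool:
    """Outcome-centric check: run the agent in a fresh sandbox copy and
    require the final database state to match the unique expected state."""
    sandbox = copy.deepcopy(databases)   # every task starts from the same state
    agent(task["instruction"], sandbox)  # the agent mutates the sandbox via tools
    # Exact-match comparison: no partial credit for plausible-looking
    # action traces that leave the databases in the wrong state.
    return sandbox == task["expected_state"]

# Toy agent and task to show the idea end to end (all names hypothetical).
def toy_agent(instruction: str, dbs: dict) -> None:
    # A real ReAct agent would plan and select tools; this one hardcodes
    # a single "send email" effect purely for illustration.
    dbs["email"].append({"to": "sam@example.com", "subject": "Friday meeting"})

databases = {"email": [], "calendar": []}
task = {
    "instruction": "Email sam@example.com to confirm Friday's meeting.",
    "expected_state": {
        "email": [{"to": "sam@example.com", "subject": "Friday meeting"}],
        "calendar": [],
    },
}
print(evaluate_task(task, toy_agent, databases))  # True only on an exact match
```

Judging final outcomes rather than action traces is what makes errors such as emailing the wrong person automatically detectable: the resulting database state simply differs from the expected one.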
Low Difficulty Summary (original content by GrooveSquid.com)
This paper introduces a new way to test how well computer programs can do everyday work tasks. The benchmark is called WorkBench, and it has lots of different scenarios that need to be completed, like sending emails or scheduling meetings. These tasks are hard because they require planning and choosing the right tools. The authors tested five existing programs on WorkBench and found that they didn’t do very well: the weakest completed only 3% of the tasks correctly, and even the best completed fewer than half. They also found that when the programs made mistakes, there could be serious consequences, like sending an email to the wrong person. This shows that we need to be careful about using these programs in important situations.

Keywords

» Artificial intelligence  » GPT