Summary of Plancraft: An Evaluation Dataset for Planning with LLM Agents, by Gautier Dagan et al.
Plancraft: an evaluation dataset for planning with LLM agents
by Gautier Dagan, Frank Keller, Alex Lascarides
First submitted to arXiv on: 30 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract (see the arXiv listing). |
Medium | GrooveSquid.com (original content) | This paper presents Plancraft, a multi-modal evaluation dataset for Large Language Model (LLM) agents. It offers both text-only and multi-modal interfaces based on the Minecraft crafting GUI. Plancraft evaluates tool use, Retrieval Augmented Generation (RAG), and decision-making; the decision-making challenge requires agents to decide whether a task is solvable at all, not only to complete it. Benchmarks of open-source and closed-source LLMs and strategies show that they struggle with the planning problems Plancraft introduces, which points to avenues for improvement. |
Low | GrooveSquid.com (original content) | This paper creates a dataset called Plancraft that tests how well computer programs can make decisions and solve problems. The dataset uses the popular video game Minecraft to set up a realistic challenge. One goal is to see whether these programs can recognize when a problem cannot be solved, instead of trying everything. So far, the results show that these programs are not very good at this and need to improve. |
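
To make the solvability idea from the summaries concrete, here is a minimal Python sketch of how one might score an agent on a mix of solvable and unsolvable tasks. It is an illustration only and not Plancraft's actual code or metric: the `Episode` fields, `is_success`, and `success_rate` are hypothetical names, and the scoring rule (giving up counts as correct only when the task truly cannot be solved) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One hypothetical evaluation episode (all field names are assumptions)."""
    solvable: bool       # ground truth: can the target item be crafted at all?
    agent_gave_up: bool  # the agent declared the task impossible
    agent_crafted: bool  # the agent produced the target item


def is_success(ep: Episode) -> bool:
    """Assumed rule: craft the item if solvable, otherwise declare it impossible."""
    if ep.solvable:
        return ep.agent_crafted and not ep.agent_gave_up
    return ep.agent_gave_up


def success_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes the agent handled correctly (0.0 for an empty list)."""
    if not episodes:
        return 0.0
    return sum(is_success(ep) for ep in episodes) / len(episodes)


episodes = [
    Episode(solvable=True, agent_gave_up=False, agent_crafted=True),    # solved
    Episode(solvable=True, agent_gave_up=True, agent_crafted=False),    # gave up too early
    Episode(solvable=False, agent_gave_up=True, agent_crafted=False),   # correctly gave up
    Episode(solvable=False, agent_gave_up=False, agent_crafted=False),  # kept trying in vain
]
print(success_rate(episodes))
```

Under this assumed rule, an agent that always keeps trying fails every unsolvable task, and an agent that always gives up fails every solvable one, so a good score requires actually judging solvability.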
Keywords
» Artificial intelligence » Large language model » Multi-modal » RAG » Retrieval augmented generation