Summary of Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages, by Max Zuo et al.
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
by Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach
First submitted to arXiv on: 3 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract on arXiv.
Medium | GrooveSquid.com (original content) | Recent work has explored using language models for planning problems, specifically by translating natural language descriptions of planning tasks into structured planning languages such as PDDL (Planning Domain Definition Language). However, existing evaluation methods struggle to ensure semantic correctness and rely on simplistic or unrealistic datasets. To address this gap, the authors introduce Planetarium, a benchmark designed to evaluate language models’ ability to generate PDDL code from natural language descriptions of planning tasks. Planetarium features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. The authors also evaluate several API-access and open-weight language models, revealing the complexity of the task: for example, GPT-4o generates PDDL problem descriptions that are usually syntactically parseable (96.1%) and solvable (94.4%), but semantically correct only 24.8% of the time (see the illustrative PDDL sketch after this table). This highlights the need for a more rigorous benchmark for this problem.
Low | GrooveSquid.com (original content) | Planning experts have been using language models to solve problems, and they’ve found a way to make it work! They take natural language descriptions of tasks and turn them into a special planning language called PDDL. But there was a big problem: the existing ways of testing whether these translations were any good were not very reliable. So the researchers created something new called Planetarium. It’s like a big toolbox that helps figure out whether a translated plan is correct or not. They used it on lots of text and planning combinations to see how well different language models did. One model, GPT-4o, was really good at making plans that looked right on paper (96.1%), but only 24.8% of them actually described the task correctly.
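
To make the translation task concrete, here is a minimal illustrative sketch of what a text-to-PDDL pair might look like, using the classic Blocksworld domain. The description, problem name, and exact predicate names (on-table, arm-empty, clear) are assumptions chosen for illustration, not taken from the Planetarium dataset.

```pddl
;; Hypothetical natural-language description (not from the paper's dataset):
;;   "Blocks a and b are on the table, and the arm is empty.
;;    Stack block a on top of block b."
(define (problem stack-a-on-b)
  (:domain blocksworld)
  (:objects a b)
  (:init (arm-empty)
         (on-table a) (on-table b)
         (clear a) (clear b))
  ;; Goal that matches the text:
  (:goal (on a b)))

;; A model that instead emits (:goal (on b a)) still produces a problem that
;; parses and is solvable, but it no longer matches the description. This is
;; the gap between "solvable" (94.4%) and "semantically correct" (24.8%)
;; reported for GPT-4o above.
```

Since many superficially different PDDL problems encode the same task (reordered goal conjuncts, renamed objects, and so on), an exact string match against a reference problem would be too strict; this is why, as the medium summary notes, the benchmark’s equivalence algorithm evaluates the correctness of generated PDDL flexibly.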
Keywords
» Artificial intelligence » GPT