Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

by Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach

First submitted to arXiv on: 3 Jul 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
Recent work has explored using language models for planning problems, specifically by translating natural language descriptions of planning tasks into structured planning languages such as PDDL (Planning Domain Definition Language). However, existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To address this gap, the authors introduce Planetarium, a benchmark designed to evaluate language models’ ability to generate PDDL code from natural language descriptions of planning tasks. Planetarium features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. The authors also evaluate several API-access and open-weight language models, revealing the difficulty of the task: for example, 96.1% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable and 94.4% are solvable, but only 24.8% are semantically correct. This gap underscores the need for a rigorous benchmark for this problem.
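
The parseable / solvable / semantically correct breakdown implies a three-stage evaluation pipeline. The Python sketch below illustrates that pipeline under stated assumptions: the toy s-expression parser stands in for a real PDDL parser, the blocksworld-style problem string is invented for the example, and the solvability and equivalence checks are hypothetical placeholders, not Planetarium’s actual API.

# Minimal sketch of the three evaluation stages implied by the metrics.
# Everything here is illustrative, not the Planetarium implementation.

def parse_sexpr(text: str):
    """Parse one s-expression (PDDL is s-expression based); ValueError on bad syntax."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int):
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        if tok == "(":
            items, pos = [], pos + 1
            while pos < len(tokens) and tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            if pos >= len(tokens):
                raise ValueError("unbalanced parentheses")
            return items, pos + 1  # skip the closing ")"
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok, pos + 1

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree

def is_parseable(pddl: str) -> bool:
    """Stage 1: does the generated text parse at all?"""
    try:
        parse_sexpr(pddl)
        return True
    except ValueError:
        return False

def is_solvable(pddl: str) -> bool:
    """Stage 2 (placeholder): a real check would hand the problem, together
    with its domain, to an off-the-shelf planner and ask for any plan."""
    raise NotImplementedError("requires an external planner")

def is_semantically_correct(generated: str, ground_truth: str) -> bool:
    """Stage 3 (placeholder): Planetarium's equivalence algorithm decides
    whether two PDDL problems describe the same task even when written
    differently; plain string equality would be far too strict."""
    raise NotImplementedError("requires the equivalence algorithm")

# Illustrative blocksworld-style problem (predicate names are made up;
# a real run would pair it with a matching domain file).
problem = """
(define (problem stack-two)
  (:domain blocksworld)
  (:objects b1 b2)
  (:init (on-table b1) (on-table b2) (clear b1) (clear b2) (arm-empty))
  (:goal (on b1 b2)))
"""
print(is_parseable(problem))                      # True
print(is_parseable("(define (problem broken)"))   # False: unbalanced parens

The ordering of the stages matters: a problem that fails to parse cannot be handed to a planner, and a problem that is solvable may still describe the wrong task. In the paper’s evaluation, GPT-4o passes the first two stages far more often than the third, which is why semantic equivalence, rather than parseability or solvability alone, is the benchmark’s headline metric.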
Low Difficulty Summary (written by GrooveSquid.com; original content)
Researchers have been using language models to help with planning, and they’ve found a way to make it work! They take natural language descriptions of tasks and turn them into a special planning language called PDDL. But there was a big problem: the existing ways of testing how good these translations are weren’t very reliable. So the researchers created something new called Planetarium. It’s like a big toolbox that helps figure out whether a translated plan is actually correct. They used it with lots of text-and-planning pairs to see how well different language models did. One model, GPT-4o, was really good at making plans that looked right on paper (96.1%), but only 24.8% of them actually meant the right thing.

Keywords

» Artificial intelligence  » GPT