Summary of GameTraversalBenchmark: Evaluating Planning Abilities of Large Language Models Through Traversing 2D Game Maps, by Muhammad Umair Nasir et al.


GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

by Muhammad Umair Nasir, Steven James, Julian Togelius

First submitted to arXiv on: 10 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper investigates the planning capabilities of large language models (LLMs) by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. The goal is to evaluate whether LLMs can reach given objectives with minimal steps and generation errors. Several LLMs are tested on GTB, with GPT-4-Turbo achieving the highest score of 44.97% on GTBS, a composite score combining planning, efficiency, and accuracy. The study also evaluates large reasoning models such as o1, which scores 67.84% on GTBS, showing that the benchmark remains challenging even for current reasoning models.
Low Difficulty Summary (written by GrooveSquid.com; original content)
Large language models (LLMs) are super smart computers that can understand and generate human-like text. Recently, they’ve been great at doing this, but it’s not clear if they can also plan ahead. Think of planning like solving a puzzle or playing a game. The researchers created a special test to see how well LLMs do with planning. They called it GameTraversalBenchmark (GTB) and used 2D grid-based game maps as the puzzles. To pass the test, an LLM needs to get from one point to another in the fewest steps possible while making minimal mistakes. Some LLMs did better than others on this test, but there’s still room for improvement.
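To make the traversal task concrete, here is a minimal sketch of what scoring a model-proposed path on a 2D grid map could look like. Note that the map format, the `U`/`D`/`L`/`R` move encoding, and the error counting below are simplified assumptions for illustration; they are not the paper's actual GTBS metric.

```python
from collections import deque

def bfs_shortest_path_length(grid, start, goal):
    """Breadth-first search for the minimal number of steps from start to
    goal on a 2D grid, where '#' cells are walls. Returns None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None

def score_path(grid, start, goal, path):
    """Replay a proposed move sequence (e.g. from an LLM). Each invalid move
    (off-grid or into a wall) counts as an error and is skipped. Returns
    (reached_goal, valid_steps_taken, error_count)."""
    moves = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
    r, c = start
    errors = steps = 0
    for m in path:
        dr, dc = moves[m]
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != '#':
            r, c = nr, nc
            steps += 1
        else:
            errors += 1
    return (r, c) == goal, steps, errors

# A 3x3 map with one wall: the optimal route around it takes 4 steps.
grid = ["...",
        ".#.",
        "..."]
optimal = bfs_shortest_path_length(grid, (0, 0), (2, 2))   # 4
result = score_path(grid, (0, 0), (2, 2), "DDRR")          # (True, 4, 0)
```

Comparing the replayed step count against the BFS-optimal length is one natural way to measure efficiency, and the error count captures the "generation errors" the summaries mention.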

Keywords

» Artificial intelligence  » GPT