Summary of TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs, by Haochuan Wang et al.
TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
by Haochuan Wang, Xiachong Feng, Lei Li, Zhanyue Qin, Dianbo Sui, Lingpeng Kong
First submitted to arXiv on: 14 Oct 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computer Science and Game Theory (cs.GT)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary TMGBench is a proposed benchmark for comprehensively evaluating the strategic reasoning capabilities of large language models (LLMs). It incorporates all 144 game types from the Robinson-Goforth topology of 2×2 games, and it addresses the limitations of existing benchmarks by providing novel scenarios, flexible organization, and synthetic data generation for more diverse, higher-quality game settings. A sustainable evaluation framework is also introduced so that the benchmark can accommodate increasingly powerful LLMs. Evaluations reveal flaws in accuracy, consistency, and Theory-of-Mind (ToM) mastery among mainstream LLMs; OpenAI’s o1-mini model achieves accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, respectively. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary The researchers created a new benchmark called TMGBench to test how well big language models can reason strategically. They included all kinds of classic games and also made up some new scenarios to make the testing more diverse. The goal is to help these models get better at making decisions and understanding other people’s thoughts. |
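
To make the "144 game types" figure concrete, here is a minimal Python sketch, not taken from the paper or its code. It assumes each player strictly ranks the four outcomes of a 2×2 game from 1 (worst) to 4 (best), and that two games count as the same type when they differ only by swapping the two rows or the two columns. The function names `canonical`, `count_strict_ordinal_games`, and `pure_nash_equilibria` are hypothetical, chosen for illustration only.

```python
from itertools import permutations


def canonical(game):
    """Return one representative of a game under row and column swaps.

    A game is a tuple of four cells in the order
    (top-left, top-right, bottom-left, bottom-right); each cell is
    (row_player_rank, column_player_rank).
    """
    a, b, c, d = game
    variants = [
        (a, b, c, d),  # unchanged
        (c, d, a, b),  # rows swapped
        (b, a, d, c),  # columns swapped
        (d, c, b, a),  # rows and columns swapped
    ]
    return min(variants)


def count_strict_ordinal_games():
    """Count 2x2 games with strict rankings, up to row/column relabeling."""
    ranks = (1, 2, 3, 4)
    seen = set()
    for row_ranks in permutations(ranks):
        for col_ranks in permutations(ranks):
            game = tuple(zip(row_ranks, col_ranks))
            seen.add(canonical(game))
    return len(seen)


def pure_nash_equilibria(game):
    """Return (row, column) pairs where neither player gains by deviating alone."""
    equilibria = []
    for r in (0, 1):
        for c in (0, 1):
            row_here, col_here = game[2 * r + c]
            row_if_switch = game[2 * (1 - r) + c][0]  # row player changes action
            col_if_switch = game[2 * r + (1 - c)][1]  # column player changes action
            if row_here >= row_if_switch and col_here >= col_if_switch:
                equilibria.append((r, c))
    return equilibria


# Prisoner's Dilemma in ordinal form: action 0 = cooperate, action 1 = defect.
prisoners_dilemma = ((3, 3), (1, 4), (4, 1), (2, 2))

print(count_strict_ordinal_games())             # 576 payoff matrices collapse to 144 types
print(pure_nash_equilibria(prisoners_dilemma))  # mutual defection: [(1, 1)]
```

Of the 24 × 24 = 576 possible ranking pairs, each game type appears four times under relabeling, which gives 144. The equilibrium check shows the kind of ground-truth answer a strategic reasoning benchmark can compare a model's response against; how TMGBench itself scores answers is not described in this summary.
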
Keywords
» Artificial intelligence » Synthetic data