Summary of Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games, by Nathan Herr et al.
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games
by Nathan Herr, Fernando Acero, Roberta Raileanu, María Pérez-Ortiz, Zhibin Li
First submitted to arXiv on: 5 Jul 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract. |
| Medium | GrooveSquid.com (original content) | This paper investigates the strategic decision-making abilities of Large Language Models (LLMs) in complex social scenarios, using game theory to assess their performance. The authors evaluate GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B on the canonical two-player non-zero-sum games Stag Hunt and Prisoner’s Dilemma. The results show that the models exhibit systematic biases, including positional bias, payoff bias, and behavioural bias, which hurt performance when a game’s configuration misaligns with those biases (see the sketch after this table). Interestingly, even newer LLMs like GPT-4o suffer significant performance drops, and chain-of-thought (CoT) prompting reduces the biases in some models but worsens them in others. |
| Low | GrooveSquid.com (original content) | This paper looks at how well Large Language Models do when making big decisions that involve other people. It uses game theory to understand the LLMs’ choices. The authors tested GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B in two different games. They found that these models make mistakes because of built-in biases, like favouring an option based on where it appears or chasing certain payoffs, so their performance drops when the game doesn’t line up with those biases. Surprisingly, newer models don’t always do better than older ones. |
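To make the setup concrete, here is a minimal Python sketch of what such an evaluation can look like: it encodes standard textbook payoff matrices for the two games and probes positional bias by presenting the same game with the action order swapped. This is an illustration only, not the paper’s actual harness; the payoff values are conventional textbook choices, and `query_llm` is a hypothetical stand-in for a call to any of the models tested.

```python
# Payoffs as (row player, column player) for each joint action.
STAG_HUNT = {
    ("Stag", "Stag"): (4, 4), ("Stag", "Hare"): (0, 3),
    ("Hare", "Stag"): (3, 0), ("Hare", "Hare"): (3, 3),
}
PRISONERS_DILEMMA = {
    ("Cooperate", "Cooperate"): (3, 3), ("Cooperate", "Defect"): (0, 5),
    ("Defect", "Cooperate"): (5, 0), ("Defect", "Defect"): (1, 1),
}

def make_prompt(game_name, payoffs, actions):
    """Render the game as plain text; the order of `actions` in the
    instruction line is what a positional-bias probe varies."""
    lines = [f"You are playing {game_name}. Choose one action: "
             f"{actions[0]} or {actions[1]}."]
    for (mine, theirs), (p_me, p_them) in payoffs.items():
        lines.append(f"If you play {mine} and your opponent plays {theirs}, "
                     f"you score {p_me} and they score {p_them}.")
    lines.append("Reply with the action name only.")
    return "\n".join(lines)

def is_position_consistent(game_name, payoffs, actions, query_llm):
    """Ask the same game with both action orderings. A model free of
    positional bias should pick the same action either way."""
    first = query_llm(make_prompt(game_name, payoffs, list(actions)))
    second = query_llm(make_prompt(game_name, payoffs, list(reversed(actions))))
    return first.strip() == second.strip()

# Example usage, with `query_llm` as any hypothetical chat-completion wrapper:
# ok = is_position_consistent("Stag Hunt", STAG_HUNT, ("Stag", "Hare"), query_llm)
```

The same pattern extends to the other biases the summary mentions: a payoff-bias probe would rescale or relabel the payoffs while keeping the game’s structure fixed, and a behavioural-bias probe would vary the framing of the opponent’s described behaviour.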
Keywords
» Artificial Intelligence » GPT » Llama » Prompting