Summary of On the Worst Prompt Performance of Large Language Models, by Bowen Cao et al.
On the Worst Prompt Performance of Large Language Models
by Bowen Cao, Deng Cai, Zhisong Zhang, Yuexian Zou, Wai Lam
First submitted to arXiv on: 8 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract (available on arXiv). |
| Medium | GrooveSquid.com (original content) | The paper investigates the sensitivity of large language models (LLMs) to prompt phrasing, which raises concerns about their reliability in real-world use. It introduces RobustAlpacaEval, a new benchmark built from semantically equivalent paraphrases of each query, and argues that worst prompt performance should be used to gauge the lower bound of a model’s real-world performance (see the illustrative sketch after the table). Experiments reveal substantial variability across paraphrases of the same query: Llama-2-70B-chat, for example, shows a 45.48% gap between its best and worst prompt performance, with the worst dipping as low as 9.38%. The authors also explore prompt engineering and prompt consistency methods to lift worst-case performance but find they have limited impact. |
| Low | GrooveSquid.com (original content) | The paper looks at how well large language models answer the same question when it is worded in different ways. The researchers found that models are very sensitive to the wording of a question, which makes them less reliable for real-world use. To study this, they built a new test called RobustAlpacaEval, which contains many differently worded versions of the same question and checks how well a model does on all of them. They found that the same model can do much better or much worse depending only on the wording: one model scored as low as 9.38% on its worst-worded questions, far below what it scored on the best-worded ones. |
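
To make the benchmark’s core idea concrete, here is a minimal sketch (not the authors’ code) of how worst-, best-, and average-case performance could be aggregated when each benchmark case comes with several semantically equivalent paraphrases. The data layout, score values, and scoring scale below are assumptions made purely for illustration.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumes each benchmark "case" has several semantically equivalent
# paraphrases, each with a per-prompt score in [0, 100] (e.g., a win rate).

from statistics import mean

# Hypothetical per-paraphrase scores for one model: {case_id: [scores]}
scores = {
    "case_001": [62.0, 41.0, 9.0, 55.0],
    "case_002": [70.0, 68.0, 33.0, 60.0],
}

def aggregate(case_scores):
    """Return worst, best, and average performance across paraphrases.

    The worst-case number is the lower bound the paper argues matters
    most for judging real-world reliability.
    """
    worst = mean(min(s) for s in case_scores.values())
    best = mean(max(s) for s in case_scores.values())
    avg = mean(mean(s) for s in case_scores.values())
    return worst, best, avg

worst, best, avg = aggregate(scores)
print(f"worst={worst:.2f}  best={best:.2f}  avg={avg:.2f}  gap={best - worst:.2f}")
```

The gap between the best and worst aggregates corresponds to the kind of spread the paper reports (e.g., 45.48% for Llama-2-70B-chat), which is why the worst-case number is proposed as the lower bound to track.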
Keywords
» Artificial intelligence » Llama » Prompt