Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
by Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
First submitted to arXiv on: 12 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here. |
Medium | GrooveSquid.com (original content) | In this paper, the researchers investigate the limitations of the standard evaluation protocol for large language models (LLMs) on multiple-choice question (MCQ) tasks. They show that the common approach, ranking candidate answers by their first-token log probabilities, is not robust to changes in MCQ phrasing and often disagrees with the text answers that instruction-tuned models actually produce. As an alternative, they evaluate the models' text output directly. The results show that text answers are more robust to perturbations than first-token probabilities whenever the two disagree, and that this robustness gap widens as the mismatch rate grows. Text answers also remain more robust even when first-token probabilities are corrected with state-of-the-art debiasing methods such as PriDe, which highlights the benefits of text-answer evaluation over first-token probability evaluation (see the sketch after the table). |
Low | GrooveSquid.com (original content) | This paper looks at how we can better test large language models' ability to answer multiple-choice questions. Right now, people usually score a model by checking how likely each answer option is to come next. But this method isn't very reliable, because it often doesn't match what the model actually says. The researchers checked whether reading the model's written answer instead would be a better test. They found that when the written answer differs from the option the probabilities rank highest, the written answer holds up better when the question is reworded. The takeaway is that we should look at the model's actual answers, not just its probabilities. |
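To make the contrast between the two evaluation modes concrete, here is a minimal sketch of first-token probability scoring versus text-answer parsing for a single multiple-choice question. The model name, prompt template, and answer-parsing rule are illustrative assumptions for this summary, not the paper's exact setup.

```python
# Hypothetical sketch: first-token probability evaluation vs. text-answer
# evaluation for one MCQ. Model, prompt format, and parser are assumptions.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"  # placeholder instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}
prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# --- First-token probability evaluation ---
# Rank the option letters by the probability the model assigns to each
# as the very next token after the prompt.
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)
letter_probs = {}
for letter in options:
    # " A" with a leading space is often one token, but this is
    # tokenizer-dependent -- one source of the brittleness discussed above.
    token_id = tokenizer.encode(" " + letter, add_special_tokens=False)[0]
    letter_probs[letter] = probs[token_id].item()
prob_answer = max(letter_probs, key=letter_probs.get)

# --- Text-answer evaluation ---
# Let the model generate freely, then parse which option letter it named.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
generated = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
match = re.search(r"\b([A-D])\b", generated)  # naive answer parser
text_answer = match.group(1) if match else None

print("first-token choice:", prob_answer, "| text choice:", text_answer)
```

Running the same sketch on a perturbed version of the question (for example, with the option order shuffled) is one way to see the two readouts disagree, which is the kind of mismatch the paper analyzes.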
Keywords
» Artificial intelligence » Likelihood » Probability » Token