Summary of Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think, by Xinpeng Wang et al.


Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think

by Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank

First submitted to arXiv on: 12 Apr 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract. Read the original abstract here.
Medium Difficulty Summary (GrooveSquid.com original content)
In this paper, the researchers investigate the limitations of the standard way large language models (LLMs) are evaluated on multiple-choice question (MCQ) tasks. They show that the current approach, ranking candidate answers by the log probability of the first answer token, is not robust to changes in MCQ phrasing and often does not match the text answers that instruction-tuned models actually produce. As an alternative, they evaluate the model’s text output directly. The results show that when the first-token prediction and the text answer disagree, the text answer is more robust to perturbations than the first-token probabilities, and this gap widens as the mismatch rate increases. Text answers also remain more robust even when first-token probabilities are corrected with state-of-the-art debiasing methods such as PriDe, which highlights the advantage of text-answer evaluation over first-token probability evaluation.
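To make the two evaluation modes concrete, here is a minimal sketch (not the paper’s own code) contrasting first-token probability selection with text-answer parsing for a single multiple-choice question. The model name, prompt template, and regex-based answer matching are illustrative assumptions, not the authors’ exact setup.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any instruction-tuned causal LM would do.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# A toy MCQ prompt; the paper's actual prompt templates may differ.
prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# (1) First-token probability evaluation: rank the option letters by the
#     probability the model assigns to each as the next token after "Answer:".
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
option_ids = {
    opt: tokenizer.encode(f" {opt}", add_special_tokens=False)[0] for opt in "ABCD"
}
prob_choice = max(option_ids, key=lambda opt: next_token_logits[option_ids[opt]].item())

# (2) Text-answer evaluation: let the model generate freely and parse the chosen
#     option letter out of the text (a simple regex stands in for answer matching).
gen_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
new_tokens = gen_ids[0, inputs["input_ids"].shape[1]:]
gen_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
match = re.search(r"\b([ABCD])\b", gen_text)
text_choice = match.group(1) if match else None

# The two selections can disagree; the paper's finding is that the text answer
# is the more robust one under prompt perturbations.
print(f"first-token choice: {prob_choice}  |  text choice: {text_choice}")
```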
Low Difficulty Summary (GrooveSquid.com original content)
This paper looks at how we can better test how well large language models answer multiple-choice questions. Right now, people usually pick the option the model rates as most likely to come next. But that method has a problem: it does not always match up with what the model actually writes out. The researchers wanted to see whether reading the model’s text answer would be a better way to test these models. They found that when the written answer and the most likely option disagree, the written answer is the more reliable of the two. This study shows that we should look at the model’s actual answers instead of just the probabilities.

Keywords

» Artificial intelligence  » Likelihood  » Probability  » Token