Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
by Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
First submitted to arXiv on: 12 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here. |
Medium | GrooveSquid.com (original content) | In this paper, the researchers investigate the limitations of the standard evaluation protocol for large language models (LLMs) on multiple-choice question (MCQ) tasks. They show that the common approach, ranking candidate answers by their first-token log probabilities, is not robust to changes in MCQ phrasing and often disagrees with the text answers that instruction-tuned models actually produce. As an alternative, they evaluate the models' text output directly. The results show that text answers are more robust to perturbations than first-token probabilities whenever the two disagree, and that this robustness gap widens as the mismatch rate grows. Text answers also remain more robust even when first-token probabilities are corrected with state-of-the-art debiasing methods such as PriDe, which highlights the benefits of text-answer evaluation over first-token probability evaluation (see the sketch after the table). |
Low | GrooveSquid.com (original content) | This paper looks at how we can better test large language models' ability to answer multiple-choice questions. Right now, people usually score a model by checking how likely each answer option is to come next. But this method isn't very reliable, because it often doesn't match what the model actually says. The researchers checked whether reading the model's written answer instead would be a better test. They found that when the written answer differs from the option the probabilities rank highest, the written answer holds up better when the question is reworded. The takeaway is that we should look at the model's actual answers, not just its probabilities. |
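To make the contrast between the two evaluation modes concrete, here is a minimal sketch of first-token probability scoring versus text-answer parsing for a single multiple-choice question. The model name, prompt template, and answer-parsing rule are illustrative assumptions for this summary, not the paper's exact setup.

```python
# Hypothetical sketch: first-token probability evaluation vs. text-answer
# evaluation for one MCQ. Model, prompt format, and parser are assumptions.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"  # placeholder instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}
prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# --- First-token probability evaluation ---
# Rank the option letters by the probability the model assigns to each
# as the very next token after the prompt.
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)
letter_probs = {}
for letter in options:
    # " A" with a leading space is often one token, but this is
    # tokenizer-dependent -- one source of the brittleness discussed above.
    token_id = tokenizer.encode(" " + letter, add_special_tokens=False)[0]
    letter_probs[letter] = probs[token_id].item()
prob_answer = max(letter_probs, key=letter_probs.get)

# --- Text-answer evaluation ---
# Let the model generate freely, then parse which option letter it named.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
generated = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
match = re.search(r"\b([A-D])\b", generated)  # naive answer parser
text_answer = match.group(1) if match else None

print("first-token choice:", prob_answer, "| text choice:", text_answer)
```

Running the same sketch on a perturbed version of the question (for example, with the option order shuffled) is one way to see the two readouts disagree, which is the kind of mismatch the paper analyzes.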
Keywords
» Artificial intelligence » Likelihood » Probability » Token