Summary of MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty, by Yongjin Yang et al.
MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty
by Yongjin Yang, Haneul Yoo, Hwaran Lee
First submitted to arXiv on: 13 Aug 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract; see the arXiv listing |
Medium | GrooveSquid.com (original content) | The paper investigates uncertainty quantification methods for large language models (LLMs) and evaluates how they perform under data uncertainty, which arises from irreducible randomness in the data. The authors propose a new dataset, MAQA, to assess uncertainty quantification under data uncertainty, and examine five uncertainty quantification methods across diverse white- and black-box LLMs. The findings show that entropy-based and consistency-based methods estimate model uncertainty well even in the presence of data uncertainty (a rough sketch of such scores follows this table), while the remaining methods struggle depending on the task, with white-box LLMs showing overconfidence on reasoning tasks. |
Low | GrooveSquid.com (original content) | The paper looks at how good large language models are at giving correct answers. Right now, these models can give answers that sound right but aren't actually true. To fix this, researchers have been trying to judge whether an answer is correct by looking at how sure the model is about its response. But most of these methods only check whether the model knows the answer; they ignore the chance that the answer could be wrong because the data itself is uncertain. This paper reviews previous methods and proposes a new way to test them using a special dataset of questions that require reasoning or knowledge. The results show that some methods work better than others, and that they behave differently depending on the kind of question being asked. |
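To make the medium-difficulty summary's mention of entropy-based and consistency-based uncertainty scores concrete, here is a minimal Python sketch. It is not taken from the paper; the function names, the sampling setup, and the example answers are illustrative assumptions. It computes two common black-box-style scores over several answers sampled from a model for the same question: the Shannon entropy of the empirical answer distribution, and the fraction of samples that disagree with the majority answer.

```python
# Illustrative sketch only: simple uncertainty scores computed over several
# sampled answers to the same question. Higher values mean more uncertainty.
# The function names and example answers are hypothetical, not the paper's code.
from collections import Counter
import math

def predictive_entropy(answers):
    """Entropy-based score: Shannon entropy of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def inconsistency(answers):
    """Consistency-based score: fraction of samples that disagree with the majority answer."""
    counts = Counter(answers)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(answers)

if __name__ == "__main__":
    # Five hypothetical answers sampled from an LLM for one question.
    samples = ["Paris", "Paris", "Lyon", "Paris", "Marseille"]
    print(f"entropy-based score:     {predictive_entropy(samples):.3f}")  # ~0.950 nats
    print(f"consistency-based score: {inconsistency(samples):.3f}")       # 0.400
```

Under data uncertainty (questions with more than one acceptable answer), scores like these can flag genuine ambiguity rather than a lack of model knowledge, which is the distinction the paper's evaluation targets.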