Summary of Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning, by Shramay Palta et al.
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
by Shramay Palta, Nishant Balepur, Peter Rankel, Sarah Wiegreffe, Marine Carpuat, Rachel Rudinger
First submitted to arXiv on: 6 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper develops a new way to evaluate commonsense reasoning models. The authors argue that traditional multiple-choice question (MCQ) benchmarks are flawed because they require selecting a single correct answer, whereas real-life scenarios often admit several plausible answers. To examine this, the researchers collect 5,000 independent human plausibility judgments on the answer choices of 250 MCQ items drawn from two commonsense reasoning benchmarks. They find that over 20% of the sampled MCQs show a mismatch between the most plausible answer choice and the benchmark gold answer, often due to ambiguity or semantic mismatches. Experiments with large language models (LLMs) show low accuracy and high variation in performance on this subset, suggesting that the plausibility criterion can help identify more reliable benchmark items for commonsense evaluation. (A minimal code sketch of this plausibility-vs-gold check appears below the table.) |
| Low | GrooveSquid.com (original content) | This paper is about how we test machines’ ability to understand everyday situations. Right now, we ask these machines multiple-choice questions with one correct answer. But what if several answers are reasonable? The researchers took 250 questions and asked many people which answer they thought was most plausible. In over 20% of the cases, the answer people found most plausible did not match the official “correct” answer, often because the question was ambiguous or the answers did not quite fit. The authors also tested large language models and found that they did poorly and inconsistently on these tricky questions. This research helps us build better tests of whether machines understand everyday situations. |
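
As an illustration of the plausibility-vs-gold check described in the summaries above, here is a minimal Python sketch. It is not the authors' code: the item IDs, rating scale, and data layout are hypothetical. It aggregates per-choice human plausibility ratings by their mean and flags items where the most plausible choice disagrees with the benchmark gold answer.

```python
from statistics import mean

# Hypothetical records: each MCQ has a benchmark gold answer and, for every
# answer choice, a list of plausibility ratings from independent annotators.
items = [
    {"id": "q1", "gold": "A",
     "ratings": {"A": [4, 5, 4], "B": [5, 5, 5], "C": [1, 2, 1]}},
    {"id": "q2", "gold": "B",
     "ratings": {"A": [2, 1, 2], "B": [5, 4, 5], "C": [3, 3, 2]}},
]

def most_plausible(ratings):
    """Return the choice whose mean plausibility rating is highest."""
    return max(ratings, key=lambda choice: mean(ratings[choice]))

# Flag items where the crowd's most plausible choice disagrees with the gold answer.
mismatches = [item["id"] for item in items
              if most_plausible(item["ratings"]) != item["gold"]]

print("Mismatched items:", mismatches)                        # -> ['q1']
print(f"Mismatch rate: {len(mismatches) / len(items):.0%}")   # -> 50%
```

On real annotations, a more careful aggregation (for example, handling ties or choices of comparable plausibility) may be needed; the sketch only shows the basic comparison between the crowd's top choice and the gold label.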
Keywords
» Artificial intelligence » Machine learning