
Summary of A Framework for Evaluating LLMs Under Task Indeterminacy, by Luke Guerdan et al.


A Framework for Evaluating LLMs Under Task Indeterminacy

by Luke Guerdan, Hanna Wallach, Solon Barocas, Alexandra Chouldechova

First submitted to arXiv on: 21 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper proposes a framework for evaluating large language models (LLMs) on tasks that are ambiguous, vague, or both, a condition the authors call task indeterminacy, in which some items in the evaluation corpus have multiple correct responses. The framework models the relationships between task specification, human ratings, and LLM responses in the evaluation pipeline. Using it, the authors show that the conventional “gold label” assumption, which treats each item as having a single correct response, underestimates true performance, and they provide a method for estimating an error-adjusted performance interval given partial knowledge about which items are indeterminate. The work highlights the need for the research community to reconsider traditional LLM evaluation methods.
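To make the idea of an error-adjusted performance interval concrete, here is a minimal Python sketch. It is not the paper's estimator: the Item fields and the performance_interval helper are hypothetical, and the sketch only illustrates the general principle of bounding accuracy when some items admit several acceptable responses and knowledge of the acceptable set is partial.

```python
# Illustrative sketch only (not the paper's method): bound a model's accuracy
# when some items are indeterminate, i.e., they may have more than one
# acceptable response and we only partially know which responses are acceptable.

from dataclasses import dataclass, field


@dataclass
class Item:
    response: str                                        # the LLM's response to this item
    known_correct: set = field(default_factory=set)      # responses known to be acceptable
    maybe_correct: set = field(default_factory=set)      # responses whose status is unknown


def performance_interval(items):
    """Return (lower, upper) bounds on accuracy.

    Lower bound: a response counts as correct only if it is known to be acceptable.
    Upper bound: responses of unknown status are also counted as correct.
    """
    lower = sum(it.response in it.known_correct for it in items)
    upper = sum(it.response in (it.known_correct | it.maybe_correct) for it in items)
    n = len(items)
    return lower / n, upper / n


if __name__ == "__main__":
    items = [
        Item("yes", known_correct={"yes"}),                      # determinate item
        Item("it depends", known_correct={"yes"},
             maybe_correct={"it depends", "unclear"}),           # indeterminate item
        Item("no", known_correct={"yes"}, maybe_correct={"unclear"}),
    ]
    print(performance_interval(items))  # (0.333..., 0.666...)
```

Scoring only against a single gold label would correspond to the lower bound here, which is why that assumption can understate how well the model is actually doing on indeterminate items.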
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper is all about making sure we’re not misled when we grade language models on tricky tasks. You see, sometimes these tasks can be super unclear or open-ended, which makes it hard to decide what’s “right” or “wrong”. The authors created a new way to evaluate language models that takes this into account. They showed that if we just assume there’s one correct answer (like we usually do), we’ll end up underestimating how well the model is really doing. They also gave us a tool to fix this problem by adjusting for the uncertainty in those tricky tasks.

Keywords

* Artificial intelligence