ClashEval: Quantifying the tug-of-war between an LLM’s internal prior and external evidence
by Kevin Wu, Eric Wu, James Zou
First submitted to arXiv on: 16 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper investigates how large language models (LLMs) handle information retrieved from external sources. The authors note that while retrieval augmented generation (RAG) aims to provide up-to-date knowledge, it can also surface incorrect or harmful content in context. They curate a dataset of over 1,200 questions across six domains and introduce errors, ranging from subtle to blatant, into the answers. Benchmarking six top-performing LLMs, including GPT-4o, they find that models are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge around 60% of the time. However, the more unrealistic the retrieved content is, the less likely the model is to adopt it. The authors also demonstrate simple methods for improving model accuracy when the retrieved content conflicts with the model's prior. This paper highlights a difficult task and benchmark for LLMs, namely their ability to correctly discern when they are wrong in light of correct retrieved content. (A minimal sketch of this evaluation setup follows the table.) |
Low | GrooveSquid.com (original content) | This study looks at how language models handle information from outside sources. Sometimes this information can be wrong or even harmful. The researchers created a large dataset of questions about different topics and deliberately introduced mistakes into the answers. They tested six top language models, including GPT-4o, to see how they handled the incorrect information. They found that the models would use the wrong information over 60% of the time, but the more obvious the mistake, the less likely they were to adopt it. The researchers also showed simple ways to make these models better at handling conflicting information. |
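To make the setup concrete, below is a minimal, hypothetical sketch of how a ClashEval-style probe could be run. This is not the authors' code: the `query_llm` and `perturb` helpers are placeholders for whatever LLM client and answer-perturbation scheme you choose, and the prompt format is only an assumption.

```python
# Minimal, hypothetical sketch of a ClashEval-style probe (not the authors' code).
# query_llm and perturb are placeholders: swap in a real LLM client and a real
# perturbation scheme (e.g. shifting a number or swapping a name) to run this.

def query_llm(prompt: str) -> str:
    """Return the model's short-form answer to the prompt."""
    raise NotImplementedError("plug in your LLM client here")


def perturb(answer: str) -> str:
    """Introduce a subtle-to-blatant error into the reference answer."""
    raise NotImplementedError("plug in your perturbation scheme here")


def context_adoption_rate(questions: list[dict]) -> float:
    """Fraction of questions where the model abandons a correct prior answer
    in favor of a deliberately corrupted passage placed in its context."""
    overridden, eligible = 0, 0
    for q in questions:  # each dict: {"question", "answer", "passage"}
        prior = query_llm(q["question"])
        if q["answer"].lower() not in prior.lower():
            continue  # only count cases where the model's prior was correct
        eligible += 1
        wrong = perturb(q["answer"])
        rag_prompt = (
            f"Context: {q['passage'].replace(q['answer'], wrong)}\n\n"
            f"Question: {q['question']}\nAnswer:"
        )
        if wrong.lower() in query_llm(rag_prompt).lower():
            overridden += 1  # the model adopted the corrupted content
    return overridden / eligible if eligible else 0.0
```

In the paper's terms, this adoption rate is what sits around 60% for plausible perturbations and drops as the inserted error becomes more blatant.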
Keywords
» Artificial intelligence » GPT » RAG » Retrieval augmented generation