Summary of “A Claim Decomposition Benchmark for Long-form Answer Verification” by Zhihao Zhang, Yixing Fan, Ruqing Zhang, and Jiafeng Guo
A Claim Decomposition Benchmark for Long-form Answer Verification
by Zhihao Zhang, Yixing Fan, Ruqing Zhang, Jiafeng Guo
First submitted to arXiv on: 16 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Advances in Large Language Models (LLMs) have significantly improved performance on complex long-form question answering. However, LLMs still generate “hallucinated” responses that are not factual. Attaching accurate citations to each claim in a response is a key way to improve factuality and verifiability. Existing research focuses on producing accurate citations but largely overlooks identifying the claims or statements within a response. This paper introduces a claim decomposition benchmark that requires systems to identify the atomic, checkworthy claims in LLM responses. The authors present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset and adds expert annotations to ensure data quality. CACDD contains 500 human-annotated question-answer pairs with 4,956 atomic claims. The paper describes a new human annotation pipeline and its challenges, and reports baseline results for zero-shot, few-shot, and fine-tuned LLMs. The results show that claim decomposition is highly challenging and requires further exploration. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary LLMs have gotten better at answering questions, but their long answers sometimes contain statements that are not true. One way to catch this is to attach a source to each claim, which first requires identifying the claims in an answer. Existing research focuses on providing accurate citations but largely skips that identification step. This paper introduces a benchmark that measures how well LLMs break a long answer into small claims that can each be checked. The authors built a dataset of 500 questions and answers in which experts carefully labeled each claim. They also tested LLMs in zero-shot, few-shot, and fine-tuned setups as baselines. The results show that current models find this task really hard. |
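
To make the task in the summaries above concrete, here is a minimal Python sketch of what a claim decomposition record and a simple scoring function might look like. Everything in it is an assumption for illustration: the `DecompositionExample` fields, the `normalize` helper, and the exact-match precision/recall/F1 scoring are hypothetical and are not taken from the paper, which may use a different data format and different metrics. The toy example is in English, whereas CACDD is a Chinese dataset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DecompositionExample:
    """Hypothetical record: a question, a long-form answer, and its atomic claims."""
    question: str
    answer: str
    claims: List[str] = field(default_factory=list)


def normalize(claim: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period (illustrative only)."""
    return " ".join(claim.lower().split()).rstrip(".")


def claim_overlap_scores(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between predicted and gold claims.

    Assumption: exact string match after normalization. A real evaluation
    would likely need semantic matching, since claims can be paraphrased.
    """
    pred = {normalize(c) for c in predicted}
    ref = {normalize(c) for c in gold}
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Toy record with two gold atomic claims.
example = DecompositionExample(
    question="Where is the Eiffel Tower and when was it completed?",
    answer="The Eiffel Tower is in Paris and was completed in 1889.",
    claims=[
        "The Eiffel Tower is in Paris.",
        "The Eiffel Tower was completed in 1889.",
    ],
)

# A hypothetical system output: one correct claim, one claim not in the gold set.
system_claims = ["The Eiffel Tower is in Paris", "The Eiffel Tower is tall."]
scores = claim_overlap_scores(system_claims, example.claims)
```

In this toy case one of the two predicted claims matches one of the two gold claims, so precision, recall, and F1 all come out to 0.5. Exact matching is deliberately naive here; it only illustrates the idea of comparing a system's decomposition against human-annotated atomic claims.
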
Keywords
» Artificial intelligence » Few shot » Hallucination » Question answering » Zero shot