FABLES: Evaluating faithfulness and content selection in book-length summarization

by Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

First submitted to arxiv on: 1 Apr 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper evaluates the faithfulness and content selection of large language models (LLMs) when summarizing book-length documents. The study conducts a large-scale human evaluation of LLM-generated summaries, focusing on fictional books published in 2023 or 2024. The results show that Claude-3-Opus significantly outperforms the other closed-source LLMs in faithfulness, while the open-source Mixtral is on par with GPT-3.5-Turbo. Analysis reveals that most unfaithful claims relate to events and character states, and verifying them requires indirect reasoning over the narrative. Although LLM-based auto-raters have proven reliable for factuality and coherence in other settings, they struggle to detect unfaithful claims in this one, which motivates future work on automatic detection. Beyond faithfulness, the study also explores content selection errors, such as omissions of key events, in book-length summarization.
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper looks at how well big computer models can summarize really long books. The researchers ran a big test in which people who had read 26 books checked whether each computer-written summary was accurate. The best model, Claude-3-Opus, did better than the other models at keeping its summaries truthful. But automatically spotting when a summary gets something wrong turned out to be hard. The study shows that computers can summarize big books fairly well, but they still make mistakes.

Keywords

» Artificial intelligence  » Claude  » Gpt  » Summarization