FABLES: Evaluating faithfulness and content selection in book-length summarization

by Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

First submitted to arxiv on: 1 Apr 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper evaluates the faithfulness and content selection of large language models (LLMs) when summarizing book-length documents. The study conducts a large-scale human evaluation of LLM-generated summaries, focusing on fictional books published in 2023 or 2024. The results show that Claude-3-Opus significantly outperforms the other closed-source LLMs in faithfulness, while the open-source Mixtral is on par with GPT-3.5-Turbo. Analysis reveals that most unfaithful claims relate to events and character states, and verifying them requires indirect reasoning over the narrative. Although LLM-based auto-raters have proven reliable for factuality and coherence in other settings, they struggle to detect unfaithful claims in this one, which motivates future work on automatic detection. Beyond faithfulness, the study also explores content selection errors, such as omissions of key events, in book-length summarization.
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper looks at how well big computer models can summarize really long books. The researchers ran a big test in which people who had read 26 books checked whether each computer-written summary was accurate. The best model, Claude-3-Opus, did better than the other models at keeping its summaries truthful. But automatically spotting when a summary gets something wrong turned out to be hard. The study shows that computers can summarize big books fairly well, but they still make mistakes.

Keywords

» Artificial intelligence  » Claude  » Gpt  » Summarization