
Summary of One Thousand and One Pairs: A “novel” Challenge for Long-context Language Models, by Marzena Karpinska et al.


One Thousand and One Pairs: A “novel” challenge for long-context language models

by Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer

First submitted to arXiv on: 24 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper presents NoCha, a new dataset designed to test long-context language models’ (LLMs’) ability to retrieve, synthesize, and reason over information spread across book-length inputs. Unlike existing synthetic benchmarks that evaluate only surface-level retrieval, NoCha contains 1,001 pairs of true and false claims about recently published English fictional books, many of which require global reasoning over the entire book to verify. The authors evaluate ten LLMs and find that although the best-performing model, GPT-4o, reaches the highest accuracy at 55.8%, no open-weight model performs above random chance. Analysis reveals that models do much better on claims requiring only sentence-level retrieval than on those requiring global reasoning, and perform worse on speculative fiction books that involve extensive world-building. The proposed methodology also allows future models to be evaluated and analyzed easily.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper creates a new way to test how well language models understand long books. The authors build a dataset called NoCha, which contains pairs of true and false claims about recently published books. Unlike previous tests that only check surface-level information, NoCha requires thinking globally about the entire book. The authors tested ten language models and found that even the best one was right only a little more than half the time, while openly available models did no better than guessing. They also discovered that models handle claims about single sentences much better than big-picture ideas, and struggle most with science fiction and fantasy books that build complex imaginary worlds.

Keywords

» Artificial intelligence  » GPT