Summary of One Thousand and One Pairs: A “novel” challenge for long-context language models, by Marzena Karpinska et al.
One Thousand and One Pairs: A “novel” challenge for long-context language models
by Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer
First submitted to arXiv on: 24 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | High Difficulty Summary Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The paper presents NoCha, a new dataset designed to test long-context language models’ (LLMs) ability to retrieve, synthesize, and reason over information in book-length inputs. Unlike existing synthetic benchmarks that evaluate only surface-level retrieval, NoCha contains 1,001 pairs of true and false claims about recently published English novels, many of which require global reasoning over the entire book to verify. The authors evaluate ten LLMs and find the task massively challenging: no open-weight model performs above random chance, while GPT-4o achieves the highest accuracy at 55.8%. Analysis shows that models perform far better on claims verifiable via sentence-level retrieval than on claims requiring global reasoning, and that accuracy drops on speculative fiction. The proposed methodology also enables straightforward analysis of future models. |
| Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper creates a new way to test how well language models understand long books. The authors build a dataset called NoCha, made of pairs of true and false claims about recently published novels. Unlike previous tests that only look at surface-level information, NoCha requires reasoning about the entire book. The authors tested ten language models and found the task very hard: no open-weight model beat random guessing, and even the best model, GPT-4o, was right only 55.8% of the time. They also found that models handle claims about single sentences better than ideas that span a whole book, and that science fiction and fantasy books are especially difficult for them. |
Keywords
» Artificial intelligence » GPT