Summary of Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation, by Bernd Bohnet et al.
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
by Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel
First submitted to arXiv on: 31 May 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper investigates the potential of large language models (LLMs) to generate synthetic reading-comprehension data from entire books. By leveraging LLMs’ long-context capabilities, researchers can create datasets that test a model’s ability to analyze, understand, and reason over complex texts, such as questions requiring knowledge of character arcs, broader themes, or the later consequences of early events in a story. The authors propose a holistic pipeline for automatic data generation, covering question generation, answering, and model scoring with an “Evaluator.” They find that a relative, side-by-side approach provides a more consistent and differentiating scoring mechanism than an absolute scorer, and that LLMs from different model families show moderate agreement in their ratings. The authors ground their approach on the manually curated NarrativeQA dataset, where their evaluator shows excellent agreement with human judgment and even uncovers errors in the dataset. A minimal sketch of such a side-by-side evaluation loop appears after this table. |
Low | GrooveSquid.com (original content) | This paper uses powerful computer models to create new data for testing how well language models understand whole books. The goal is to see whether these models can grasp complex ideas that require following large parts of a story. To do this, the authors use a process that asks questions, answers them, and then compares the answers from different models side by side, which helps decide which model understands the story best. They tested their approach on a dataset created by people and found that it agrees closely with how people would rate the answers, so the method is useful for testing language models’ reading-comprehension abilities. |
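
The medium-difficulty summary describes a pipeline of question generation, answering, and relative (side-by-side) scoring by an LLM “Evaluator.” The sketch below illustrates that idea only; it is not the authors’ implementation. The `ask_llm` helper, the prompts, and the pairwise win-count ranking are all illustrative assumptions standing in for whatever long-context LLM API and aggregation the paper actually uses.

```python
# Minimal sketch of a side-by-side (relative) QA evaluation loop.
# Assumptions: `ask_llm(prompt) -> str` wraps some long-context LLM API;
# prompts, helper names, and the win-count aggregation are illustrative,
# not the paper's exact pipeline.

from collections import defaultdict
from itertools import combinations


def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a long-context LLM (e.g., via an API client)."""
    raise NotImplementedError


def generate_questions(book_text: str, n: int = 5) -> list[str]:
    # Ask the LLM to write long-span questions that need whole-book context.
    prompt = (
        f"Read the book below and write {n} questions that require understanding "
        "character arcs, broader themes, or consequences of early events.\n\n"
        + book_text
    )
    return ask_llm(prompt).splitlines()[:n]


def answer(model_name: str, book_text: str, question: str) -> str:
    # Each candidate QA system answers with the full book as context.
    return ask_llm(
        f"[{model_name}] Using the book below, answer: {question}\n\n{book_text}"
    )


def judge_side_by_side(question: str, answer_a: str, answer_b: str) -> str:
    # The Evaluator compares two answers and returns "A" or "B".
    verdict = ask_llm(
        "You are grading reading-comprehension answers.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"


def rank_models(book_text: str, models: list[str]) -> dict[str, int]:
    # Relative scoring: tally pairwise wins instead of assigning absolute scores.
    wins: dict[str, int] = defaultdict(int)
    for question in generate_questions(book_text):
        answers = {m: answer(m, book_text, question) for m in models}
        for model_a, model_b in combinations(models, 2):
            winner = judge_side_by_side(question, answers[model_a], answers[model_b])
            wins[model_a if winner == "A" else model_b] += 1
    return dict(sorted(wins.items(), key=lambda kv: -kv[1]))
```

The pairwise comparison mirrors the paper’s finding that a relative scorer is more consistent and differentiating than an absolute one: the Evaluator only has to say which of two answers is better, rather than calibrate a standalone score for each answer.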