Summary of Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation, by Bernd Bohnet et al.
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
by Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel
First submitted to arXiv on: 31 May 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper investigates the potential of large language models (LLMs) to generate synthetic reading-comprehension data from entire books. By leveraging LLMs’ long-context capabilities, researchers can create datasets that test a model’s ability to analyze, understand, and reason over complex texts, such as questions requiring knowledge of character arcs, broader themes, or the later consequences of early events in a story. The authors propose a holistic pipeline for automatic data generation, covering question generation, answering, and model scoring with an “Evaluator.” They find that a relative, side-by-side approach provides a more consistent and differentiating scoring mechanism than an absolute scorer, and that LLMs from different model families show moderate agreement in their ratings. The authors ground their approach on the manually curated NarrativeQA dataset, where their evaluator shows excellent agreement with human judgment and even uncovers errors in the dataset. A minimal sketch of such a side-by-side evaluation loop appears after this table. |
Low | GrooveSquid.com (original content) | This paper uses powerful computer models to create new data for testing how well language models understand whole books. The goal is to see whether these models can grasp complex ideas that require following large parts of a story. To do this, the authors use a process that asks questions, answers them, and then compares the answers from different models side by side, which helps decide which model understands the story best. They tested their approach on a dataset created by people and found that it agrees closely with how people would rate the answers, so the method is useful for testing language models’ reading-comprehension abilities. |
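
The medium-difficulty summary describes a pipeline of question generation, answering, and relative (side-by-side) scoring by an LLM “Evaluator.” The sketch below illustrates that idea only; it is not the authors’ implementation. The `ask_llm` helper, the prompts, and the pairwise win-count ranking are all illustrative assumptions standing in for whatever long-context LLM API and aggregation the paper actually uses.

```python
# Minimal sketch of a side-by-side (relative) QA evaluation loop.
# Assumptions: `ask_llm(prompt) -> str` wraps some long-context LLM API;
# prompts, helper names, and the win-count aggregation are illustrative,
# not the paper's exact pipeline.

from collections import defaultdict
from itertools import combinations


def ask_llm(prompt: str) -> str:
    """Placeholder for a call to a long-context LLM (e.g., via an API client)."""
    raise NotImplementedError


def generate_questions(book_text: str, n: int = 5) -> list[str]:
    # Ask the LLM to write long-span questions that need whole-book context.
    prompt = (
        f"Read the book below and write {n} questions that require understanding "
        "character arcs, broader themes, or consequences of early events.\n\n"
        + book_text
    )
    return ask_llm(prompt).splitlines()[:n]


def answer(model_name: str, book_text: str, question: str) -> str:
    # Each candidate QA system answers with the full book as context.
    return ask_llm(
        f"[{model_name}] Using the book below, answer: {question}\n\n{book_text}"
    )


def judge_side_by_side(question: str, answer_a: str, answer_b: str) -> str:
    # The Evaluator compares two answers and returns "A" or "B".
    verdict = ask_llm(
        "You are grading reading-comprehension answers.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"


def rank_models(book_text: str, models: list[str]) -> dict[str, int]:
    # Relative scoring: tally pairwise wins instead of assigning absolute scores.
    wins: dict[str, int] = defaultdict(int)
    for question in generate_questions(book_text):
        answers = {m: answer(m, book_text, question) for m in models}
        for model_a, model_b in combinations(models, 2):
            winner = judge_side_by_side(question, answers[model_a], answers[model_b])
            wins[model_a if winner == "A" else model_b] += 1
    return dict(sorted(wins.items(), key=lambda kv: -kv[1]))
```

The pairwise comparison mirrors the paper’s finding that a relative scorer is more consistent and differentiating than an absolute one: the Evaluator only has to say which of two answers is better, rather than calibrate a standalone score for each answer.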