Summary of RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content, by Joao Monteiro et al.
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
by Joao Monteiro, Pierre-Andre Noel, Etienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vazquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
First submitted to arXiv on: 17 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Large Language Models (LLMs) are trained on vast amounts of data, including encyclopedic documents and benchmark datasets. To evaluate these models accurately, the authors introduce a new test dataset called RepLiQA, consisting of five test splits that had not been released to the internet or exposed to LLM APIs prior to publication. Each sample in RepLiQA includes a reference document, a question about the document’s topic, a ground-truth answer, and the paragraph containing the answer. The authors run a large-scale benchmark over several state-of-the-art LLMs to uncover performance differences across models of various types and sizes in a context-conditional language modeling setting (see the sketch after the table). |
Low | GrooveSquid.com (original content) | Large Language Models are trained on lots of data from the internet. But this data can be messy, and some of it may already have been used to test models! To fix this, the authors created a new test dataset called RepLiQA, made of separate groups of questions and answers that were not online or exposed to model APIs before now. Each sample has a made-up reference document, a question about it, the correct answer, and the paragraph where the answer appears. They tested many top-performing models to see how they do on this new dataset. |
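
To make the sample structure and the context-conditional setting concrete, here is a minimal sketch in Python of loading RepLiQA and building a prompt that conditions the model on the reference document. The Hugging Face dataset id, the split name, and the field names below are illustrative assumptions, not details confirmed by this page.

```python
# A minimal sketch, assuming RepLiQA is hosted on the Hugging Face Hub.
# The dataset id "ServiceNow/repliqa", the split name "repliqa_0", and the
# field names below are illustrative assumptions.
from datasets import load_dataset

# RepLiQA ships as test-only data; load one of the five assumed splits.
dataset = load_dataset("ServiceNow/repliqa", split="repliqa_0")

sample = dataset[0]

# The parts of each sample described in the summaries above
# (field names are assumptions):
document = sample["document_extracted"]  # full reference document text
question = sample["question"]            # question about the document
answer = sample["answer"]                # ground-truth answer
# A paragraph containing the answer is also provided, per the summary.

# Context-conditional setting: the model must answer from the supplied
# document rather than from memorized training data.
prompt = (
    "Answer the question using only the reference document below.\n\n"
    f"Reference document:\n{document}\n\n"
    f"Question: {question}\n"
    "Answer:"
)
print(prompt[:500])
print("Expected answer:", answer)
```

Because the documents had not been published online before the dataset's release, a model that answers correctly in this setting is demonstrating reading comprehension of the provided context rather than recall of its training data.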