Summary of Benchmarking Benchmark Leakage in Large Language Models, by Ruijie Xu et al.

Benchmarking Benchmark Leakage in Large Language Models

by Ruijie Xu, Zengzhi Wang, Run-Ze Fan, Pengfei Liu

First submitted to arXiv on: 29 Apr 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary This paper investigates the leakage of benchmark datasets into the pre-training data of Large Language Models (LLMs). The authors argue that such leakage can lead to unfair comparisons and hinder the development of LLMs. To address this problem, they propose a detection pipeline that uses two metrics, Perplexity and N-gram accuracy, to identify potential data leakage. They apply their method to 31 LLMs on mathematical reasoning tasks and find significant instances of test set misuse, leading to potentially biased benchmarking results. The authors recommend improved model documentation, benchmark setup, and evaluation practices to promote transparency in LLM development. They also propose the “Benchmark Transparency Card” as a tool to encourage clear reporting of benchmark usage. The paper’s findings and resources are publicly available for further research.
Low	GrooveSquid.com (original content)	Low Difficulty Summary This study looks at how Large Language Models (LLMs) are trained using data from multiple sources. The authors found that some LLMs are actually learning from test data meant to be kept secret, which is not fair. To fix this problem, they developed a way to detect when an LLM has seen the wrong data by checking how well it can predict the text of benchmark questions, such as math problems. They tested their method on 31 different LLMs and found that many of them were using test data in ways that are unfair. The authors suggest that models should be more transparent about how they’re trained, and that we should all follow some basic rules to make sure comparisons between models are fair.
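
To make the two metrics named in the medium summary more concrete, here is a minimal Python sketch of how N-gram accuracy and Perplexity could be computed for one benchmark sample. It is not the authors' pipeline: the function names, the `predict_next` and `token_logprob` interfaces, the choice of `n=3` with four evenly spaced start positions, and the toy "memorizer" model are all assumptions made for illustration. The general idea is that a model which reproduces benchmark text verbatim (high N-gram accuracy) or finds it unusually predictable (low Perplexity) may have seen that text during training.

```python
import math

def ngram_accuracy(tokens, predict_next, n=3, num_starts=4):
    """Fraction of start positions where the model reproduces the next n tokens exactly.

    predict_next(prefix, n) is a hypothetical interface returning n predicted tokens.
    """
    if len(tokens) <= n:
        return 0.0
    last_start = len(tokens) - n
    # Evenly spaced start positions; each keeps at least one prefix token.
    starts = sorted({max(1, (i + 1) * last_start // num_starts) for i in range(num_starts)})
    hits = sum(
        list(predict_next(tokens[:s], n)) == list(tokens[s:s + n])
        for s in starts
    )
    return hits / len(starts)

def perplexity(tokens, token_logprob):
    """exp of the mean negative log-probability per token.

    token_logprob(prefix, token) is a hypothetical interface returning log p(token | prefix).
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    total = sum(token_logprob(tokens[:i], tokens[i]) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))

# Toy stand-in for an LLM that has memorized exactly one benchmark sample (assumption).
MEMORIZED = "what is 2 plus 3 ? the answer is 5".split()

def memorizer_predict(prefix, n):
    k = len(prefix)
    if list(prefix) == MEMORIZED[:k]:
        return MEMORIZED[k:k + n]
    return ["<unk>"] * n

def memorizer_logprob(prefix, token):
    k = len(prefix)
    seen = k < len(MEMORIZED) and list(prefix) == MEMORIZED[:k] and MEMORIZED[k] == token
    return math.log(0.9) if seen else math.log(0.01)

leaked = MEMORIZED
fresh = "what is 4 plus 7 ? the answer is 11".split()

leaked_acc = ngram_accuracy(leaked, memorizer_predict)
fresh_acc = ngram_accuracy(fresh, memorizer_predict)
leaked_ppl = perplexity(leaked, memorizer_logprob)
fresh_ppl = perplexity(fresh, memorizer_logprob)
```

With the toy model, the memorized sample gets a much higher N-gram accuracy and a much lower Perplexity than the unseen one. With a real LLM, the two callables would wrap the model's generation and scoring functions, and any threshold for flagging leakage would need to be chosen empirically.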

Keywords

» Artificial intelligence » N-gram » Perplexity
