Summary of Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models, by Yihong Dong et al.
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
by Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, Ge Li
First submitted to arXiv on: 24 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on arXiv. |
Medium | GrooveSquid.com (original content) | Recent advances in large language models (LLMs) have raised concerns about data contamination, since their training data is vast and drawn from diverse sources. Because LLMs are commonly evaluated on open-access benchmarks, test data can leak into training and inflate reported results. This paper proposes CDD (Contamination Detection via output Distribution), which detects contamination by analyzing how peaked an LLM’s output distribution is for a given prompt, and TED (Trustworthy Evaluation via output Distribution), which corrects the output distribution to mitigate contamination’s effect on evaluation. Two new benchmarks, DetCon and ComiEval, are introduced for the contamination detection and mitigation tasks. The proposed methods show significant improvements over existing approaches, particularly in detecting implicit contamination (a simple code sketch of the detection idea follows the table). |
Low | GrooveSquid.com (original content) | Large language models have made impressive progress recently, but there’s a concern that they may have already seen the test questions in their training data, which makes their scores look better than they should. This paper tries to fix that by making sure evaluations stay fair and honest. The authors propose two tools: CDD (Contamination Detection) to spot when a model is just repeating something it memorized, and TED (Trustworthy Evaluation) to correct the results when that happens. Together, these help tell whether a model has really learned to solve a problem or has simply memorized the answers. |
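To make the detection idea in the medium summary more concrete, here is a minimal Python sketch of flagging a prompt whose sampled outputs cluster tightly around a single answer. It illustrates the general peakedness idea only, not the paper’s exact CDD procedure: the function names, the edit-distance radius, and the 0.9 threshold are assumptions chosen for the example.

```python
import collections
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (standard DP)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def peakedness_score(samples: List[str], radius: int = 2) -> float:
    """Fraction of sampled outputs lying within `radius` edits of the most
    frequent sample; values near 1.0 mean the output distribution is sharply
    peaked, which may signal memorized (contaminated) data."""
    mode, _ = collections.Counter(samples).most_common(1)[0]
    near_mode = sum(edit_distance(s, mode) <= radius for s in samples)
    return near_mode / len(samples)


# Example: outputs sampled from an LLM at non-zero temperature for one
# benchmark prompt (the strings here are placeholders, not real model output).
samples = ["def add(a, b): return a + b"] * 8 + ["def add(x, y): return x + y"] * 2
score = peakedness_score(samples)
print(f"peakedness = {score:.2f}")     # 0.80 for the placeholder samples above
if score > 0.9:                        # hypothetical threshold, not the paper's
    print("Output distribution is highly peaked: possible contamination.")
```

In this sketch, a higher peakedness score means the model keeps producing nearly identical outputs even under sampling, which is the kind of behavior CDD uses as evidence of contamination; the paper’s TED method then adjusts the output distribution during evaluation rather than simply flagging it.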