Summary of "Evaluating Large Language Models for Generalization and Robustness via Data Compression" by Yucheng Li et al.
Evaluating Large Language Models for Generalization and Robustness via Data Compression
by Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin
First submitted to arXiv on: 1 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The proposed evaluation approach for large language models addresses challenges such as data contamination, prompt sensitivity, and the high cost of creating benchmarks. The method uses lossless data compression to test how well models generalize beyond their training cutoff date. A comprehensive dataset spanning 83 months (2017–2023) is split into training and testing periods according to each model's training data cutoff. The approach measures two aspects: compression performance on the testing period, which indicates generalization to unseen data, and the gap between training-period and testing-period performance, which measures robustness. Experiments cover 14 representative large language models of various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. Results show that many models' compression performance degrades significantly on data from after their cutoff date, while models such as Mistral and Llama-2 strike a good balance between performance and robustness. A sketch of the core measurement follows the table. |
| Low | GrooveSquid.com (original content) | Large language models are used more and more to process human language. But how do we know if they're doing a good job? The usual tests have problems: models may have already seen the test data during training, and results can depend heavily on how questions are phrased. To fix this, scientists came up with a new way to test these models. They use something called lossless data compression to see how well the models can predict text written after their training ended. This helps us understand whether a model can handle new information or just sticks to what it already knows. |
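To make the compression metric concrete, here is a minimal sketch (not the authors' released code) of how a language model's log-likelihood translates into a lossless compression rate: by the standard arithmetic-coding argument, a model that assigns probability p to the next token can encode it in about -log2(p) bits, so the model's average negative log2-likelihood per byte is its achievable bits-per-byte. The model name `gpt2` and the example strings below are placeholders; the paper evaluates 14 larger models on time-stamped corpora.

```python
# A minimal sketch of compression-based evaluation: a model's mean negative
# log2-likelihood per byte equals the code length an arithmetic coder would
# achieve, so lower bits-per-byte means better compression.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the paper tests 14 larger models
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def bits_per_byte(text: str) -> float:
    """Theoretical lossless compression cost of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == inputs, the returned loss is the mean cross-entropy
        # (in nats) over the ids.shape[1] - 1 predicted tokens.
        nll_nats = model(ids, labels=ids).loss.item()
    n_predicted = ids.shape[1] - 1  # the first token is never predicted
    total_bits = nll_nats * n_predicted / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Generalization: bits-per-byte on data from after the training cutoff.
# Robustness: the gap between pre- and post-cutoff bits-per-byte.
bpb_pre = bits_per_byte("Example text written before the model's cutoff ...")
bpb_post = bits_per_byte("Example text written after the model's cutoff ...")
print(f"pre-cutoff: {bpb_pre:.3f} bpb, post-cutoff: {bpb_post:.3f} bpb, "
      f"gap: {bpb_post - bpb_pre:.3f}")
```

In the paper's setup, the testing-period bits-per-byte captures generalization and the pre-/post-cutoff gap captures robustness; compressing a long document would additionally require sliding the model's context window across it, which this sketch omits.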
Keywords
» Artificial intelligence » Generalization » Llama » Multi-modal » Prompt