Summary of Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning, by Yang Zhao et al.
Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning
by Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Zhouhao Sun, Jun Shi, Ting Liu, Bing Qin
First submitted to arXiv on: 18 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary: read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Large Language Models (LLMs) have achieved impressive performance through pretraining on diverse data sources, but the contribution of each source remains unclear. To address this, we used machine unlearning to analyze the effect of 48 datasets from five major categories of pretraining data on LLM performance, measured with benchmarks covering nine major categories of model capabilities (see the sketch after this table). Our findings provide insights into how to organize data to support efficient pretraining of LLMs. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Large Language Models have gotten very good by learning from lots of different sources, but we don't really know which of those sources are making them so smart. We wanted to figure out how each kind of data helps or hurts the model's abilities. So we looked at 48 different sets of data and measured what happened when a model was made to "forget" each one. Our results showed that some data is especially helpful for certain skills, like understanding books. This helps us know how to organize training data so it works better. |
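
As a rough illustration of the unlearn-then-re-evaluate idea described in the summaries above, here is a minimal sketch (not the authors' code): it approximates removing one pretraining corpus by running gradient ascent on that corpus, then reads the change in each benchmark score as that corpus's contribution. The `model`, `corpora`, `benchmarks`, and `evaluate` objects are hypothetical placeholders, and the gradient-ascent unlearning step is an assumption for illustration, not necessarily the exact method used in the paper.

```python
# Hedged sketch of unlearning-based data attribution (illustrative only).
import copy
import torch
import torch.nn as nn

def unlearn(model, corpus_batches, lr=1e-4, steps=10):
    """Approximate unlearning: push the loss *up* on the target corpus."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _, (inputs, targets) in zip(range(steps), corpus_batches):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        (-loss).backward()  # gradient ascent instead of descent
        opt.step()
    return model

def attribute(model, corpora, benchmarks, evaluate):
    """Score drop after unlearning a corpus ~ that corpus's contribution."""
    base = {b: evaluate(model, b) for b in benchmarks}  # pre-unlearning scores
    impact = {}
    for name, batches in corpora.items():
        forgetful = unlearn(copy.deepcopy(model), batches)
        impact[name] = {b: base[b] - evaluate(forgetful, b) for b in benchmarks}
    return impact
```

In this framing, a large score drop on, say, a reading-comprehension benchmark after unlearning a book corpus would suggest that corpus supports that capability, which is the kind of data-to-capability mapping the paper studies.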
Keywords
» Artificial intelligence » Pretraining