Summary of Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?, by Aryan Sajith et al.
Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
by Aryan Sajith, Krishna Chaitanya Rao Kathala
First submitted to arXiv on: 24 Nov 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This study explores the interplay between training data quality and quantity in the performance of small language models (SLMs), using the TinyStories dataset. The investigation examines how varying dataset sizes (25% to 100% of the corpus) and duplication rates (0% to 100%) affect model performance, evaluated through validation loss, accuracy, and perplexity. Results reveal that training data quality has the more significant influence on overall SLM performance: minimal duplication slightly improves accuracy (+0.87%) without significantly increasing perplexity (+0.52%), but excessive duplication causes pronounced degradation (a 40% drop in accuracy at 100% duplication). The findings have implications beyond raw model performance, including the financial, computational, and environmental burdens of training large-scale models. By clarifying the relative importance of data quality versus quantity, the research aims to help democratize AI technology, making advanced models more accessible and sustainable. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This study looks at whether the quality of training data or the sheer amount of it matters more for small language models. The researchers used a dataset called TinyStories to test different amounts of data (25% to 100%) and different levels of duplication (0% to 100%). The results showed that better data makes the model better, even when there is less of it. A little duplication can even make the model slightly more accurate without making it worse, but too much duplication makes the model a lot worse. This matters because training big models costs a lot of energy and money, which puts them out of reach for many people. By understanding how to use good data rather than simply more data, we might make AI technology fairer and more environmentally friendly. |
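The experimental setup the summaries describe — varying subset size and duplication rate, then measuring perplexity alongside loss and accuracy — can be sketched roughly as follows. This is a hypothetical reconstruction for illustration: the function names and the exact sampling/duplication procedure are assumptions, not the paper's code.

```python
import math
import random

def make_training_subset(corpus, size_fraction, duplication_rate, seed=0):
    """Build a training subset along the lines described in the summary:
    keep a fraction of the corpus, then re-add a fraction of the kept
    examples as duplicates. (Illustrative only; the paper's procedure
    may differ.)"""
    rng = random.Random(seed)
    subset = rng.sample(corpus, int(len(corpus) * size_fraction))
    n_dup = int(len(subset) * duplication_rate)
    subset += rng.sample(subset, n_dup)  # append duplicated examples
    rng.shuffle(subset)
    return subset

def perplexity(mean_cross_entropy_loss):
    """Perplexity is the exponential of the mean token-level
    cross-entropy loss, one of the three metrics the study reports."""
    return math.exp(mean_cross_entropy_loss)
```

For example, a 25% subset of a 100-example corpus with a 100% duplication rate yields 50 training examples (25 originals plus 25 duplicates), and a model whose mean cross-entropy loss is 0 has a perplexity of exactly 1.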
Keywords
» Artificial intelligence » Perplexity