Summary of Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?, by Aryan Sajith et al.
Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
by Aryan Sajith, Krishna Chaitanya Rao Kathala
First submitted to arXiv on: 24 Nov 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This study explores the interplay between training data quality and quantity in the performance of small language models (SLMs), using the TinyStories dataset. The investigation examines how varying dataset sizes (25% to 100% of the corpus) and duplication rates (0% to 100%) affect model performance, evaluated through validation loss, accuracy, and perplexity. Results reveal that training data quality has the more significant influence on overall SLM performance: minimal duplication slightly improves accuracy (+0.87%) without significantly increasing perplexity (+0.52%), but excessive duplication causes pronounced degradation (a 40% drop in accuracy at 100% duplication). The findings have implications beyond raw model performance, including the financial, computational, and environmental burdens of training large-scale models. By clarifying the relative importance of data quality versus quantity, the research aims to help democratize AI technology, making advanced models more accessible and sustainable. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This study looks at whether the quality of training data or the sheer amount of it matters more for small language models. The researchers used a dataset called TinyStories to test different amounts of data (25% to 100%) and different levels of duplication (0% to 100%). The results showed that better data makes the model better, even when there is less of it. A little duplication can even make the model slightly more accurate without making it worse, but too much duplication makes the model a lot worse. This matters because training big models costs a lot of energy and money, which puts them out of reach for many people. By understanding how to use good data rather than simply more data, we might make AI technology fairer and more environmentally friendly. |
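The experimental setup the summaries describe — varying subset size and duplication rate, then measuring perplexity alongside loss and accuracy — can be sketched roughly as follows. This is a hypothetical reconstruction for illustration: the function names and the exact sampling/duplication procedure are assumptions, not the paper's code.

```python
import math
import random

def make_training_subset(corpus, size_fraction, duplication_rate, seed=0):
    """Build a training subset along the lines described in the summary:
    keep a fraction of the corpus, then re-add a fraction of the kept
    examples as duplicates. (Illustrative only; the paper's procedure
    may differ.)"""
    rng = random.Random(seed)
    subset = rng.sample(corpus, int(len(corpus) * size_fraction))
    n_dup = int(len(subset) * duplication_rate)
    subset += rng.sample(subset, n_dup)  # append duplicated examples
    rng.shuffle(subset)
    return subset

def perplexity(mean_cross_entropy_loss):
    """Perplexity is the exponential of the mean token-level
    cross-entropy loss, one of the three metrics the study reports."""
    return math.exp(mean_cross_entropy_loss)
```

For example, a 25% subset of a 100-example corpus with a 100% duplication rate yields 50 training examples (25 originals plus 25 duplicates), and a model whose mean cross-entropy loss is 0 has a perplexity of exactly 1.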
Keywords
» Artificial intelligence » Perplexity