Summary of Scaling Retrieval-Based Language Models with a Trillion-Token Datastore, by Rulin Shao et al.
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
by Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
First submitted to arXiv on: 9 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper explores the scalability of language models (LMs) along an additional dimension: the size of the datastore used at inference time. The researchers find that increasing the datastore size monotonically improves performance on language modeling and downstream tasks, without saturating, so that a smaller model paired with a large datastore can outperform a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, the study shows that larger datastores can significantly improve model performance for the same training compute budget. (A minimal illustrative sketch of this retrieval setup follows the table.) |
Low | GrooveSquid.com (original content) | In simple terms, this paper looks at how giving language models more information to draw on at inference time makes them better at understanding and completing tasks. The researchers constructed MassiveDS, the largest open-source datastore of its kind, and designed an efficient pipeline for studying how datastores of different sizes affect model performance. |
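The "smaller model plus large datastore" result in the medium summary rests on retrieval-based inference: at generation time, the LM first retrieves passages from an external datastore and then conditions on them when generating an answer. The sketch below is a minimal illustration of that setup under these assumptions; the `Datastore`, `Retriever`, and `generate_with_context` names and the toy word-overlap scoring are placeholders invented for clarity and do not reflect the paper's actual code or the MassiveDS pipeline, which retrieves over a trillion-token corpus with a real retriever.

```python
# Minimal, illustrative sketch of datastore-augmented inference.
# All names here are hypothetical placeholders, not the paper's implementation.

from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    score: float


class Datastore:
    """A collection of text passages the model can retrieve from at inference time.

    Per the summary, growing this collection (rather than the model) is the
    scaling dimension the paper studies.
    """

    def __init__(self, passages: list[str]):
        self.passages = passages


class Retriever:
    """Toy lexical retriever: scores passages by word overlap with the query."""

    def retrieve(self, query: str, datastore: Datastore, k: int = 3) -> list[Passage]:
        query_words = set(query.lower().split())
        scored = [
            Passage(text=p, score=len(query_words & set(p.lower().split())))
            for p in datastore.passages
        ]
        return sorted(scored, key=lambda p: p.score, reverse=True)[:k]


def generate_with_context(model, query: str, retriever: Retriever, datastore: Datastore) -> str:
    """Prepend retrieved passages to the query before generation.

    `model` stands in for any LM object exposing a generate(prompt) method.
    """
    passages = retriever.retrieve(query, datastore)
    context = "\n".join(p.text for p in passages)
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    return model.generate(prompt)
```

In this reading, the paper's finding is that enlarging the `Datastore` keeps improving downstream accuracy without saturation, so a smaller LM with a bigger datastore can match or beat a larger LM that relies on parameters alone.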
Keywords
* Artificial intelligence
* Inference
* Pretraining