Summary of Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens, by Jiacheng Liu et al.
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
by Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
First submitted to arXiv on 30 Jan 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract (read it on arXiv) |
Medium | GrooveSquid.com (original content) | This paper aims to revitalize n-gram language models (LMs) by modernizing them for large-scale training and arbitrary values of n. The authors train the largest n-gram LM ever built on 5 trillion tokens and introduce a new ∞-gram LM with backoff. They also develop an engine called infini-gram, powered by suffix arrays, which can compute ∞-gram probabilities at millisecond-level latency. The ∞-gram framework enables novel analyses of human-written and machine-generated text: the ∞-gram LM reaches 47% accuracy on next-token prediction and complements neural LLMs to reduce their perplexity. Analyzing machine-generated text also reveals irregularities in the agreement between machine text and the ∞-gram LM with respect to suffix length, pointing to deficiencies in neural LLM pretraining and in the positional embeddings of Transformers. (A minimal code sketch of the ∞-gram idea appears just after this table.) |
Low | GrooveSquid.com (original content) | This paper is about updating language models called n-gram LMs to make them more powerful. The authors train their updated model on a huge dataset and also create a new way to calculate probabilities quickly. They use this new method to analyze human-written and machine-generated text and show that the updated model can predict what comes next with fairly high accuracy, which helps improve neural LLMs. The study also uncovers some problems with how current neural language models are trained, problems that these updated n-gram LMs help reveal. |
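To make the ∞-gram idea above concrete, here is a minimal, illustrative Python sketch. It is not the authors’ infini-gram engine: it backs off to the longest suffix of the context that occurs in a toy token list and counts continuations by naive scanning, whereas the real engine answers such count queries with suffix arrays over trillions of tokens. The function name and data below are hypothetical.

```python
from collections import Counter

def infty_gram_next_token(corpus_tokens, context_tokens):
    """Estimate P(next token | context) with the ∞-gram backoff idea:
    use the longest suffix of the context that occurs in the corpus,
    then count which tokens follow its occurrences."""
    for start in range(len(context_tokens) + 1):
        suffix = context_tokens[start:]  # back off: drop tokens from the left
        continuations = Counter()
        # Naive linear scan; the paper's infini-gram engine answers these
        # count queries with suffix arrays to reach millisecond-level latency.
        for i in range(len(corpus_tokens) - len(suffix)):
            if corpus_tokens[i:i + len(suffix)] == suffix:
                continuations[corpus_tokens[i + len(suffix)]] += 1
        if continuations:
            total = sum(continuations.values())
            return {tok: count / total for tok, count in continuations.items()}
    return {}  # empty corpus: no estimate possible

# Toy usage with made-up data:
corpus = "the cat sat on the mat and the cat slept".split()
print(infty_gram_next_token(corpus, "she said that the cat".split()))
# backs off to the suffix ["the", "cat"] -> {'sat': 0.5, 'slept': 0.5}
```

The backoff loop mirrors the summary’s description of using arbitrarily long context: the effective n is whatever the corpus supports. Swapping the linear scan for suffix-array lookups is what the paper relies on to make this feasible at trillion-token scale.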
Keywords
- Artificial intelligence
- Perplexity
- Pretraining
- Token