Summary of Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens, by Jiacheng Liu et al.
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
by Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
First submitted to arXiv on 30 Jan 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract (read it on arXiv) |
Medium | GrooveSquid.com (original content) | This paper aims to revitalize n-gram language models (LMs) by modernizing them for large-scale training and arbitrary values of n. The authors train the largest n-gram LM ever built on 5 trillion tokens and introduce a new ∞-gram LM with backoff. They also develop an engine called infini-gram, powered by suffix arrays, which can compute ∞-gram probabilities at millisecond-level latency. The ∞-gram framework enables novel analyses of human-written and machine-generated text: the ∞-gram LM reaches 47% accuracy on next-token prediction and complements neural LLMs to reduce their perplexity. Analyzing machine-generated text also reveals irregularities in the agreement between machine text and the ∞-gram LM with respect to suffix length, pointing to deficiencies in neural LLM pretraining and in the positional embeddings of Transformers. (A minimal code sketch of the ∞-gram idea appears just after this table.) |
Low | GrooveSquid.com (original content) | This paper is about updating language models called n-gram LMs to make them more powerful. The authors train their updated model on a huge dataset and also create a new way to calculate probabilities quickly. They use this new method to analyze human-written and machine-generated text and show that the updated model can predict what comes next with fairly high accuracy, which helps improve neural LLMs. The study also uncovers some problems with how current neural language models are trained, problems that these updated n-gram LMs help reveal. |
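To make the ∞-gram idea above concrete, here is a minimal, illustrative Python sketch. It is not the authors’ infini-gram engine: it backs off to the longest suffix of the context that occurs in a toy token list and counts continuations by naive scanning, whereas the real engine answers such count queries with suffix arrays over trillions of tokens. The function name and data below are hypothetical.

```python
from collections import Counter

def infty_gram_next_token(corpus_tokens, context_tokens):
    """Estimate P(next token | context) with the ∞-gram backoff idea:
    use the longest suffix of the context that occurs in the corpus,
    then count which tokens follow its occurrences."""
    for start in range(len(context_tokens) + 1):
        suffix = context_tokens[start:]  # back off: drop tokens from the left
        continuations = Counter()
        # Naive linear scan; the paper's infini-gram engine answers these
        # count queries with suffix arrays to reach millisecond-level latency.
        for i in range(len(corpus_tokens) - len(suffix)):
            if corpus_tokens[i:i + len(suffix)] == suffix:
                continuations[corpus_tokens[i + len(suffix)]] += 1
        if continuations:
            total = sum(continuations.values())
            return {tok: count / total for tok, count in continuations.items()}
    return {}  # empty corpus: no estimate possible

# Toy usage with made-up data:
corpus = "the cat sat on the mat and the cat slept".split()
print(infty_gram_next_token(corpus, "she said that the cat".split()))
# backs off to the suffix ["the", "cat"] -> {'sat': 0.5, 'slept': 0.5}
```

The backoff loop mirrors the summary’s description of using arbitrarily long context: the effective n is whatever the corpus supports. Swapping the linear scan for suffix-array lookups is what the paper relies on to make this feasible at trillion-token scale.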
Keywords
- Artificial intelligence
- Perplexity
- Pretraining
- Token