
Summary of Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens, by Jiacheng Liu et al.


Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

by Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi

First submitted to arxiv on: 30 Jan 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper aims to revitalize n-gram language models (LMs) by modernizing them for large-scale training and arbitrary values of n. The authors train the largest n-gram LM ever built on 5 trillion tokens and introduce a new ∞-gram LM with backoff, in which n can be arbitrarily large. They also develop an engine called infini-gram, powered by suffix arrays, which can compute ∞-gram probabilities at millisecond-level latency (a toy sketch of the backoff idea follows the summaries below). The ∞-gram framework enables novel analyses of human-written and machine-generated text. Results show that the ∞-gram LM achieves 47% accuracy for next-token prediction and complements neural LLMs to reduce their perplexity. Furthermore, analyzing machine-generated text reveals irregularities in its agreement with the ∞-gram LM with respect to suffix length, indicating deficiencies in neural LLM pretraining and in the positional embeddings of Transformers.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about updating language models called n-gram LMs to make them more powerful. The authors train their updated model on a huge dataset and also create a new way to calculate probabilities quickly. They use this new method to analyze human-written and machine-generated text, which shows that the updated model can predict what comes next in text with high accuracy. This is useful for improving language models like neural LLMs. The study also finds some problems with how current language models are trained, which could be fixed by using these updated n-gram LMs.
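
To make the backoff mechanism mentioned in the medium-difficulty summary concrete, here is a minimal sketch: for a given context, take the longest suffix of that context that occurs anywhere in the corpus, then estimate the next-token distribution from the counts of its continuations. This is an illustrative toy in Python, not the paper's implementation; the function name, the naive scanning, and the toy corpus are assumptions made here, whereas the real infini-gram engine answers such count queries with a suffix-array index over trillion-token corpora.

# Toy sketch of the "longest-suffix backoff" idea behind the ∞-gram estimate.
# Assumptions (not from the paper's code): tokens are plain strings, the corpus
# fits in memory, and counts come from naive scanning rather than the
# suffix-array index that the real infini-gram engine uses.
from collections import Counter

def infgram_next_token_distribution(corpus, context):
    # Back off from the longest suffix of the context to shorter ones until
    # the suffix is found in the corpus with at least one continuation token.
    for start in range(len(context) + 1):
        suffix = tuple(context[start:])
        nexts = Counter(
            corpus[i + len(suffix)]
            for i in range(len(corpus) - len(suffix))
            if tuple(corpus[i:i + len(suffix)]) == suffix
        )
        if nexts:
            # Normalize continuation counts to get P(w | longest matching suffix).
            total = sum(nexts.values())
            return {w: c / total for w, c in nexts.items()}
    return {}  # only reached if the corpus is empty

# Toy usage: "sat on the" is followed once by "mat" and once by "rug".
corpus = "the cat sat on the mat the cat sat on the rug".split()
print(infgram_next_token_distribution(corpus, "sat on the".split()))
# -> {'mat': 0.5, 'rug': 0.5}

In the actual engine, each such count query can be answered by binary search over a suffix array of the tokenized corpus, which is what makes millisecond-level latency feasible even at trillion-token scale.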

Keywords

» Artificial intelligence  » Perplexity  » Pretraining  » Token