

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

by Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che

First submitted to arXiv on: 16 Aug 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract; it is not reproduced on this page.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper tackles the inference latency of large language models (LLMs), which grows as models scale up in size. Speculative decoding offers a lossless way to accelerate inference: cheaply drafted tokens are verified by the LLM in parallel. Existing retrieval-based drafting methods build libraries from pre-existing corpora or from generated n-grams, but they suffer from heavy storage requirements and time-consuming retrieval. To overcome these limitations, the authors introduce Token Recycling, which stores candidate tokens produced during decoding in an adjacency matrix and runs a breadth-first search (BFS) over it to construct a draft tree. The approach requires minimal additional storage, achieves roughly a 2x speedup across all LLM sizes, and outperforms existing train-free methods by 30% and a training-based method by 25%. It can be applied directly to any existing LLM and task without adaptation.
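To make the mechanism concrete, here is a minimal Python sketch of the draft-tree construction as the summary describes it: each token id maps to its top-k recently observed successor tokens (the "adjacency matrix"), and a BFS from the last generated token expands those candidates into a draft tree for the LLM to verify in parallel. All names and parameters here (TOP_K, MAX_DEPTH, successors, build_draft_tree) are illustrative assumptions, not the paper’s actual implementation.

from collections import deque

# Illustrative parameters (assumptions, not the paper's settings).
TOP_K = 3       # candidate successors kept per token
MAX_DEPTH = 3   # depth of the draft tree

# "Adjacency matrix" sketch: for each token id, the top-k candidate
# next tokens seen during earlier decoding steps (the recycled "trash").
# In the paper's scheme this table would be refreshed with new candidates
# as decoding proceeds; that update step is omitted here.
successors: dict[int, list[int]] = {
    101: [7, 42, 13],
    7:   [42, 9, 101],
    42:  [13, 7, 9],
}

def build_draft_tree(root: int) -> list[tuple[int, int]]:
    """BFS from the last generated token; returns (parent, child) edges.

    The resulting tree would then be checked by the LLM in a single
    parallel verification pass, keeping decoding lossless.
    """
    edges: list[tuple[int, int]] = []
    queue = deque([(root, 0)])
    while queue:
        token, depth = queue.popleft()
        if depth == MAX_DEPTH:
            continue
        for cand in successors.get(token, [])[:TOP_K]:
            edges.append((token, cand))
            queue.append((cand, depth + 1))
    return edges

print(build_draft_tree(101))

Note that a naive BFS like this grows the tree by a factor of TOP_K per level; the actual method presumably caps the tree to a small fixed budget so that verification stays cheap.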
Low Difficulty Summary (written by GrooveSquid.com, original content)
Token Recycling is a new approach for speeding up inference in large language models. It stores candidate tokens in an adjacency matrix and uses a breadth-first search algorithm to build a draft tree, enabling faster decoding with minimal extra storage. The method outperforms existing train-free methods by 30% and even a training-based method by 25%, and it can be used with any existing LLM and task without adaptation.

Keywords

  • Artificial intelligence
  • Inference
  • Token