

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

by Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che

First submitted to arXiv on: 16 Aug 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract; it is not reproduced on this page.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper tackles the inference latency of large language models (LLMs), which grows as models scale up in size. Speculative decoding offers a lossless way to accelerate inference: cheaply drafted tokens are verified by the LLM in parallel. Existing retrieval-based drafting methods build libraries from pre-existing corpora or from generated n-grams, but they suffer from heavy storage requirements and time-consuming retrieval. To overcome these limitations, the authors introduce Token Recycling, which stores candidate tokens produced during decoding in an adjacency matrix and runs a breadth-first search (BFS) over it to construct a draft tree. The approach requires minimal additional storage, achieves roughly a 2x speedup across all LLM sizes, and outperforms existing train-free methods by 30% and a training-based method by 25%. It can be applied directly to any existing LLM and task without adaptation.
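To make the mechanism concrete, here is a minimal Python sketch of the draft-tree construction as the summary describes it: each token id maps to its top-k recently observed successor tokens (the "adjacency matrix"), and a BFS from the last generated token expands those candidates into a draft tree for the LLM to verify in parallel. All names and parameters here (TOP_K, MAX_DEPTH, successors, build_draft_tree) are illustrative assumptions, not the paper’s actual implementation.

from collections import deque

# Illustrative parameters (assumptions, not the paper's settings).
TOP_K = 3       # candidate successors kept per token
MAX_DEPTH = 3   # depth of the draft tree

# "Adjacency matrix" sketch: for each token id, the top-k candidate
# next tokens seen during earlier decoding steps (the recycled "trash").
# In the paper's scheme this table would be refreshed with new candidates
# as decoding proceeds; that update step is omitted here.
successors: dict[int, list[int]] = {
    101: [7, 42, 13],
    7:   [42, 9, 101],
    42:  [13, 7, 9],
}

def build_draft_tree(root: int) -> list[tuple[int, int]]:
    """BFS from the last generated token; returns (parent, child) edges.

    The resulting tree would then be checked by the LLM in a single
    parallel verification pass, keeping decoding lossless.
    """
    edges: list[tuple[int, int]] = []
    queue = deque([(root, 0)])
    while queue:
        token, depth = queue.popleft()
        if depth == MAX_DEPTH:
            continue
        for cand in successors.get(token, [])[:TOP_K]:
            edges.append((token, cand))
            queue.append((cand, depth + 1))
    return edges

print(build_draft_tree(101))

Note that a naive BFS like this grows the tree by a factor of TOP_K per level; the actual method presumably caps the tree to a small fixed budget so that verification stays cheap.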
Low Difficulty Summary (written by GrooveSquid.com, original content)
Token Recycling is a new approach for speeding up inference in large language models. It stores candidate tokens in an adjacency matrix and uses a breadth-first search algorithm to build a draft tree, enabling faster decoding with minimal extra storage. The method outperforms existing train-free methods by 30% and even a training-based method by 25%, and it can be used with any existing LLM and task without adaptation.

Keywords

  • Artificial intelligence
  • Inference
  • Token