Summary of BASS: Batched Attention-optimized Speculative Sampling, by Haifeng Qian et al.
BASS: Batched Attention-optimized Speculative Sampling
by Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras
First submitted to arXiv on: 24 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper introduces a system for batched speculative decoding that enables fast generation of multiple sequences while preserving the latency benefits of speculation. Speculative decoding has been shown to improve latency and throughput when hosting large language models, but most existing implementations generate only a single sequence at a time, leaving open the challenge of speculative decoding in a batched setting. The proposed system sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as higher quality of generations within a time budget. For example, for a 7.8B-parameter model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, for an overall throughput of 1.1K tokens per second. The system achieves a 2.15X speed-up over optimized regular decoding, and it generates sequences with a HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. (A toy sketch of the draft-then-verify idea appears below the table.) |
Low | GrooveSquid.com (original content) | This paper is about making computers generate text faster. Today’s fast “speculative” tricks can only speed up one piece of text at a time, but we often need many pieces at once. This paper describes a way for computers to generate a whole batch of texts quickly. For example, with this new method, a computer can generate about 1,100 tokens (small units of text) per second. That’s really fast! The new method also lets the computer use its processing power more efficiently, so within the same amount of time it can produce better answers, such as computer code that passes more tests. |
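To make the medium summary concrete, below is a minimal, self-contained Python sketch of the draft-then-verify loop behind speculative decoding, applied to every sequence in a batch. The stand-in models and names (`draft_next_tokens`, `target_argmax`, `speculative_step`, `batched_generate`) are hypothetical illustrations, not the paper’s API: BASS itself uses a small draft LLM plus the large target LLM, runs verification as a single batched GPU pass, and adds attention optimizations for the ragged sequence lengths that batching creates.

```python
import random

VOCAB = list(range(100))  # toy vocabulary of integer token ids


def draft_next_tokens(seq, k):
    """Stand-in for a small draft model: cheaply propose k candidate
    tokens, one after another, given the prefix `seq`."""
    ctx, proposal = list(seq), []
    for _ in range(k):
        tok = random.Random(hash(tuple(ctx))).choice(VOCAB)
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def target_argmax(seq):
    """Stand-in for the large target model: the token it would choose
    next for prefix `seq` under greedy decoding."""
    return random.Random(hash(tuple(seq)) ^ 0xBEEF).choice(VOCAB)


def speculative_step(seq, k):
    """One draft-then-verify step for a single sequence. In a real
    system the target-model calls below are fused into one batched
    forward pass; they run sequentially here for clarity."""
    accepted = []
    for tok in draft_next_tokens(seq, k):
        expected = target_argmax(seq + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction; stop here
            return accepted
        accepted.append(tok)           # draft token verified, keep going
    # all k draft tokens accepted: the same target pass yields a bonus token
    accepted.append(target_argmax(seq + accepted))
    return accepted


def batched_generate(prompts, max_new, k=4):
    """Advance a whole batch with speculative decoding. Sequences accept
    different numbers of tokens per step, so their lengths become ragged,
    which is exactly the situation BASS's attention handling targets."""
    seqs = [list(p) for p in prompts]
    while any(len(s) - len(p) < max_new for s, p in zip(seqs, prompts)):
        for prompt, seq in zip(prompts, seqs):
            if len(seq) - len(prompt) < max_new:
                seq.extend(speculative_step(seq, k))
    return seqs


if __name__ == "__main__":
    for i, seq in enumerate(batched_generate([[1, 2, 3], [7]], max_new=8)):
        print(f"sequence {i}: {seq}")
```

The sketch shows why batching is the hard part: each sequence accepts a different number of tokens per step, so the lengths in the batch diverge, and handling that raggedness efficiently inside the attention computation is the core of the paper’s contribution.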
Keywords
» Artificial intelligence » Token