Summary of Hydragen: High-Throughput LLM Inference with Shared Prefixes, by Jordan Juravsky et al.
Hydragen: High-Throughput LLM Inference with Shared Prefixes
by Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
First submitted to arXiv on: 7 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper but is written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper introduces Hydragen, a hardware-aware, exact implementation of attention for large language models (LLMs) that significantly improves decoding efficiency when batches of sequences share a common prefix. The authors identify the attention operation as a key bottleneck in batched LLM inference, particularly when many sequences begin with the same prompt. To address this, they decompose attention into separate prefix and suffix computations, which allows the shared prefix to be processed efficiently across the batch and reduces redundant memory reads (a small code sketch of this decomposition follows the table). Experiments show that Hydragen achieves up to a 32x speedup over competitive baselines on CodeLlama-13b, with the speedup growing with batch size and shared prefix length. The authors also demonstrate that Hydragen extends to tree-based prompt sharing patterns, yielding a 55% reduction in inference time for competitive programming problems. |
| Low | GrooveSquid.com (original content) | This paper helps make language models like CodeLlama-13b faster and more efficient when we use them to understand or generate text. Right now, these models slow down when they have to process many pieces of text that all start with the same thing, such as a shared chatbot prompt. The researchers found that the “attention” part of the model is what slows things down in this situation. So they developed a new way to do attention that takes advantage of the fact that some parts of the text are identical across the batch. This makes inference run up to 32 times faster than competing approaches! They also showed how the same idea can speed things up when prompts are shared in more complicated, tree-like ways. |
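To make the decomposition described in the medium-difficulty summary concrete, here is a minimal PyTorch sketch of splitting attention into a shared-prefix block and a per-sequence suffix block, then recombining the partial results exactly via log-sum-exp rescaling. The function names (`attn_block`, `combine`), shapes, and sizes are illustrative assumptions, not the authors' implementation; Hydragen additionally restructures the prefix computation into efficient matrix-matrix products, which this toy version only hints at by storing the prefix keys and values once for the whole batch.

```python
import torch

def attn_block(q, k, v):
    """Attention over one K/V block. Returns the unnormalized weighted sum,
    the softmax denominator, and the per-query max (log-sum-exp pieces)."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale          # (..., q_len, kv_len)
    m = scores.amax(dim=-1, keepdim=True)               # max for numerical stability
    exp_scores = (scores - m).exp()
    return exp_scores @ v, exp_scores.sum(dim=-1, keepdim=True), m

def combine(parts):
    """Merge block-wise partial results into exact softmax attention over the
    concatenation of all blocks."""
    global_max = torch.stack([m for _, _, m in parts]).amax(dim=0)
    num = sum(o * (m - global_max).exp() for o, _, m in parts)
    den = sum(d * (m - global_max).exp() for _, d, m in parts)
    return num / den

# Toy decode step: B sequences, each with one new query token, sharing one prefix.
B, P, S, D = 4, 128, 16, 64                              # hypothetical sizes
q = torch.randn(B, 1, D)                                 # one query per sequence
k_pre, v_pre = torch.randn(1, P, D), torch.randn(1, P, D)  # shared prefix KV, stored once
k_suf, v_suf = torch.randn(B, S, D), torch.randn(B, S, D)  # per-sequence suffix KV

# Prefix attention: all B queries attend to the single shared prefix copy.
pre = attn_block(q.view(1, B, D), k_pre, v_pre)
pre = tuple(t.view(B, 1, -1) for t in pre)

# Suffix attention: ordinary per-sequence attention over the unshared tokens.
suf = attn_block(q, k_suf, v_suf)

out = combine([pre, suf])

# Sanity check: the combined result matches standard attention over prefix + suffix.
k_full = torch.cat([k_pre.expand(B, P, D), k_suf], dim=1)
v_full = torch.cat([v_pre.expand(B, P, D), v_suf], dim=1)
full = torch.softmax((q @ k_full.transpose(-2, -1)) * D ** -0.5, dim=-1) @ v_full
assert torch.allclose(out, full, atol=1e-5)
```

Because the recombination is exact rather than approximate, the output matches standard attention; in this framing, the savings come from reading the shared prefix keys and values from memory only once per batch instead of once per sequence.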
Keywords
* Artificial intelligence
* Attention
* Inference
* Prompt