
Summary of Hydragen: High-Throughput LLM Inference with Shared Prefixes, by Jordan Juravsky et al.


Hydragen: High-Throughput LLM Inference with Shared Prefixes

by Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini

First submitted to arXiv on: 7 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper introduces Hydragen, a hardware-aware, exact implementation of attention for large language models (LLMs) that significantly improves decoding throughput in batched inference. The authors identify the attention operation, and in particular its repeated reads of the key-value (KV) cache, as a bottleneck in LLM inference when processing batches of sequences that share a common prefix. To address this, they decompose attention into separate computations over the shared prefix and the unique suffixes; the prefix attention can then be batched across all sequences, reducing redundant memory reads (see the illustrative code sketch after the summaries below). Experimental results show that Hydragen achieves up to a 32x end-to-end speedup over competitive baselines on the CodeLlama-13b model, with the speedup growing with batch size and shared prefix length. The authors also apply Hydragen to tree-based prompt sharing patterns, reducing inference time on competitive programming problems by 55%.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper helps make language models like CodeLlama-13b faster and more efficient when we use them to understand or generate text. Right now, these models slow down when they have to process many pieces of text that all start with the same thing, such as a shared chatbot prompt. The researchers found that the “attention” part of the model is what slows things down in this situation. So they developed a new way to do attention that takes advantage of the fact that part of the text is the same across the whole batch. This makes inference run up to 32 times faster than competitive baselines! They also showed that the same idea speeds up more complicated sharing patterns, like trees of prompts, cutting inference time on competitive programming problems by more than half.
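
To make the decomposition in the medium summary concrete, here is a minimal NumPy sketch of the general idea: attention over the shared prefix and over each sequence's own suffix is computed separately, then merged exactly using the softmax log-sum-exp. This is an illustrative toy, not the authors' implementation; the helper names, shapes, and toy dimensions are assumptions made for the example.

```python
import numpy as np

def attention_with_lse(q, k, v):
    """Single-head attention of queries q against keys k / values v.
    Also returns the log-sum-exp of the scores, needed to merge
    partial attention results over disjoint KV chunks exactly."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_q, n_kv)
    lse = np.log(np.sum(np.exp(scores), axis=-1))  # (n_q,)
    weights = np.exp(scores - lse[:, None])        # softmax over this chunk
    return weights @ v, lse

def combine(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results into the exact attention
    over the concatenation of their KV chunks."""
    lse = np.logaddexp(lse_a, lse_b)
    return np.exp(lse_a - lse)[:, None] * out_a + np.exp(lse_b - lse)[:, None] * out_b

# Toy setup: a batch of sequences that all share one prefix.
d, prefix_len, suffix_len, batch = 64, 128, 8, 4
rng = np.random.default_rng(0)
k_prefix = rng.standard_normal((prefix_len, d))
v_prefix = rng.standard_normal((prefix_len, d))
k_suffix = rng.standard_normal((batch, suffix_len, d))
v_suffix = rng.standard_normal((batch, suffix_len, d))
q = rng.standard_normal((batch, 1, d))             # one decoding query per sequence

# Prefix attention: stack all queries and read the shared prefix KV once.
out_p, lse_p = attention_with_lse(q.reshape(batch, d), k_prefix, v_prefix)

# Suffix attention: per sequence, over each sequence's own (short) KV cache,
# then merge with the prefix result to recover exact full attention.
outputs = []
for b in range(batch):
    out_s, lse_s = attention_with_lse(q[b], k_suffix[b], v_suffix[b])
    outputs.append(combine(out_p[b:b+1], lse_p[b:b+1], out_s, lse_s))
out = np.stack(outputs)                             # (batch, 1, d)
```

Because the prefix keys and values are shared, the prefix step above is a single matrix product over all queries in the batch, which is what lets the decomposition avoid re-reading the prefix KV cache once per sequence.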

Keywords

  • Artificial intelligence
  • Attention
  • Inference
  • Prompt