RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

by Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu

First submitted to arXiv on: 16 Sep 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper and are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper’s original abstract, written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper proposes RetrievalAttention, an approach that accelerates attention computation and reduces GPU memory consumption in Transformer-based large language models (LLMs). The core problem is the quadratic time complexity of attention, which makes scaling LLMs to longer contexts extremely slow. To address this, the authors exploit the dynamic sparsity of the attention mechanism: they build approximate nearest neighbor search (ANNS) indexes over the key-value vectors in CPU memory and retrieve only the most relevant vectors during generation. They observe, however, that off-the-shelf ANNS indexes are ineffective because query vectors in attention are out-of-distribution (OOD) with respect to the key vectors. To mitigate this, they design an attention-aware vector search algorithm that adapts to the distribution of query vectors. The evaluation shows that RetrievalAttention achieves near-full-attention accuracy while accessing only 1-3% of the data, significantly reducing inference cost and GPU memory footprint.
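To make the retrieval idea concrete, below is a minimal, self-contained sketch (not the authors’ implementation) of attention restricted to the top-k most relevant key-value pairs. It uses an exact inner-product top-k in NumPy as a stand-in for the CPU-resident ANNS index described in the paper; the function names, dimensions, and value of k are illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' code. Exact top-k search
# stands in for the CPU-side ANNS index that RetrievalAttention builds;
# a real index would return a similar set while scanning only ~1-3% of the vectors.
import numpy as np

def softmax(x):
    x = x - x.max()                      # numerical stability
    e = np.exp(x)
    return e / e.sum()

def full_attention(q, K, V):
    # Standard single-query attention over all n cached key/value vectors.
    scores = (K @ q) / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def retrieval_attention(q, K, V, k=100):
    # Approximate attention: score only the k keys most similar to the
    # query (by inner product), ignoring the rest of the KV cache.
    idx = np.argpartition(K @ q, -k)[-k:]
    scores = (K[idx] @ q) / np.sqrt(q.shape[0])
    return softmax(scores) @ V[idx]

rng = np.random.default_rng(0)
n, d = 10_000, 64                        # cached tokens, head dimension
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
q = 2.0 * K[123] + 0.1 * rng.standard_normal(d)  # query that attends mostly to one cached key

exact = full_attention(q, K, V)
approx = retrieval_attention(q, K, V, k=100)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```

In the paper’s setting, the index lives in CPU memory and only the retrieved key-value vectors participate in the GPU-side attention, which is where the reported memory and latency savings come from.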
Low Difficulty Summary (original content by GrooveSquid.com)
This paper talks about how big language models can get super slow and use too much computer power when the text they have to read is really long. The problem is that it takes a lot of time for the model to figure out what’s important and what’s not. The authors came up with an idea called RetrievalAttention that makes this faster and uses less memory. They did it by building a special kind of search index on a different part of the computer (the CPU’s memory) that helps find the most important pieces quickly. The catch is that ordinary search indexes aren’t very good at this, because they weren’t designed for the way language models ask their questions. So the authors made something called attention-aware vector search that adapts the index to language models. When they tested it, they found that RetrievalAttention worked almost as well as the slow way but was much faster and used less GPU memory.

Keywords

» Artificial intelligence  » Attention  » Inference  » Nearest neighbor  » Transformer