LoMA: Lossless Compressed Memory Attention

by Yumeng Wang, Zhenyang Xiao

First submitted to arXiv on: 16 Jan 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Lossless Compressed Memory Attention (LoMA) is a novel approach for Large Language Models (LLMs) that addresses the heavy demand on GPU memory and computational resources when handling long contexts. LoMA compresses the Key-Value (KV) cache losslessly, reducing memory and computational demands during autoregressive generation. Using a specialized training procedure and an optimized autoregressive generation algorithm, the model compresses the KV cache after every t·c generated tokens, given a compression ratio c and a target compressed length t, within a single inference pass and without depending on auxiliary models. Experimental validation demonstrates that LoMA significantly reduces computational consumption and memory usage while keeping the KV cache compression lossless. (A minimal sketch of this compression schedule follows the summaries below.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
LoMA is a new way to help large language models use less computer power and memory when they need to generate lots of text at once. Right now, these models can be slow and use up too many resources because they have to keep everything from the text so far in their “memory”. LoMA lets them compress this information without losing any of it, making them faster and more efficient.

Keywords

* Artificial intelligence  * Attention  * Autoregressive  * Inference