
Summary of EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models, by Junhao Hu et al.


EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models

by Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie

First submitted to arXiv on: 20 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Large Language Models (LLMs) are essential for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs grow more complex. Existing context caching methods improve performance by exploiting inter-request dependency and reusing the key-value (KV) cache across requests, reducing time-to-first-token (TTFT). However, these methods require exact token-prefix matches, which limits cache reuse in few-shot learning, multi-document QA, and retrieval-augmented generation, where prefixes may vary. This paper presents EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of a token chunk's position (or prefix). EPIC features two key designs: AttnLink, which leverages static attention sparsity to minimize the recomputation needed to recover accuracy, and KVSplit, a customizable chunking method that preserves semantic coherence. Experiments show that EPIC delivers up to 8x improvements in TTFT and 7x in throughput over existing systems, with negligible or no accuracy loss.
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about making it faster to use Large Language Models (LLMs) for things like answering questions or generating text. Right now, as inputs get more complex, it takes a long time to get the first result. A technique called context caching can help by reusing what is already known from previous requests, but it only works when the beginning of the input matches a previous request exactly. The authors developed a new way to cache information that doesn't depend on where a chunk of text appears in the input. They also created two tools, AttnLink and KVSplit, that work together to make this new caching system faster and more efficient.

Keywords

» Artificial intelligence  » Attention  » Few shot  » Retrieval augmented generation  » Token