Summary of Block Transformer: Global-to-Local Language Modeling for Fast Inference, by Namgyu Ho et al.
Block Transformer: Global-to-Local Language Modeling for Fast Inference
by Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
First submitted to arXiv on: 4 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The Block Transformer, a novel approach to autoregressive transformers, tackles the inference bottlenecks of self-attention by adopting hierarchical global-to-local modeling. It addresses two primary issues: the delay in obtaining the first token, since the entire prompt must be processed first, and the high memory I/O demand of fetching the entire key-value (KV) cache at every decoding step. The architecture combines coarse-grained global modeling with fine-grained local modeling: coarse attention over block embeddings at lower layers captures global context while minimizing KV cache overhead, and fine-grained attention within each block at upper layers models local details with a lightweight cache (a toy sketch of this layout follows the table). Pretraining vanilla and Block Transformers from scratch demonstrates that the latter achieves 10-20x higher inference throughput than comparable vanilla transformers, with equivalent perplexity and zero-shot task performance. |
Low | GrooveSquid.com (original content) | The Block Transformer is a new way of building autoregressive transformers. It solves two big problems: it makes the first token appear faster, and it uses less memory when processing long sequences. The Block Transformer does this by dividing the sequence into blocks and using attention at different scales to capture both global and local information. By doing so, it can process sequences much faster than traditional transformers while still maintaining good performance. |
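The global-to-local split described in the medium summary can be illustrated with a short, self-contained PyTorch sketch. This is not the authors' implementation: the mean-pooling block embedder, the single global and local layers, the class name `BlockTransformerSketch`, and all hyperparameters are illustrative assumptions (the paper's model uses deeper block and token decoders and a learned embedder). The sketch only shows the shape of the idea: coarse causal attention over block embeddings keeps the global KV cache proportional to the number of blocks, while fine attention within each block handles local detail with a cache no longer than the block length.

```python
import torch
import torch.nn as nn


def causal_mask(n: int) -> torch.Tensor:
    # Boolean mask where True marks positions that may NOT be attended to.
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)


class BlockTransformerSketch(nn.Module):
    """Toy global-to-local language model (names and sizes are illustrative).

    Lower ("global") stage: causal attention over coarse block embeddings,
    so its KV cache scales with the number of blocks, not tokens.
    Upper ("local") stage: causal attention restricted to tokens inside
    each block, conditioned on that block's global context embedding.
    """

    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, block_len=4):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Encoder layers with causal masks stand in for decoder blocks here.
        self.global_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.local_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, T = tokens.shape
        L = self.block_len
        assert T % L == 0, "pad sequences to a multiple of the block length"
        x = self.tok_emb(tokens)                               # (B, T, D)

        # 1) Block embedder: mean-pool each block of L tokens (a simple choice).
        blocks = x.reshape(B, T // L, L, -1).mean(dim=2)       # (B, T/L, D)

        # 2) Global stage: coarse causal attention over block embeddings only.
        ctx = self.global_layer(blocks, src_mask=causal_mask(T // L))

        # Shift so that block i conditions only on context from blocks < i.
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)

        # 3) Local stage: attention within each block, with the block's global
        #    context added to every token in it; KV cache here is only L long.
        local_in = x.reshape(B, T // L, L, -1) + ctx.unsqueeze(2)
        local_in = local_in.reshape(B * (T // L), L, -1)
        out = self.local_layer(local_in, src_mask=causal_mask(L))
        return self.lm_head(out.reshape(B, T, -1))             # (B, T, vocab)


model = BlockTransformerSketch()
logits = model(torch.randint(0, 1000, (2, 16)))  # two toy 16-token sequences
print(logits.shape)                              # torch.Size([2, 16, 1000])
```

In this toy forward pass, a batch of two 16-token sequences with block length 4 yields logits of shape (2, 16, 1000). During generation, only the coarse block embeddings would need to be cached and fetched globally, which is the mechanism behind the throughput gains the paper reports; the exact speedup and quality numbers come from the paper itself, not from this sketch.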
Keywords
» Artificial intelligence » Attention » Autoregressive » Inference » Perplexity » Pretraining » Prompt » Self attention » Token » Transformer » Zero shot