Summary of Block Transformer: Global-to-Local Language Modeling for Fast Inference, by Namgyu Ho et al.
Block Transformer: Global-to-Local Language Modeling for Fast Inference
by Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
First submitted to arXiv on: 4 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The Block Transformer, a novel approach to autoregressive transformers, tackles the inference bottlenecks of self-attention by adopting hierarchical global-to-local modeling. It addresses two primary issues: the delay in obtaining the first token, since the entire prompt must be processed first, and the high memory I/O demand of fetching the entire key-value (KV) cache at every decoding step. The architecture combines coarse-grained global modeling with fine-grained local modeling: coarse attention over block embeddings at lower layers captures global context while minimizing KV cache overhead, and fine-grained attention within each block at upper layers models local details with a lightweight cache (a toy sketch of this layout follows the table). Pretraining vanilla and Block Transformers from scratch demonstrates that the latter achieves 10-20x higher inference throughput than comparable vanilla transformers, with equivalent perplexity and zero-shot task performance. |
Low | GrooveSquid.com (original content) | The Block Transformer is a new way of building autoregressive transformers. It solves two big problems: it makes the first token appear faster, and it uses less memory when processing long sequences. The Block Transformer does this by dividing the sequence into blocks and using attention at different scales to capture both global and local information. By doing so, it can process sequences much faster than traditional transformers while still maintaining good performance. |
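The global-to-local split described in the medium summary can be illustrated with a short, self-contained PyTorch sketch. This is not the authors' implementation: the mean-pooling block embedder, the single global and local layers, the class name `BlockTransformerSketch`, and all hyperparameters are illustrative assumptions (the paper's model uses deeper block and token decoders and a learned embedder). The sketch only shows the shape of the idea: coarse causal attention over block embeddings keeps the global KV cache proportional to the number of blocks, while fine attention within each block handles local detail with a cache no longer than the block length.

```python
import torch
import torch.nn as nn


def causal_mask(n: int) -> torch.Tensor:
    # Boolean mask where True marks positions that may NOT be attended to.
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)


class BlockTransformerSketch(nn.Module):
    """Toy global-to-local language model (names and sizes are illustrative).

    Lower ("global") stage: causal attention over coarse block embeddings,
    so its KV cache scales with the number of blocks, not tokens.
    Upper ("local") stage: causal attention restricted to tokens inside
    each block, conditioned on that block's global context embedding.
    """

    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, block_len=4):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Encoder layers with causal masks stand in for decoder blocks here.
        self.global_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.local_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, T = tokens.shape
        L = self.block_len
        assert T % L == 0, "pad sequences to a multiple of the block length"
        x = self.tok_emb(tokens)                               # (B, T, D)

        # 1) Block embedder: mean-pool each block of L tokens (a simple choice).
        blocks = x.reshape(B, T // L, L, -1).mean(dim=2)       # (B, T/L, D)

        # 2) Global stage: coarse causal attention over block embeddings only.
        ctx = self.global_layer(blocks, src_mask=causal_mask(T // L))

        # Shift so that block i conditions only on context from blocks < i.
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)

        # 3) Local stage: attention within each block, with the block's global
        #    context added to every token in it; KV cache here is only L long.
        local_in = x.reshape(B, T // L, L, -1) + ctx.unsqueeze(2)
        local_in = local_in.reshape(B * (T // L), L, -1)
        out = self.local_layer(local_in, src_mask=causal_mask(L))
        return self.lm_head(out.reshape(B, T, -1))             # (B, T, vocab)


model = BlockTransformerSketch()
logits = model(torch.randint(0, 1000, (2, 16)))  # two toy 16-token sequences
print(logits.shape)                              # torch.Size([2, 16, 1000])
```

In this toy forward pass, a batch of two 16-token sequences with block length 4 yields logits of shape (2, 16, 1000). During generation, only the coarse block embeddings would need to be cached and fetched globally, which is the mechanism behind the throughput gains the paper reports; the exact speedup and quality numbers come from the paper itself, not from this sketch.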
Keywords
» Artificial intelligence » Attention » Autoregressive » Inference » Perplexity » Pretraining » Prompt » Self attention » Token » Transformer » Zero shot