
Summary of Block Transformer: Global-to-Local Language Modeling for Fast Inference, by Namgyu Ho et al.


Block Transformer: Global-to-Local Language Modeling for Fast Inference

by Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

First submitted to arXiv on: 4 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
The Block Transformer is a novel autoregressive transformer architecture that adopts hierarchical global-to-local modeling to ease the inference bottlenecks of self-attention. It targets two primary costs: the delay before the first output token, caused by processing the entire prompt, and the heavy memory I/O of fetching the full key-value (KV) cache at every decoding step. The architecture isolates coarse-grained attention in its lower layers, where tokens are grouped into blocks so that global context is captured with a small KV cache, and applies fine-grained attention within each block in its upper layers to model local details with a lightweight cache. Pretraining vanilla and Block Transformers from scratch shows that the latter reaches 10-20x higher inference throughput than comparable vanilla transformers, with equivalent perplexity and zero-shot task performance.
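
To make the global-to-local split concrete, here is a minimal PyTorch sketch of the idea described above. It is not the authors' implementation: the module names, layer counts, and the simple concatenate-and-project block embedder are illustrative assumptions, and the real model uses dedicated decoder stacks with their own KV caches.

```python
# Minimal sketch of global-to-local block attention (assumed names and sizes;
# not the authors' implementation).
import torch
import torch.nn as nn


class BlockTransformerSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, block_len=4,
                 n_global_layers=2, n_local_layers=2, n_heads=4):
        super().__init__()
        self.block_len = block_len
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Embedder: concatenate a block's token embeddings and project to one vector.
        self.block_proj = nn.Linear(block_len * d_model, d_model)
        layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True)
        # Lower layers: coarse-grained causal attention over block embeddings.
        self.block_decoder = nn.TransformerEncoder(layer(), n_global_layers)
        # Upper layers: fine-grained causal attention restricted to each block.
        self.token_decoder = nn.TransformerEncoder(layer(), n_local_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) with seq_len divisible by block_len
        b, t = tokens.shape
        n_blocks = t // self.block_len
        d = self.tok_emb.embedding_dim
        x = self.tok_emb(tokens)                                    # (b, t, d)

        # 1) Aggregate each block of tokens into a single block embedding.
        blocks = self.block_proj(x.reshape(b, n_blocks, self.block_len * d))

        # 2) Global (block-level) causal attention: the KV cache grows per block,
        #    not per token.
        causal = torch.triu(torch.ones(n_blocks, n_blocks, dtype=torch.bool), 1)
        ctx = self.block_decoder(blocks, mask=causal)               # (b, n_blocks, d)
        # Shift so tokens of block i only see context summarizing blocks < i.
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)

        # 3) Local attention within each block, conditioned on its global context.
        local_x = x.reshape(b * n_blocks, self.block_len, d)
        local_x = local_x + ctx.reshape(b * n_blocks, 1, d)
        local_mask = torch.triu(
            torch.ones(self.block_len, self.block_len, dtype=torch.bool), 1)
        out = self.token_decoder(local_x, mask=local_mask)

        return self.lm_head(out.reshape(b, t, d))                   # next-token logits


model = BlockTransformerSketch()
logits = model(torch.randint(0, 1000, (2, 16)))  # 2 sequences, 16 tokens = 4 blocks each
print(logits.shape)                              # torch.Size([2, 16, 1000])
```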

Low Difficulty Summary (written by GrooveSquid.com, original content)
The Block Transformer is a new way of building autoregressive transformers. It solves two big problems: it makes the first token appear faster, and it uses less memory when processing long sequences. The Block Transformer does this by dividing the sequence into blocks and using attention at different scales to capture both global and local information. By doing so, it can process sequences much faster than traditional transformers while still maintaining good performance.
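
As a rough illustration of the memory point, the snippet below counts how many cached key-value positions a decoder must keep per layer under each attention pattern; the sequence and block lengths are made-up numbers, not figures from the paper.

```python
# Back-of-the-envelope KV-cache comparison (illustrative numbers only).
seq_len = 8192    # tokens processed so far (hypothetical)
block_len = 4     # tokens per block (hypothetical)

vanilla_kv = seq_len              # vanilla self-attention: one entry per past token
global_kv = seq_len // block_len  # block-level attention: one entry per past block
local_kv = block_len              # within-block attention: current block only

print(f"vanilla self-attention : {vanilla_kv} cached positions per layer")
print(f"block-level attention  : {global_kv} cached positions per layer")
print(f"within-block attention : {local_kv} cached positions per layer")
```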

Keywords

» Artificial intelligence  » Attention  » Autoregressive  » Inference  » Perplexity  » Pretraining  » Prompt  » Self attention  » Token  » Transformer  » Zero shot