
Summary of DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, by Yilong Chen et al.


DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion

by Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun

First submitted to arXiv on: 3 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large language models with billions of parameters demonstrate impressive performance, but the widely used Multi-Head Attention (MHA) incurs substantial computational and memory costs during inference. The proposed Decoupled-Head Attention (DHA) mechanism adaptively configures group sharing for key heads and value heads across layers, achieving a better balance between performance and efficiency. By transforming MHA checkpoints into DHA models through linear fusion of similar head parameters (a simplified sketch of this head-fusion idea follows the summaries below), the approach requires a mere 0.25% of the original model’s pre-training budget to reach 97.6% of its performance while saving 75% of the KV cache. Compared to Grouped-Query Attention (GQA), DHA achieves a 5x training acceleration, up to a 13.93% performance improvement under a 0.01% pre-training budget, and a 4% relative improvement under a 0.05% pre-training budget.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large language models are really smart, but they use a lot of computing power and memory when they make predictions. The researchers came up with a new way to make these models work more efficiently without losing their ability to learn. They tested the new method on models of different sizes and found that it could save 75% of the memory the models need during prediction while still performing almost as well.
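
To make the head-fusion idea above more concrete, here is a minimal, illustrative sketch (not the authors' code) of fusing similar key-head parameters from an MHA checkpoint into a smaller set of shared heads. The greedy similarity grouping, the uniform averaging, the fuse_heads helper, and the head/model sizes are all hypothetical simplifications; DHA itself configures the grouping adaptively per layer and learns the fusion weights rather than averaging.

# Illustrative sketch only: fuse similar key/value heads from an MHA
# checkpoint into shared heads. All sizes and choices below are assumptions
# for demonstration, not the paper's actual procedure.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two flattened head-parameter vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fuse_heads(W, num_groups):
    """W: (num_heads, head_dim, d_model) per-head projection weights.
    Greedily groups heads by parameter similarity, then fuses each group
    with a uniform linear combination (DHA instead learns fusion weights)."""
    num_heads = W.shape[0]
    flat = W.reshape(num_heads, -1)
    group_size = num_heads // num_groups
    unassigned = list(range(num_heads))
    groups = []
    while unassigned:
        # Seed a group with the first unassigned head, then pull in the
        # most similar remaining heads until the group is full.
        seed = unassigned.pop(0)
        sims = sorted(unassigned, key=lambda h: -cosine(flat[seed], flat[h]))
        members = [seed] + sims[:group_size - 1]
        for h in members[1:]:
            unassigned.remove(h)
        groups.append(members)
    # Fuse each group into one shared head (uniform average as a placeholder).
    fused = np.stack([W[g].mean(axis=0) for g in groups])
    return groups, fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_heads, head_dim, d_model = 16, 64, 1024   # hypothetical sizes
    W_k = rng.normal(size=(num_heads, head_dim, d_model))
    groups, W_k_fused = fuse_heads(W_k, num_groups=4)
    # Sharing 16 key heads across 4 groups means only 4 sets of key states
    # need to be cached, i.e. a 75% reduction for the key half of the cache.
    print(groups, W_k_fused.shape)   # 4 groups, fused shape (4, 64, 1024)

In this toy setup, sharing 16 key (and, analogously, value) heads across 4 groups is one way a 75% KV-cache reduction of the kind reported above could arise; the paper's continued training then recovers most of the original model's quality from the fused checkpoint.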

Keywords

» Artificial intelligence  » Attention  » Inference  » Multi-head attention