Summary of DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, by Yilong Chen et al.
DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion
by Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun
First submitted to arXiv on: 3 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract (available on arXiv). |
Medium | GrooveSquid.com (original content) | Large language models with billions of parameters demonstrate impressive performance, but the widely used Multi-Head Attention (MHA) incurs substantial computational and memory costs during inference. The proposed Decoupled-Head Attention (DHA) mechanism adaptively configures group sharing for key heads and value heads across layers, striking a better balance between performance and efficiency. By transforming MHA checkpoints into DHA models through linear fusion of similar head parameters (an illustrative sketch of this fusion step appears below the table), the approach requires a mere 0.25% of the original model’s pre-training budget to reach 97.6% of the original performance while saving 75% of the KV cache. Compared to Group-Query Attention (GQA), DHA achieves a 5x training acceleration, up to a 13.93% performance improvement under a 0.01% pre-training budget, and a 4% relative improvement under a 0.05% pre-training budget. |
Low | GrooveSquid.com (original content) | Large language models are really smart, but they use a lot of computing power and memory when they make predictions. The researchers came up with a new way to make these models run more efficiently without losing their ability to learn. They tested this new method on models of different sizes and found that it could save 75% of the memory the attention mechanism needs while still performing almost as well. |
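To make the head-fusion idea in the medium summary more concrete, here is a minimal, hypothetical NumPy sketch of collapsing a group of key-head projection matrices into one shared head via a normalized linear combination. The function name `fuse_heads`, the random fusion coefficients, and the fixed 4:1 grouping are assumptions for illustration only; the paper learns its fusion adaptively from MHA checkpoints, which this sketch does not reproduce.

```python
# Hypothetical sketch (not the authors' code): fusing several MHA key/value
# head projections into one shared head per group via a linear combination,
# in the spirit of the "linear fusion of similar head parameters" above.
import numpy as np

def fuse_heads(head_weights, fusion_coeffs):
    """Linearly fuse a group of per-head projection matrices into one shared matrix.

    head_weights:  list of arrays, each (d_model, d_head) -- the K or V projection
                   of one attention head in the group.
    fusion_coeffs: 1-D array of the same length; mixing weights (random here,
                   adaptively learned in the paper).
    """
    coeffs = np.asarray(fusion_coeffs, dtype=np.float64)
    coeffs = coeffs / coeffs.sum()                 # normalize so the fused head stays on scale
    stacked = np.stack(head_weights)               # (group_size, d_model, d_head)
    return np.tensordot(coeffs, stacked, axes=1)   # (d_model, d_head)

# Toy example: 8 key heads collapsed into 2 shared key heads (a 4:1 grouping).
d_model, d_head, n_heads, group_size = 32, 8, 8, 4
rng = np.random.default_rng(0)
k_heads = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]

shared_k = []
for g in range(n_heads // group_size):
    group = k_heads[g * group_size:(g + 1) * group_size]
    coeffs = rng.random(group_size)                # placeholder for learned fusion weights
    shared_k.append(fuse_heads(group, coeffs))

print(len(shared_k), shared_k[0].shape)            # 2 shared key heads, each (32, 8)
```

Keeping 2 shared key heads (and, analogously, 2 shared value heads) in place of 8 per layer is what a 4:1 grouping looks like, the ratio consistent with the 75% KV-cache saving quoted above.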
Keywords
» Artificial intelligence » Attention » Inference » Multi-head attention