Summary of DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, by Yilong Chen et al.
DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion
by Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun
First submitted to arXiv on: 3 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract (available on arXiv). |
Medium | GrooveSquid.com (original content) | Large language models with billions of parameters demonstrate impressive performance, but the widely used Multi-Head Attention (MHA) incurs substantial computational and memory costs during inference. The proposed Decoupled-Head Attention (DHA) mechanism adaptively configures group sharing for key heads and value heads across layers, striking a better balance between performance and efficiency. By transforming MHA checkpoints into DHA models through linear fusion of similar head parameters (an illustrative sketch of this fusion step appears below the table), the approach requires a mere 0.25% of the original model’s pre-training budget to reach 97.6% of the original performance while saving 75% of the KV cache. Compared to Group-Query Attention (GQA), DHA achieves a 5x training acceleration, up to a 13.93% performance improvement under a 0.01% pre-training budget, and a 4% relative improvement under a 0.05% pre-training budget. |
Low | GrooveSquid.com (original content) | Large language models are really smart, but they use a lot of computing power and memory when they make predictions. The researchers came up with a new way to make these models run more efficiently without losing their ability to learn. They tested this new method on models of different sizes and found that it could save 75% of the memory the attention mechanism needs while still performing almost as well. |
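To make the head-fusion idea in the medium summary more concrete, here is a minimal, hypothetical NumPy sketch of collapsing a group of key-head projection matrices into one shared head via a normalized linear combination. The function name `fuse_heads`, the random fusion coefficients, and the fixed 4:1 grouping are assumptions for illustration only; the paper learns its fusion adaptively from MHA checkpoints, which this sketch does not reproduce.

```python
# Hypothetical sketch (not the authors' code): fusing several MHA key/value
# head projections into one shared head per group via a linear combination,
# in the spirit of the "linear fusion of similar head parameters" above.
import numpy as np

def fuse_heads(head_weights, fusion_coeffs):
    """Linearly fuse a group of per-head projection matrices into one shared matrix.

    head_weights:  list of arrays, each (d_model, d_head) -- the K or V projection
                   of one attention head in the group.
    fusion_coeffs: 1-D array of the same length; mixing weights (random here,
                   adaptively learned in the paper).
    """
    coeffs = np.asarray(fusion_coeffs, dtype=np.float64)
    coeffs = coeffs / coeffs.sum()                 # normalize so the fused head stays on scale
    stacked = np.stack(head_weights)               # (group_size, d_model, d_head)
    return np.tensordot(coeffs, stacked, axes=1)   # (d_model, d_head)

# Toy example: 8 key heads collapsed into 2 shared key heads (a 4:1 grouping).
d_model, d_head, n_heads, group_size = 32, 8, 8, 4
rng = np.random.default_rng(0)
k_heads = [rng.standard_normal((d_model, d_head)) for _ in range(n_heads)]

shared_k = []
for g in range(n_heads // group_size):
    group = k_heads[g * group_size:(g + 1) * group_size]
    coeffs = rng.random(group_size)                # placeholder for learned fusion weights
    shared_k.append(fuse_heads(group, coeffs))

print(len(shared_k), shared_k[0].shape)            # 2 shared key heads, each (32, 8)
```

Keeping 2 shared key heads (and, analogously, 2 shared value heads) in place of 8 per layer is what a 4:1 grouping looks like, the ratio consistent with the 75% KV-cache saving quoted above.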
Keywords
» Artificial intelligence » Attention » Inference » Multi-head attention