Summary of DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, by Yilong Chen et al.
DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, by Yilong Chen, Linhao Zhang,…