Loading Now

Summary of Beyond Uniform Query Distribution: Key-driven Grouped Query Attention, by Zohaib Khan et al.


Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

by Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza

First submitted to arxiv on: 15 Aug 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG)

     Abstract of paper      PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty Written by Summary
High Paper authors High Difficulty Summary
Read the original abstract here
Medium GrooveSquid.com (original content) Medium Difficulty Summary
The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads – reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.
Low GrooveSquid.com (original content) Low Difficulty Summary
This paper improves upon the Transformer architecture by addressing the memory footprint issue with long-sequence tasks. They propose new ways to group queries and key-value heads called GQA, KDGQA, DGQA, and PGQA. These methods use information from the norms of the key heads to inform query allocation. The results show that these variants can improve upon the original GQA through more informed and adaptive grouping mechanisms. This could lead to better performance in image classification tasks.

Keywords

» Artificial intelligence  » Attention  » Deep learning  » Image classification  » Self attention  » Transformer  » Vit