
Summary of Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension, by Ning Wang et al.


Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension

by Ning Wang, Zekun Li, Tongxin Bai, Guoqi Li

First submitted to arXiv on: 5 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed attention architecture enables large language models to handle longer sequences with reduced computational resources and fine-tuning time. The model incorporates correlation-aware selection and merging mechanisms for efficient sparse attention, along with a novel data augmentation technique based on positional encodings. The results show that the method can fine-tune Llama2-7B at a sequence length of 32K, outperforming other methods that rely on subsets. The architecture also enables pre-training with partial translation invariance during token selection and applies positional encodings only to the selected tokens, which yields high performance and extrapolation capabilities. For fine-tuning, the method introduces the Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRD NTK), which allows models such as Llama2-7B and Mistral-7B to perform inference with context lengths of up to 1M tokens, or of arbitrary length.
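
To make the selection-and-merge idea above more concrete, here is a minimal NumPy sketch, not the authors' implementation: it scores blocks of keys by their correlation with the query, keeps only the top-scoring blocks, applies rotary positional encodings to the selected tokens only, and then runs ordinary softmax attention over that reduced set. The block size, top-k value, scoring rule, and position handling are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' implementation) of correlation-aware block
# selection for sparse attention, with positional encodings applied only to
# the selected tokens. All hyperparameters here are illustrative assumptions.
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary positional encoding to x of shape (n, d) at given positions."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = positions[:, None] * freqs[None, :]        # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def select_and_attend(q, K, V, block_size=64, top_k=4):
    """Single-query sparse attention over the top-k most correlated key blocks.

    q: (d,) query vector; K, V: (n, d) keys and values for a long context.
    """
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Correlation-aware selection: score each block by its mean dot product with q.
    block_scores = (Kb @ q).mean(axis=1)                 # (n_blocks,)
    kept_blocks = np.sort(np.argsort(block_scores)[-top_k:])

    # Merge the selected blocks into one reduced key/value set.
    token_idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in kept_blocks]
    )
    K_sel, V_sel = K[token_idx], V[token_idx]

    # Positional encodings are applied only to the selected tokens, using
    # compact positions over the merged subsequence (an assumption for this sketch);
    # the query is placed right after the selected tokens.
    compact_pos = np.arange(len(token_idx), dtype=np.float64)
    K_sel = rope(K_sel, compact_pos)
    q_enc = rope(q[None, :], np.array([float(len(token_idx))]))[0]

    # Standard softmax attention over the reduced set of tokens.
    scores = K_sel @ q_enc / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_sel

# Example: a 4096-token context is reduced to top_k * block_size = 256 attended tokens.
rng = np.random.default_rng(0)
d = 64
out = select_and_attend(rng.normal(size=d), rng.normal(size=(4096, d)), rng.normal(size=(4096, d)))
print(out.shape)  # (64,)
```

Restricting attention (and positional encoding) to the selected blocks keeps the per-query cost roughly proportional to the number of kept tokens rather than to the full context length, which is what makes long-sequence fine-tuning cheaper in this style of method.
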
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper proposes a way for large language models to handle longer sequences without using too many computing resources. The model works by selecting the most important parts of the sequence and merging them together, which makes it more efficient. It also uses a new technique to make sure the model still performs well on sequences longer than the ones it was trained on. The results show that this method is much faster than other methods that try to handle longer sequences.
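
As background for the "NTK" part of the CRD NTK embedding mentioned in the medium difficulty summary, the sketch below shows plain NTK-aware rescaling of the RoPE base, a standard trick for letting positions beyond the training window map into the angle range the model was trained on. It is not the paper's CRD NTK (the cyclic, randomly truncated, and dynamically growing components are not reproduced here), and the function name, base, and scale factor are illustrative assumptions.

```python
# Hedged background sketch: standard NTK-aware RoPE base rescaling, shown for
# context only. This is NOT the paper's CRD NTK; base and scale_factor are
# illustrative assumptions.
import numpy as np

def ntk_scaled_rope_angles(positions, dim, base=10000.0, scale_factor=8.0):
    """Rotation angles for RoPE with an NTK-style rescaled base.

    Raising the base by scale_factor ** (dim / (dim - 2)) stretches the lowest
    frequencies so that positions beyond the original training length still
    produce angles comparable to those seen during training.
    """
    half = dim // 2
    scaled_base = base * scale_factor ** (dim / (dim - 2))
    freqs = scaled_base ** (-np.arange(half) / half)     # (half,)
    return positions[:, None] * freqs[None, :]           # (n, half)

# Positions far beyond a 4K training window, evaluated with a rescaled base.
angles = ntk_scaled_rope_angles(np.arange(0, 32768, 4096, dtype=np.float64), dim=128)
print(angles.shape)  # (8, 64)
```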

Keywords

» Artificial intelligence  » Attention  » Data augmentation  » Embedding  » Fine tuning  » Inference  » Token  » Translation