
Summary of Does RoBERTa Perform Better Than BERT in Continual Learning: An Attention Sink Perspective, by Xueying Bai et al.


Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective

by Xueying Bai, Yifan Sun, Niranjan Balasubramanian

First submitted to arXiv on: 8 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates the relationship between pre-training and continual learning (CL) in machine learning models. The authors find that high-capacity pre-trained models may concentrate attention scores on “sink” tokens, such as [SEP] tokens, which can lead to over-smoothing in single-task learning and to interference across sequentially learned tasks, ultimately hurting CL performance. To address this issue, they propose a mechanism called pre-scaling, which encourages attention diversity across all tokens by scaling task attention toward non-sink tokens during probing and fine-tuning (see the illustrative sketch after these summaries). Experimental results demonstrate significant improvements in CL without experience replay or progressive parameter storage.

Low Difficulty Summary (original content by GrooveSquid.com)
This study looks at how well machine learning models can learn new tasks while remembering old ones. The researchers found that some pre-trained models can get stuck focusing on certain words, like special tokens, which can make them forget what they learned earlier. To solve this problem, the authors came up with a new way to adjust attention in these models. It helps them spread attention more evenly across all words and remember previous tasks better. The results show that this approach improves how well the models learn new things without needing to store old information.

Keywords

» Artificial intelligence  » Attention  » Continual learning  » Fine tuning  » Machine learning