Summary of Does RoBERTa Perform Better Than BERT in Continual Learning: An Attention Sink Perspective, by Xueying Bai et al.
Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective
by Xueying Bai, Yifan Sun, Niranjan Balasubramanian
First submitted to arXiv on: 8 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below cover the same AI paper at different levels of difficulty: the medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract (available on arXiv). |
| Medium | GrooveSquid.com (original content) | This paper investigates the relationship between pre-training and continual learning (CL) in machine learning models. The authors find that high-capacity pre-trained models may allocate large attention scores to “sink” tokens, such as [SEP], which can lead to over-smoothing in single-task learning and to interference across sequentially learned tasks, ultimately hurting CL performance. To address this, they propose pre-scaling, a mechanism that encourages attention diversity by scaling task attention toward non-sink tokens during probing and fine-tuning (an illustrative sketch of this idea follows the table). Experimental results demonstrate significant improvements in CL without experience replay or progressive parameter storage. |
| Low | GrooveSquid.com (original content) | This study looks at how well machine learning models can learn new tasks while remembering old ones. The researchers found that some pre-trained models get stuck paying attention to certain words, like special tokens, which can make them forget what they learned earlier. To solve this, the authors came up with a new way to adjust attention so that models spread their focus more evenly across all words and remember previous tasks better. The results show that this approach helps the models learn new things without needing to store old information. |
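To make the pre-scaling idea more concrete, here is a minimal, hypothetical PyTorch sketch of one way to downweight attention placed on sink tokens and redistribute it to non-sink tokens. The function name `prescale_attention`, the `sink_weight` parameter, and the renormalization step are illustrative assumptions, not the authors’ exact mechanism, which operates during probing and fine-tuning.

```python
import torch

def prescale_attention(attn_scores, sink_mask, sink_weight=0.1):
    """Downweight attention placed on sink tokens (e.g., [CLS]/[SEP]) and
    renormalize so the remaining probability mass spreads over non-sink tokens.

    attn_scores: (batch, heads, query_len, key_len) post-softmax attention.
    sink_mask:   (batch, key_len) boolean tensor, True at sink-token positions.
    sink_weight: fraction of each sink token's attention that is retained.
    """
    # Broadcast the sink mask over the head and query dimensions.
    mask = sink_mask[:, None, None, :].to(attn_scores.dtype)
    # Keep only a small fraction of the attention that lands on sink tokens.
    scaled = attn_scores * (1.0 - mask * (1.0 - sink_weight))
    # Renormalize each query's distribution so every row sums to 1 again.
    return scaled / scaled.sum(dim=-1, keepdim=True).clamp(min=1e-9)

# Illustrative usage: 1 sequence, 2 heads, 4 tokens; tokens 0 and 3 act as sinks.
attn = torch.softmax(torch.randn(1, 2, 4, 4), dim=-1)
sinks = torch.tensor([[True, False, False, True]])
rebalanced = prescale_attention(attn, sinks)
```

The key design point this sketch illustrates is that attention mass is shifted away from sink positions and spread over the remaining tokens, which is the diversity effect the paper attributes to pre-scaling.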
Keywords
» Artificial intelligence » Attention » Continual learning » Fine-tuning » Machine learning