Summary of Large Language Models Relearn Removed Concepts, by Michelle Lo et al.
Large Language Models Relearn Removed Concepts
by Michelle Lo, Shay B. Cohen, Fazl Barez
First submitted to arXiv on: 3 Jan 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper explores the ability of large language models (LLMs) to recover from neuron pruning, a technique used to remove undesirable concepts by deleting the neurons most associated with them. The authors track concept saliency and similarity during retraining to evaluate concept relearning in pruned models (a rough code sketch of this pruning-and-tracking setup follows the table). They find that LLMs quickly regain performance by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics, demonstrating polysemantic capacity and the ability to blend old and new concepts. While neuron pruning offers interpretability into how models represent concepts, the results highlight the difficulty of permanently removing concepts for improved model safety. The authors suggest that monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. |
Low | GrooveSquid.com (original content) | This research looks at how well big language models can recover after some of their parts are removed, a technique called neuron pruning. The scientists want to know whether the models can learn the removed information again. They track what’s important in the model during retraining and find that it can quickly get back to normal by moving advanced ideas to earlier parts of the model and giving old ideas new homes in neurons with similar meanings. This shows that language models are very flexible and can combine old and new ideas in one place. While removing parts helps us understand what’s inside the model, this study highlights the challenge of making sure nothing bad gets learned again. The researchers think it’s important to keep an eye on what’s being relearned and to find ways to stop bad ideas from coming back. |
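
For readers who want a concrete picture of the setup described above, here is a minimal, hedged sketch of concept-neuron pruning and saliency tracking in a GPT-2-style model, assuming PyTorch and Hugging Face `transformers`. The layer index, neuron indices, and concept direction are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the paper’s actual saliency and similarity measures may differ.

```python
# Illustrative sketch only: the layer, neuron indices, and concept direction
# below are hypothetical stand-ins, not the paper's actual procedure.
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

LAYER = 6                     # hypothetical MLP layer to prune
NEURONS = [12, 305, 1711]     # hypothetical neuron indices tied to a concept

# "Prune" the concept neurons by zeroing their output weights in the MLP
# projection, so their activations no longer reach the residual stream.
mlp = model.h[LAYER].mlp
with torch.no_grad():
    # GPT-2's Conv1D c_proj weight has shape (n_inner, n_embd): rows index neurons.
    mlp.c_proj.weight[NEURONS, :] = 0.0

# Track concept saliency as cosine similarity between each neuron's output
# direction and a concept direction (a random stand-in vector here).
concept_dir = torch.randn(model.config.n_embd)
concept_dir = concept_dir / concept_dir.norm()

def concept_saliency(layer_idx):
    """Per-neuron alignment of one MLP layer with the concept direction."""
    w = model.h[layer_idx].mlp.c_proj.weight          # (n_inner, n_embd)
    return torch.nn.functional.cosine_similarity(
        w, concept_dir.unsqueeze(0), dim=1
    )                                                  # (n_inner,)

# Fine-tuning on the original data is omitted here; during that retraining,
# re-computing concept_saliency for every layer is what reveals where the
# removed concept reappears.
print("post-pruning saliency of pruned neurons:",
      concept_saliency(LAYER)[NEURONS].tolist())
```

Repeating the saliency measurement across all layers throughout retraining is what would let one observe the effects the summaries describe, such as the concept resurfacing in earlier layers or in neurons with related semantics.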
Keywords
» Artificial intelligence » Pruning » Semantics