Summary of "Efficient Stagewise Pretraining via Progressive Subnetworks" by Abhishek Panigrahi et al.
Efficient Stagewise Pretraining via Progressive Subnetworks
by Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar
First submitted to arXiv on: 8 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper challenges the prevailing view that stagewise training approaches such as layer dropping are ineffective. It proposes a principled framework, Progressive Subnetwork Training (PST), which trains subnetworks within the full model and progressively increases their size during training. The framework is instantiated as Random Part Training (RAPTR), which selects and trains only a random subnetwork at each step and grows the subnetwork's size in stages (a minimal code sketch follows the table). The paper shows that RAPTR generalizes prior work on layer dropping, fixes its key issues, and provides theoretical justification for the approach. Experiments demonstrate that RAPTR can significantly speed up pretraining of standard models such as BERT and UL2, by up to 33% compared to standard training. |
Low | GrooveSquid.com (original content) | This paper is about a new way to train big language models faster. Usually, people think that dropping layers in a model doesn't work well, but this paper shows that with a special approach called Random Part Training (RAPTR), it can actually work better! RAPTR trains small parts of the model and then makes them bigger step by step, like building a tall tower block by block. The researchers tested this approach on well-known language models like BERT and found it was up to 33% faster than usual. And it even did better on some tasks! |
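
To make the stagewise idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' code) of RAPTR-style progressive subnetwork training: at each step only a random subset of residual blocks is active, skipped blocks act as the identity, and the number of active blocks grows over the course of training. All names (SimpleBlock, SubnetworkModel, raptr_schedule) and the linear growth schedule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of RAPTR-style progressive subnetwork training.
# Assumptions: toy residual blocks, a toy reconstruction loss, and a linear
# schedule for growing the active-subnetwork size; the paper's actual model,
# loss, and stage schedule may differ.
import random
import torch
import torch.nn as nn


class SimpleBlock(nn.Module):
    """Stand-in residual block; a real model would use transformer blocks."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.ff(self.norm(x))


class SubnetworkModel(nn.Module):
    def __init__(self, dim, num_layers):
        super().__init__()
        self.blocks = nn.ModuleList([SimpleBlock(dim) for _ in range(num_layers)])
        self.head = nn.Linear(dim, dim)

    def forward(self, x, active_indices=None):
        # Only the blocks in active_indices are applied; skipped blocks act as
        # the identity, so a random "part" of the network is trained this step.
        active = set(range(len(self.blocks))) if active_indices is None else set(active_indices)
        for i, block in enumerate(self.blocks):
            if i in active:
                x = block(x)
        return self.head(x)


def raptr_schedule(step, total_steps, num_layers, min_active=2):
    """Assumed stagewise schedule: grow the active-subnetwork size linearly."""
    frac = step / max(total_steps - 1, 1)
    return min_active + round(frac * (num_layers - min_active))


def train(total_steps=100, dim=32, num_layers=8):
    model = SubnetworkModel(dim, num_layers)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for step in range(total_steps):
        k = raptr_schedule(step, total_steps, num_layers)
        active = random.sample(range(num_layers), k)   # random subnetwork of size k
        x = torch.randn(16, dim)
        loss = (model(x, active) - x).pow(2).mean()    # toy reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    train()
```

Because each block is residual, skipping it simply passes the input through unchanged, which is what makes training random parts of the network early on well behaved; the later stages then activate more (eventually all) blocks so the full model is trained by the end.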
Keywords
* Artificial intelligence
* BERT