
Summary of Efficient Stagewise Pretraining via Progressive Subnetworks, by Abhishek Panigrahi et al.


Efficient Stagewise Pretraining via Progressive Subnetworks

by Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar

First submitted to arXiv on: 8 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract on arXiv.
Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper challenges the prevailing view that stagewise training approaches like layer dropping are ineffective. Instead, it proposes a principled framework called Progressive Subnetwork Training (PST), which trains subnetworks within the model and progressively increases their size during training. This framework is instantiated as Random Part Training (RAPTR), which selects and trains only a random subnetwork of the model at each step, increasing the subnetwork's size in stages. The paper shows that RAPTR generalizes prior work on layer dropping, fixes its key issues, and provides theoretical justification for the approach. Experiments demonstrate that RAPTR can significantly speed up pretraining of standard models such as BERT and UL2, by up to 33% compared to standard training.
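To make the idea above concrete, here is a minimal, illustrative PyTorch sketch of training a random subnetwork of layers at each step while growing that subnetwork in stages. The class and function names (ProgressiveLayerStack, keep_fraction_schedule) and all hyperparameters are hypothetical; the paper's actual recipe (which parts are dropped, stage boundaries, any rescaling of residual branches) may differ.

```python
# Illustrative sketch of RAPTR-style progressive subnetwork training.
# Assumption: layer-level subnetworks over a residual stack; names are made up.
import torch
import torch.nn as nn


class ProgressiveLayerStack(nn.Module):
    """A stack of residual blocks; each training step runs a random subset of them.

    Skipped blocks act as the identity (only the residual path), so gradients
    flow only through the sampled subnetwork at that step.
    """

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers

    def forward(self, x: torch.Tensor, keep_fraction: float) -> torch.Tensor:
        n = len(self.layers)
        k = max(1, round(keep_fraction * n))      # size of the random subnetwork
        keep = torch.randperm(n)[:k].tolist()     # layers sampled for this step
        for i, layer in enumerate(self.layers):
            if self.training and i not in keep:
                continue                          # skipped layer = identity
            x = x + layer(x)                      # residual block
        return x


def keep_fraction_schedule(step: int, total_steps: int, num_stages: int = 4,
                           start: float = 0.5, end: float = 1.0) -> float:
    """Stagewise schedule: grow the trained fraction from `start` to `end` in discrete stages."""
    stage = min(step * num_stages // max(total_steps, 1), num_stages - 1)
    return start + (end - start) * stage / (num_stages - 1)


# Toy usage: 8 feed-forward "blocks" trained on random data with a placeholder loss.
layers = nn.ModuleList(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
                       for _ in range(8))
model = ProgressiveLayerStack(layers)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

total_steps = 1000
for step in range(total_steps):
    x = torch.randn(32, 64)
    frac = keep_fraction_schedule(step, total_steps)
    loss = model(x, keep_fraction=frac).pow(2).mean()  # placeholder objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Early steps therefore cost roughly half the compute of a full forward/backward pass, and the final stage trains the full model; at evaluation time all layers are used.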
Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about a new way to train big language models faster. Usually, people think that dropping layers from the model during training doesn't work well. But this paper shows that with a special approach called Random Part Training (RAPTR), you can actually make it work well! RAPTR trains small parts of the model and then grows them step by step. It's like building a big tower block by block. The researchers tested this approach on famous language models like BERT and found that training was up to 33% faster than usual. Amazingly, it even did better on some tasks!

Keywords

  • Artificial intelligence
  • BERT