Summary of "Efficient Stagewise Pretraining via Progressive Subnetworks" by Abhishek Panigrahi et al.
Efficient Stagewise Pretraining via Progressive Subnetworks
by Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, Sanjiv Kumar
First submitted to arXiv on: 8 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper challenges the prevailing view that stagewise training approaches such as layer dropping are ineffective. It proposes a principled framework, Progressive Subnetwork Training (PST), which trains subnetworks within the full model and progressively increases their size during training. The framework is instantiated as Random Part Training (RAPTR), which selects and trains only a random subnetwork at each step and grows the subnetwork's size in stages (a minimal code sketch follows the table). The paper shows that RAPTR generalizes prior work on layer dropping, fixes its key issues, and provides theoretical justification for the approach. Experiments demonstrate that RAPTR can significantly speed up pretraining of standard models such as BERT and UL2, by up to 33% compared to standard training. |
Low | GrooveSquid.com (original content) | This paper is about a new way to train big language models faster. Usually, people think that dropping layers in a model doesn't work well, but this paper shows that with a special approach called Random Part Training (RAPTR), it can actually work better! RAPTR trains small parts of the model and then makes them bigger step by step, like building a tall tower block by block. The researchers tested this approach on well-known language models like BERT and found it was up to 33% faster than usual. And it even did better on some tasks! |
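
To make the stagewise idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' code) of RAPTR-style progressive subnetwork training: at each step only a random subset of residual blocks is active, skipped blocks act as the identity, and the number of active blocks grows over the course of training. All names (SimpleBlock, SubnetworkModel, raptr_schedule) and the linear growth schedule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of RAPTR-style progressive subnetwork training.
# Assumptions: toy residual blocks, a toy reconstruction loss, and a linear
# schedule for growing the active-subnetwork size; the paper's actual model,
# loss, and stage schedule may differ.
import random
import torch
import torch.nn as nn


class SimpleBlock(nn.Module):
    """Stand-in residual block; a real model would use transformer blocks."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.ff(self.norm(x))


class SubnetworkModel(nn.Module):
    def __init__(self, dim, num_layers):
        super().__init__()
        self.blocks = nn.ModuleList([SimpleBlock(dim) for _ in range(num_layers)])
        self.head = nn.Linear(dim, dim)

    def forward(self, x, active_indices=None):
        # Only the blocks in active_indices are applied; skipped blocks act as
        # the identity, so a random "part" of the network is trained this step.
        active = set(range(len(self.blocks))) if active_indices is None else set(active_indices)
        for i, block in enumerate(self.blocks):
            if i in active:
                x = block(x)
        return self.head(x)


def raptr_schedule(step, total_steps, num_layers, min_active=2):
    """Assumed stagewise schedule: grow the active-subnetwork size linearly."""
    frac = step / max(total_steps - 1, 1)
    return min_active + round(frac * (num_layers - min_active))


def train(total_steps=100, dim=32, num_layers=8):
    model = SubnetworkModel(dim, num_layers)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for step in range(total_steps):
        k = raptr_schedule(step, total_steps, num_layers)
        active = random.sample(range(num_layers), k)   # random subnetwork of size k
        x = torch.randn(16, dim)
        loss = (model(x, active) - x).pow(2).mean()    # toy reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    train()
```

Because each block is residual, skipping it simply passes the input through unchanged, which is what makes training random parts of the network early on well behaved; the later stages then activate more (eventually all) blocks so the full model is trained by the end.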
Keywords
* Artificial intelligence
* BERT