Summary of Methods of Improving LLM Training Stability, by Oleg Rybakov et al.
Methods of improving LLM training stability
by Oleg Rybakov, Mike Chrzanowski, Peter Dykas, Jinze Xue, Ben Lanir
First submitted to arXiv on: 22 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The paper studies the training stability of large language models, using a small 830M-parameter model trained at higher learning rates to deliberately induce divergence. It identifies the growth of logits in attention layers as a source of instability and observes that the outputs of other linear layers can also grow in magnitude, again leading to divergence. To address this, the authors apply layer normalization to additional layers (QKV, Proj, and FC2), or combine layer normalization after QKV with softmax capping. These methods yield significant perplexity improvements and allow a 1.5x higher learning rate without model divergence. A minimal illustrative sketch of these techniques appears after this table.
Low | GrooveSquid.com (original content) | This paper is about making sure large language models train reliably. The researchers use a small model to test new ideas and find that some numbers inside the model can grow too large and cause problems. They try different ways to fix this, such as normalizing the outputs of certain layers or capping values inside the attention calculation. These fixes work well and let the model train with a higher learning rate without breaking down.
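The sketch below is a minimal PyTorch illustration of the two ideas described in the medium summary: adding layer normalization to the outputs of the QKV, Proj, and FC2 linear layers, and capping attention logits before the softmax. It is not the authors' code; the module names, the tanh-style cap, and the cap value of 50.0 are assumptions chosen for illustration.

```python
# Illustrative sketch (not the paper's implementation) of extra LayerNorms on
# linear-layer outputs plus softmax logit capping in self-attention.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class StabilizedSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, logit_cap: float = 50.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Extra LayerNorms on the QKV and Proj outputs, as the summary describes.
        self.qkv_norm = nn.LayerNorm(3 * d_model)
        self.proj_norm = nn.LayerNorm(d_model)
        # Cap value is a placeholder, not a number taken from the paper.
        self.logit_cap = logit_cap

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Normalize the QKV projection output to keep its magnitude bounded.
        qkv = self.qkv_norm(self.qkv(x))
        q, k, v = qkv.chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        logits = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        # Softmax capping: squash attention logits into [-cap, cap] with tanh
        # so they cannot grow without bound during training.
        logits = self.logit_cap * torch.tanh(logits / self.logit_cap)
        attn = F.softmax(logits, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        # Normalize the output projection (Proj) as well.
        return self.proj_norm(self.proj(out))

class StabilizedMLP(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.fc2_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the FC2 output, the third location mentioned in the summary.
        return self.fc2_norm(self.fc2(F.gelu(self.fc1(x))))
```

Example usage: `StabilizedSelfAttention(d_model=1024, n_heads=16)(torch.randn(2, 128, 1024))` returns a tensor of the same shape. The intent of both modules is the same: bound the magnitude of intermediate activations so that a larger learning rate can be used without the logits or linear-layer outputs diverging.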
Keywords
» Artificial intelligence » Attention » Logits » Perplexity » Softmax