Summary of Scaling FP8 Training to Trillion-token LLMs, by Maxim Fishman et al.
Scaling FP8 training to trillion-token LLMs
by Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry
First submitted to arXiv on: 19 Sep 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary: Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper presents a breakthrough in training large language models with FP8 precision, enabling training on up to 20 times more data than previous limits. The authors discover that the SwiGLU activation function can amplify outliers during prolonged training, leading to instabilities. To address this issue, they introduce Smooth-SwiGLU, a modified version of the original function that ensures stable training without altering its behavior. Additionally, the paper demonstrates, for the first time, FP8 quantization of both Adam optimizer moments. The authors successfully train a 7B parameter model on 256 Intel Gaudi2 accelerators, achieving results comparable to the BF16 baseline while delivering up to a 34% throughput improvement. (Illustrative sketches of these ideas appear after the table.) |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper is about training big language models with a new type of math called FP8 precision. It’s like solving a really hard puzzle, but instead of pieces, they’re using lots and lots of words! The authors found that something called the SwiGLU activation function can cause problems if you train for too long. So, they made a new version called Smooth-SwiGLU to fix this issue. They also figured out how to use FP8 precision for the Adam optimizer, which helps with training the models. In the end, they were able to train a really big model (7 billion parameters!) and it worked just as well as other methods but was faster! |
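
To make the SwiGLU discussion above more concrete, here is a minimal, hypothetical sketch (not the paper's implementation): a standard SwiGLU next to a "smoothed" variant that rescales the linear branch per channel before a simulated FP8 cast and undoes the scale afterwards, so the product is preserved up to rounding. The names `fake_fp8_e4m3` and `smooth_swiglu` are illustrative assumptions, and float16 plus clamping stands in for a real FP8 dtype on hardware without FP8 support.

```python
# Hypothetical sketch, not the authors' code: SwiGLU vs. a per-channel
# "smoothed" variant whose output matches the original up to rounding.
import torch
import torch.nn.functional as F

def swiglu(x, w_gate, w_lin):
    """Standard SwiGLU: SiLU(x @ w_gate) * (x @ w_lin)."""
    return F.silu(x @ w_gate) * (x @ w_lin)

def fake_fp8_e4m3(t):
    """Rough stand-in for an FP8 (E4M3) cast: clamp to the format's max
    magnitude (~448) and round through float16. Illustrative only."""
    return t.clamp(-448.0, 448.0).to(torch.float16).to(torch.float32)

def smooth_swiglu(x, w_gate, w_lin):
    """Sketch: scale the linear branch's channels into a quantization-friendly
    range, cast, then rescale so the output matches the unscaled function."""
    lin = x @ w_lin
    scale = lin.abs().amax(dim=0).clamp(min=1e-8) / 448.0  # per-channel scale
    lin_q = fake_fp8_e4m3(lin / scale) * scale
    return F.silu(x @ w_gate) * lin_q

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(4, 16)
    w_gate, w_lin = torch.randn(16, 32), torch.randn(16, 32)
    ref = swiglu(x, w_gate, w_lin)
    approx = smooth_swiglu(x, w_gate, w_lin)
    print((ref - approx).abs().max())  # small: the rescaling preserves the output
```

The point of the rescaling is that outlier channels no longer force the whole tensor into the coarse upper range of the FP8 format, which is the failure mode the summary describes for long training runs.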
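The second idea in the summary, storing Adam optimizer moments in FP8, can be sketched the same way. This is an assumed, simplified scheme (per-tensor scale plus a low-precision cast), not the paper's exact method; `quantize_moment` and `dequantize_moment` are hypothetical helpers, and float16 again stands in for a real FP8 dtype.

```python
# Hypothetical sketch, not the paper's method: keeping an Adam moment in a
# low-precision format with a per-tensor scale, dequantizing for the update.
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude of the E4M3 format

def quantize_moment(m):
    """Scale the moment into FP8 range and cast; returns (codes, scale)."""
    scale = m.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    return (m / scale).to(torch.float16), scale

def dequantize_moment(codes, scale):
    """Recover an approximate float32 moment from the stored codes."""
    return codes.to(torch.float32) * scale

if __name__ == "__main__":
    torch.manual_seed(0)
    grad = torch.randn(1024)
    m = 0.9 * torch.zeros_like(grad) + 0.1 * grad  # one Adam first-moment step
    codes, scale = quantize_moment(m)
    m_restored = dequantize_moment(codes, scale)
    print((m - m_restored).abs().max())  # small quantization error
```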
Keywords
- Artificial intelligence
- Precision
- Quantization