Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models
by Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
First submitted to arXiv on: 13 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to fine-tuning large language models for downstream tasks while reducing memory usage. Zeroth-Order (ZO) optimization estimates gradients in place of First-Order (FO) gradient calculations, but its stochastic nature lengthens training time. Revisiting the Memory-efficient ZO (MeZO) optimizer, the authors find that its full-parameter perturbation and update steps account for over 50% of its fine-tuning time. To address this, they introduce LeZO, a layer-wise sparse, computation- and memory-efficient ZO optimizer. LeZO treats layers as the fundamental units of sparsification and dynamically perturbs a different subset of layers at each step, so that full-parameter fine-tuning is still achieved over the course of training. It incorporates layer-wise parameter sparsity into both the simultaneous perturbation stochastic approximation (SPSA) and the ZO stochastic gradient descent (ZO-SGD) update (a minimal sketch of this idea appears after this table). Experiments with the OPT model family on the SuperGLUE benchmark and two generative tasks show that LeZO accelerates training without compromising performance, achieving over 3x speedup compared to MeZO on specific tasks. |
Low | GrooveSquid.com (original content) | This research paper is about making big language models better at specific smaller tasks. One way to do this uses Zeroth-Order optimization, which saves memory but makes training take longer. The researchers looked at a particular Zeroth-Order method called MeZO and found that most of its time goes into perturbing and updating every parameter in the model. To solve this, they created a new method called LeZO, which only perturbs and updates a few layers of the model at each step while still covering every layer over time. They tested LeZO with different language models and tasks and found that it is faster without sacrificing performance. |
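To make the layer-wise idea concrete, here is a minimal PyTorch sketch of a two-point SPSA gradient estimate with layer-wise sparse perturbation followed by a ZO-SGD update. This is an illustration under assumptions, not the paper's implementation: the function name `lezo_step`, the grouping of parameters via `model.children()`, the `loss_fn(model, batch)` interface, and the hyperparameter values are all hypothetical.

```python
import torch


def lezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, num_active_layers=2):
    """One illustrative LeZO-style step: a two-point SPSA gradient estimate with
    layer-wise sparse perturbation, followed by a ZO-SGD update.

    Hypothetical sketch: the layer grouping, interfaces, and hyperparameters
    are assumptions, not taken from the paper.
    """
    # Group trainable parameters by top-level submodule ("layer") -- an assumed
    # scheme; a real transformer would expose its decoder blocks for this instead.
    layers = [
        list(m.parameters())
        for m in model.children()
        if any(p.requires_grad for p in m.parameters())
    ]

    # Randomly choose a sparse subset of layers to perturb and update this step.
    idx = torch.randperm(len(layers))[:num_active_layers].tolist()
    active = [p for i in idx for p in layers[i]]

    # A fixed seed lets us regenerate the same Gaussian perturbation z later
    # instead of storing it (the in-place memory trick MeZO relies on).
    z_seed = int(torch.randint(0, 2**31 - 1, (1,)).item())

    def perturb(scale):
        gen = torch.Generator().manual_seed(z_seed)
        for p in active:
            z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
            p.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                      # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                      # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                      # back to theta

        # Scalar projected gradient from the two forward passes (SPSA estimate).
        proj_grad = (loss_plus - loss_minus).item() / (2 * eps)

        # ZO-SGD update on the active layers only, regenerating z from the seed.
        gen = torch.Generator().manual_seed(z_seed)
        for p in active:
            z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
            p.sub_(lr * proj_grad * z)

    return float(loss_plus)
```

Two forward passes and no backward pass are needed, and the perturbation is never stored, which is where the memory savings come from; restricting the perturbation and update to a few layers per step is what would reduce the per-step computation in the spirit of LeZO, while the random layer selection lets every layer be touched over many steps.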
Keywords
» Artificial intelligence » Fine tuning » Optimization » Stochastic gradient descent