GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

by Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian

First submitted to arXiv on: 6 Mar 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high-difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
The proposed Gradient Low-Rank Projection (GaLore) training strategy enables full-parameter learning while reducing memory usage in optimizer states by up to 65.5%. The approach maintains efficiency and performance when pre-training LLaMA architectures on the C4 dataset and when fine-tuning RoBERTa on GLUE tasks. Additionally, 8-bit GaLore reduces optimizer memory by up to 82.5% and total training memory by 63.3% compared to a BF16 baseline. The paper demonstrates the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory without model parallelism, checkpointing, or offloading strategies. A minimal code sketch of the projection idea follows the summaries below.

Low Difficulty Summary (original content by GrooveSquid.com)
Large Language Models (LLMs) are trained using large amounts of data and powerful computers. However, this process requires significant memory to store all the information. Researchers have developed ways to reduce memory usage while still getting good results. In this paper, they propose a new way called Gradient Low-Rank Projection (GaLore). This method allows them to train models with many parameters without using too much memory. They tested GaLore on different types of models and tasks, showing that it works well and is efficient.
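
The medium-difficulty summary above describes GaLore as keeping the optimizer state in a low-rank space obtained by projecting gradients. The snippet below is a minimal NumPy sketch of that idea, not the authors' implementation; the function name galore_step, the rank and update_proj_gap parameters, and the plain Adam-style update are assumptions made for illustration.

```python
# Minimal sketch of gradient low-rank projection for one weight matrix.
# Not the authors' code; names and hyperparameters are illustrative only.
import numpy as np

def galore_step(W, grad, state, rank=4, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-8, update_proj_gap=200):
    """One optimizer step whose Adam statistics live in a rank-`rank` space."""
    t = state.get("t", 0) + 1
    state["t"] = t

    # Periodically refresh the projector P from the SVD of the current gradient.
    if "P" not in state or t % update_proj_gap == 1:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                    # m x r orthonormal basis
    P = state["P"]

    g_low = P.T @ grad                              # projected gradient: r x n
    m = state.setdefault("m", np.zeros_like(g_low))
    v = state.setdefault("v", np.zeros_like(g_low))
    m[:] = beta1 * m + (1 - beta1) * g_low          # Adam moments kept low-rank,
    v[:] = beta2 * v + (1 - beta2) * g_low ** 2     # so they cost O(r * n) memory
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update_low = m_hat / (np.sqrt(v_hat) + eps)

    # Project the low-rank update back to full size and apply it to W.
    W -= lr * (P @ update_low)
    return W
```

In this sketch the Adam moments are r x n rather than m x n, which is where the optimizer-state memory saving described in the summary would come from; a full training loop would call something like galore_step once per weight matrix per step.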

Keywords

* Artificial intelligence
* Fine-tuning
* LLaMA