
Summary of MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training, by Pinxue Zhao et al.


MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training

by Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui

First submitted to arXiv on: 16 Jul 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Distributed, Parallel, and Cluster Computing (cs.DC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper proposes MEMO, a novel Large Language Model (LLM) training framework designed for fine-grained activation memory management. Long-context training is challenging because activations consume substantial GPU memory and fragment it, and existing frameworks compensate with redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). MEMO instead offloads memory-consuming activations to CPU memory after each layer's forward pass and fetches them back during the backward pass, implementing a token-wise activation recomputation and swapping mechanism. In addition, a bi-level Mixed Integer Programming (MIP) approach optimizes memory reuse across transformer layers, minimizing memory fragmentation. Empirically, MEMO achieves, on average, 1.97x and 1.80x the MFU of Megatron-LM and DeepSpeed, respectively; the improvement comes from managing memory efficiently, reducing recomputation and communication, and avoiding delays caused by fragmentation.
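
To make the swapping mechanism concrete, here is a minimal sketch of layer-wise activation offloading using PyTorch's saved-tensor hooks. This illustrates the general technique, not MEMO's actual implementation; the model, sizes, and hook names are placeholders.

    import torch
    import torch.nn as nn

    def pack_to_cpu(tensor):
        # Called when the forward pass saves an activation for backward:
        # copy it into pinned CPU memory so the GPU copy can be freed.
        cpu_copy = torch.empty(tensor.shape, dtype=tensor.dtype,
                               device="cpu", pin_memory=True)
        cpu_copy.copy_(tensor, non_blocking=True)
        return tensor.device, cpu_copy

    def unpack_from_cpu(packed):
        # Called when the backward pass needs the activation: fetch it back.
        device, cpu_copy = packed
        return cpu_copy.to(device, non_blocking=True)

    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                          nn.Linear(1024, 1)).cuda()
    x = torch.randn(8, 1024, device="cuda")

    with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
        y = model(x)       # activations saved for backward go to CPU
    y.sum().backward()     # and are fetched back during backward

The paper's bi-level MIP formulation is not reproduced here, but a toy single-level MIP conveys the flavor of the planning problem: choose, for each layer, whether to keep, swap, or recompute its activation so that GPU memory stays under a budget at minimum overhead. The sketch below uses the PuLP solver library, and every number (memory sizes, costs, budget) is made up for illustration.

    import pulp

    layers = range(4)
    act_mem = [4, 4, 4, 4]       # GiB per layer's activation (illustrative)
    swap_cost = [1.0] * 4        # relative swapping overhead (illustrative)
    recomp_cost = [2.0] * 4      # relative recompute overhead (illustrative)
    budget = 8                   # GiB of GPU memory left for activations

    prob = pulp.LpProblem("activation_plan", pulp.LpMinimize)
    keep = pulp.LpVariable.dicts("keep", layers, cat="Binary")
    swap = pulp.LpVariable.dicts("swap", layers, cat="Binary")
    recomp = pulp.LpVariable.dicts("recomp", layers, cat="Binary")

    for i in layers:
        # Each layer's activation is handled in exactly one way.
        prob += keep[i] + swap[i] + recomp[i] == 1

    # Activations kept on the GPU must fit in the memory budget.
    prob += pulp.lpSum(act_mem[i] * keep[i] for i in layers) <= budget

    # Objective: minimize total swap + recompute overhead.
    prob += pulp.lpSum(swap_cost[i] * swap[i] + recomp_cost[i] * recomp[i]
                       for i in layers)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    for i in layers:
        choice = ("keep" if keep[i].value() else
                  "swap" if swap[i].value() else "recompute")
        print(f"layer {i}: {choice}")

With the numbers above, the solver keeps two layers' activations on the GPU and swaps the other two, since swapping is modeled as cheaper than recomputation.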
Low Difficulty Summary (original content by GrooveSquid.com)
This research paper creates a new way to train Large Language Models (LLMs) called MEMO. Training LLMs on very long pieces of text currently uses up too much of the graphics card's memory, which makes it difficult. To solve this problem, the researchers designed a system that moves activation data the graphics card does not need right away into the computer's main memory (RAM) after each layer finishes its work, and then brings it back when it is needed again. They also used a special math technique called Mixed Integer Programming (MIP) to plan how the computer's memory gets reused efficiently. The results show that MEMO makes better use of the computer's power while training LLMs than other methods do.

Keywords

» Artificial intelligence  » Large language model  » Token  » Transformer