
Summary of ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, by Jing Liu et al.


ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

by Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang

First submitted to arXiv on: 13 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
LLM development involves pre-training foundation models on massive data, followed by fine-tuning on task-specific data to create specialized experts. However, serving these experts poses significant memory challenges due to the impracticality of loading all experts onto devices and the substantial I/O costs incurred from frequent switching between experts in response to user requests. Previous methods decompose expert weights as pre-trained weights plus delta weights, followed by quantizing the delta weights using output channel-wise step sizes to reduce model size. However, these methods overlook the fact that certain input channels of delta weights can cause significant quantization errors at extremely low bitwidths. To address this issue, we introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs. We propose a salient-aware delta compression method that identifies salient input channels based on reconstruction error and applies mixed-precision quantization to reduce non-salient channels to low bits while keeping salient ones intact, cutting storage demand without compromising performance. Moreover, we develop a model-level routing method that efficiently directs user queries to the most suitable expert by performing domain classification. Our extensive experiments demonstrate the promising memory efficiency and routing performance of ME-Switch.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about making it easier to use many different artificial intelligence models at the same time. These AI models are trained on huge amounts of data, which takes up a lot of space. The authors found that current methods for reducing the size of these models can actually make them worse if they’re not done carefully. They came up with a new way to compress the model’s weights (like a recipe) so it uses less memory and still works well. This is important because we need many AI models to work together to solve complex problems, like understanding human language or generating code. The authors tested their method on several different AI models and showed that it can efficiently use 16 models at the same time.
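
The medium difficulty summary above describes two mechanisms: salient-aware delta compression and model-level routing. The snippet below is a minimal sketch of the compression idea only, assuming PyTorch, symmetric per-output-channel quantization of the delta weights, and an activation-energy proxy for ranking per-input-channel reconstruction error; the function names, the 2% salient-channel fraction, and the exact error criterion are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of salient-aware delta compression (not the authors' code).
import torch


def quantize_per_out_channel(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric quantization with one step size per output channel."""
    qmax = 2 ** (bits - 1) - 1
    step = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / step), -qmax, qmax) * step


def compress_delta(w_expert: torch.Tensor,      # [out_features, in_features]
                   w_base: torch.Tensor,        # [out_features, in_features]
                   x_calib: torch.Tensor,       # [n_samples, in_features]
                   bits: int = 2,
                   salient_fraction: float = 0.02):
    """Store an expert as pre-trained base weights plus a mixed-precision delta."""
    delta = w_expert - w_base

    # Quantize the whole delta at low bitwidth, then estimate how much each
    # input channel contributes to the reconstruction error, weighted by how
    # strongly that channel is activated on calibration data.
    q_all = quantize_per_out_channel(delta, bits)
    act_energy = x_calib.pow(2).mean(dim=0)                    # [in_features]
    chan_err = ((delta - q_all) ** 2).sum(dim=0) * act_energy  # [in_features]

    # Keep the most error-prone ("salient") input channels in full precision
    # and leave the rest at low bits.
    n_salient = max(1, int(salient_fraction * delta.shape[1]))
    salient_idx = torch.topk(chan_err, n_salient).indices
    q_delta = q_all.clone()
    q_delta[:, salient_idx] = delta[:, salient_idx]

    # At serving time the expert layer is reconstructed as w_base + q_delta.
    return q_delta, salient_idx
```

The model-level routing step is only described at a high level (domain classification over incoming queries), so the sketch below likewise uses hypothetical names: the domain classifier, the per-domain delta store, and apply_delta are assumptions rather than the paper's API.

```python
# Hypothetical sketch of model-level routing: classify the query's domain and
# materialize only that expert, reusing the resident pre-trained base weights.
from typing import Callable, Dict

import torch


def apply_delta(base_state: Dict[str, torch.Tensor],
                q_deltas: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Reconstruct an expert's state dict as base weights plus its compressed delta."""
    return {name: w + q_deltas[name] if name in q_deltas else w
            for name, w in base_state.items()}


def route(query: str,
          classify_domain: Callable[[str], str],
          delta_store: Dict[str, Dict[str, torch.Tensor]],
          base_state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Direct a query to the most suitable expert.

    Only the small compressed deltas are swapped in and out; the pre-trained
    base weights stay resident on the device, which keeps switching cheap.
    """
    domain = classify_domain(query)  # e.g. "code", "math", "chat"
    return apply_delta(base_state, delta_store[domain])
```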

Keywords

» Artificial intelligence  » Classification  » Fine tuning  » Precision  » Quantization