
Summary of HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference, by Peng Tang et al.


HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

by Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo

First submitted to arxiv on: 3 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Distributed, Parallel, and Cluster Computing (cs.DC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

The high difficulty version is the paper's original abstract, reproduced in the medium summary below.

Medium Difficulty Summary (written by GrooveSquid.com)
The Mixture-of-Experts (MoE) architecture has shown promise in Large Language Models (LLMs), offering improved capabilities at reduced inference costs. However, deploying MoE-based LLMs on memory-constrained edge devices remains challenging due to their substantial memory requirements. To address this issue, we propose HOBBIT, a mixed precision expert offloading system that enables flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low-precision versions can significantly reduce expert-loading latency while preserving model accuracy. HOBBIT introduces three innovative techniques that map the natural hierarchy of MoE computation: token-level dynamic expert loading, layer-level adaptive expert prefetching, and sequence-level multidimensional expert caching. These innovations fully leverage the benefits of mixed-precision expert inference. We evaluate HOBBIT’s performance across different edge devices with representative MoE models, achieving up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
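The key insight above — serving a cache-miss expert in low precision when it is less critical, so it loads faster — can be illustrated with a small sketch. This is not the paper's implementation; the cache policy, the gate-score threshold, and the int4-vs-fp16 cost ratio are all illustrative assumptions (HOBBIT's actual system uses token-level dynamic loading, layer-level prefetching, and a multidimensional caching policy).

```python
# Hypothetical sketch of token-level dynamic expert loading on a cache miss:
# experts with a low router gate score (less critical for this token) are
# fetched in low precision to cut loading latency. All names, thresholds,
# and cost ratios here are assumptions for illustration only.

LOW_PRECISION_SPEEDUP = 4  # assumed int4 vs fp16 transfer-cost ratio


class ExpertCache:
    """Toy expert cache: maps expert_id -> precision of the cached copy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}

    def get(self, expert_id):
        return self.store.get(expert_id)

    def put(self, expert_id, precision):
        if len(self.store) >= self.capacity:
            # Evict the oldest entry; the paper's sequence-level
            # multidimensional caching policy is far more refined.
            self.store.pop(next(iter(self.store)))
        self.store[expert_id] = precision


def load_expert(cache, expert_id, gate_score, criticality_threshold=0.3):
    """Return (precision_used, relative_load_cost); cache hits cost 0."""
    cached = cache.get(expert_id)
    if cached is not None:
        return cached, 0.0
    # Cache miss: critical experts get full precision, others low precision.
    if gate_score >= criticality_threshold:
        precision, cost = "fp16", 1.0
    else:
        precision, cost = "int4", 1.0 / LOW_PRECISION_SPEEDUP
    cache.put(expert_id, precision)
    return precision, cost
```

Under these assumed costs, a miss on a low-gate-score expert pays only a quarter of the full-precision loading cost, which is the latency saving the abstract attributes to mixed-precision expert inference.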
Low Difficulty Summary (written by GrooveSquid.com)
HOBBIT is a new way to make Large Language Models (LLMs) work better on small devices like phones or smart home gadgets. These devices have limited memory and processing power, making it hard to run LLMs that usually require powerful computers, because the models need a lot of memory to store their parameters. HOBBIT solves this by loading the less important parts of the model in a compressed, lower-precision form, which cuts memory use and loading time while keeping the model accurate.

Keywords

» Artificial intelligence  » Inference  » Mixture of experts  » Precision  » Token