Summary of HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference, by Peng Tang et al.
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
by Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo
First submitted to arXiv on: 3 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Distributed, Parallel, and Cluster Computing (cs.DC)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | The Mixture-of-Experts (MoE) architecture has shown promise in Large Language Models (LLMs), offering improved capabilities at reduced inference costs. However, deploying MoE-based LLMs on memory-constrained edge devices remains challenging due to their substantial memory requirements. To address this, the authors propose HOBBIT, a mixed-precision expert offloading system that enables flexible and efficient MoE inference. The key insight is that dynamically replacing less critical cache-miss experts with low-precision versions can significantly reduce expert-loading latency while preserving model accuracy. HOBBIT introduces three techniques that map onto the natural hierarchy of MoE computation: token-level dynamic expert loading, layer-level adaptive expert prefetching, and sequence-level multidimensional expert caching (a toy sketch of the loading policy appears after this table). Together, these fully exploit the benefits of mixed-precision expert inference. Evaluated across different edge devices with representative MoE models, HOBBIT achieves up to a 9.93x decoding speedup over state-of-the-art MoE offloading systems. |
| Low | GrooveSquid.com (original content) | HOBBIT is a new way to make Large Language Models (LLMs) work well on small devices like phones or smart home gadgets. These devices have limited memory and processing power, making it hard to run LLMs that normally need powerful computers. The problem is that these models need a lot of memory to store their knowledge. HOBBIT solves this by shrinking the parts of the model that matter least for a given input, so less memory and loading time is needed while the LLM stays accurate. |
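To make the key idea concrete, here is a minimal Python sketch of token-level dynamic expert loading as the medium-difficulty summary describes it: on a cache miss, an expert the router deems less critical is fetched in low precision instead of full precision. This is an illustrative toy under simplifying assumptions, not the HOBBIT implementation; every name here (`ExpertCache`, `fetch_expert`, the gating-score threshold, the stand-in loaders) is hypothetical.

```python
"""Toy sketch of token-level mixed-precision expert loading.
All names and the threshold policy are hypothetical illustrations,
not taken from the HOBBIT codebase."""

from collections import OrderedDict

CACHE_CAPACITY = 4  # experts resident in fast memory at full precision


class ExpertCache:
    """LRU cache of full-precision experts kept in device memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as recently used
            return self._cache[expert_id]
        return None

    def put(self, expert_id, weights):
        self._cache[expert_id] = weights
        self._cache.move_to_end(expert_id)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used


def load_full_precision(expert_id):
    """Stand-in for a slow full-precision (e.g. fp16) load from host memory."""
    return {"id": expert_id, "precision": "fp16"}


def load_low_precision(expert_id):
    """Stand-in for a fast low-precision (e.g. int4) load; far fewer bytes."""
    return {"id": expert_id, "precision": "int4"}


def fetch_expert(expert_id, gate_score, cache, threshold=0.2):
    """On a cache miss, serve a weakly gated ("less critical") expert
    at low precision to cut transfer latency; load strongly gated
    experts at full precision and cache them."""
    cached = cache.get(expert_id)
    if cached is not None:
        return cached  # cache hit: full precision, no transfer needed
    if gate_score < threshold:
        return load_low_precision(expert_id)  # cheap miss, not cached
    weights = load_full_precision(expert_id)
    cache.put(expert_id, weights)
    return weights


# Usage: route one token through its top-2 experts.
cache = ExpertCache(CACHE_CAPACITY)
routed = [(3, 0.55), (6, 0.12)]  # (expert_id, gating score) pairs
for expert_id, score in routed:
    expert = fetch_expert(expert_id, score, cache)
    print(expert_id, expert["precision"])
```

The rationale for the threshold is that a low-precision transfer moves a fraction of the bytes, so a miss on a weakly gated expert costs much less loading latency while barely affecting the output, since that expert contributes little to the token anyway. The paper's full system additionally layers in layer-level adaptive prefetching and sequence-level multidimensional caching, which this sketch omits.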
Keywords
» Artificial intelligence » Inference » Mixture of experts » Precision » Token