Summary of Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs, by Raghavv Goel et al.
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
by Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott
First submitted to arXiv on: 29 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from whichever version suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes a framework for training a draft model that enables inference acceleration via speculative decoding for large language models (LLMs). The framework consists of pretraining, distillation-dataset generation, and fine-tuning with knowledge distillation, and is demonstrated with Llama 2 Chat 7B as the target model. A new Total Variation Distance++ (TVD++) loss incorporates variance-reduction techniques inspired by policy-gradient methods in reinforcement learning. The resulting draft model, Llama 2 Chat Drafter 115M, achieves up to 2.3 block efficiency and a 2.4× speed-up over autoregressive decoding across various tasks, without further task-specific fine-tuning. |
Low | GrooveSquid.com (original content) | This paper is about making computers faster at understanding and generating human-like text. Right now, these models are limited by how much memory bandwidth they have available. The idea is to train a smaller model that acts as a shortcut, letting the main model do its work more efficiently. This approach uses less data than earlier methods and trains the draft model in a way that resembles how humans learn from each other. |
Keywords
* Artificial intelligence * Autoregressive * Distillation * Fine tuning * Inference * Knowledge distillation * Llama * Pretraining * Reinforcement learning