
Summary of Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs, by Raghavv Goel et al.


Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs

by Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott

First submitted to arXiv on: 29 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed framework trains a draft model to enable inference acceleration via speculative decoding for large language models (LLMs). The framework consists of pretraining, distillation dataset generation, and fine-tuning with knowledge distillation, and is demonstrated with Llama 2 Chat 7B as the target model. A new Total Variation Distance++ (TVD++) loss is introduced, incorporating variance reduction techniques inspired by policy gradient methods in reinforcement learning. The resulting draft model, Llama 2 Chat Drafter 115M, achieves up to 2.3 block efficiency and a 2.4x speed-up relative to autoregressive decoding on various tasks without further task-specific fine-tuning. (Illustrative code sketches of a distillation loss and of the speculative-decoding loop follow the summaries below.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about making computers faster at understanding and generating human-like text. Right now, these systems are limited less by raw computing power than by how quickly they can move data in and out of memory. The idea is to train a much smaller "draft" model that guesses several words ahead, so the main model can check them all at once instead of producing one word at a time. The new approach needs relatively little training data and teaches the small model by having it imitate the big one, much like a student learning from a teacher.

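And here is a small self-contained sketch of the speculative-decoding loop that such a draft model accelerates, with toy softmax distributions standing in for the real draft and target models. The accept/reject rule is the standard speculative-sampling one (accept a draft token with probability min(1, p/q), otherwise resample from the normalized residual max(p − q, 0)), and block efficiency is measured as the average number of tokens emitted per verification block; all names here (toy_logits, draft_q, target_p, GAMMA) are illustrative rather than taken from the paper.

```python
import numpy as np

VOCAB = 50   # toy vocabulary size
GAMMA = 4    # number of draft tokens proposed per block
rng = np.random.default_rng(0)

def toy_logits(context):
    """Deterministic toy logits for a given context (stand-in for a real model)."""
    seed = abs(hash(tuple(context))) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def target_p(context):
    """'Large' target model: sharper distribution over the toy logits."""
    return softmax(toy_logits(context) / 0.7)

def draft_q(context):
    """'Small' draft model: same toy logits at higher temperature, so it only
    approximates the target, as a distilled drafter would."""
    return softmax(toy_logits(context) / 1.3)

def speculative_block(context, gamma=GAMMA):
    """One block: the draft proposes gamma tokens, the target verifies them.
    (A real implementation scores all draft positions in one target forward pass.)"""
    ctx, proposals, q_probs = list(context), [], []
    for _ in range(gamma):
        q = draft_q(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        proposals.append(tok)
        q_probs.append(q)
        ctx.append(tok)

    emitted, ctx = [], list(context)
    for tok, q in zip(proposals, q_probs):
        p = target_p(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)                   # target agrees: keep the draft token
            ctx.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)     # resample from the residual distribution
            emitted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return emitted                        # stop at the first rejection
    emitted.append(int(rng.choice(VOCAB, p=target_p(ctx))))  # bonus token when all accepted
    return emitted

# Block efficiency = average tokens emitted per target verification block
# (each trial restarts from the same toy prompt for simplicity).
blocks = [speculative_block([1, 2, 3]) for _ in range(500)]
print("block efficiency:", sum(len(b) for b in blocks) / len(blocks))
```

Higher agreement between the draft and target distributions raises the acceptance rate, and with it the block efficiency and end-to-end speed-up.
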
Keywords

  • Artificial intelligence
  • Autoregressive
  • Distillation
  • Fine tuning
  • Inference
  • Knowledge distillation
  • Llama
  • Pretraining
  • Reinforcement learning