
Summary of How to Train Long-Context Language Models (Effectively), by Tianyu Gao et al.


How to Train Long-Context Language Models (Effectively)

by Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen

First submitted to arXiv on: 3 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
We investigate continued training and supervised fine-tuning (SFT) of a language model (LM) to effectively leverage long-context information. To guide model development, we establish a reliable evaluation protocol based on a broad set of long-context tasks, evaluating models after SFT on instruction data, which better reveals their long-context abilities. These robust evaluations support thorough experiments on the data mix for continued pre-training, the instruction-tuning dataset, and other design choices. We find that code repositories and books are excellent sources of long data when combined with high-quality short data; that training with sequence lengths beyond the evaluation length boosts long-context performance; and that using only short instruction datasets still yields strong performance on long-context tasks. Our final model, ProLong-8B, initialized from Llama-3 and trained on 40B tokens, achieves state-of-the-art long-context performance among similarly sized models at a length of 128K, outperforming Llama-3.1-8B-Instruct on most long-context tasks despite seeing only 5% as many tokens during long-context training. ProLong can effectively process up to 512K tokens, one of the longest context windows among publicly available LMs. (A rough sketch of such a training data mix follows the summaries below.)

Low Difficulty Summary (original content by GrooveSquid.com)
This paper explores how to make language models better at understanding longer pieces of text. Instead of relying only on simple tests, we use a variety of tasks that require understanding long passages. We find that combining different types of training data and training on longer sequences helps the model handle even more complex texts. Our best model, called ProLong-8B, can process very long pieces of text (up to 512K tokens) and outperforms other similarly sized models on most long-context tasks.

Keywords

» Artificial intelligence  » Fine tuning  » Instruction tuning  » Language model  » Llama  » Supervised fine-tuning