How Does Code Pretraining Affect Language Model Task Performance?

by Jackson Petty, Sjoerd van Steenkiste, Tal Linzen

First submitted to arXiv on: 6 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to read whichever version suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper investigates how adding non-linguistic data, specifically source code, to pretraining corpora affects language model performance on non-coding tasks. The authors train models on datasets that interleave natural language and code, varying the proportion of code under two settings: competitive, in which the total volume of pretraining data is held constant (so code displaces language data), and additive, in which the volume of language data is held fixed (so code is added on top of it). They then analyze how the code proportion affects performance on tasks from the BigBench benchmark and on two measures of compositional generalization: semantic parsing and syntactic transformations. The results indicate that a higher proportion of code improves performance on compositional tasks with structured output, such as mathematics, but degrades performance on tasks sensitive to linguistic structure, such as syntax and morphology, as well as on tasks measuring real-world knowledge.
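
To make the two settings concrete, here is a minimal sketch of how the language and code token budgets could be computed under each regime. Everything in it (the function names, the 100M-token budget) is illustrative and assumed, not taken from the paper:

```python
# Illustrative sketch of the two data-mixing regimes described above.
# All budgets, names, and numbers here are hypothetical, not from the paper.

def competitive_mixture(total_tokens: int, code_fraction: float) -> dict:
    """Competitive regime: the total pretraining budget is fixed,
    so every code token displaces a natural-language token."""
    code = int(total_tokens * code_fraction)
    return {"language_tokens": total_tokens - code, "code_tokens": code}

def additive_mixture(language_tokens: int, code_fraction: float) -> dict:
    """Additive regime: the natural-language data is held fixed and code
    is added on top, so the total budget grows with the code fraction."""
    # Solve code / (language + code) = code_fraction for code (fraction < 1).
    code = int(language_tokens * code_fraction / (1.0 - code_fraction))
    return {"language_tokens": language_tokens, "code_tokens": code}

if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.25, 0.5):
        print(f"code fraction {frac:.2f}:",
              "competitive:", competitive_mixture(100_000_000, frac),
              "additive:", additive_mixture(100_000_000, frac))
```

The contrast matters because in the competitive regime code can only displace language data, while in the additive regime it strictly enlarges the corpus; comparing the two helps separate the effect of code itself from the effect of simply training on more data.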
Low Difficulty Summary (written by GrooveSquid.com; original content)
Large language models often learn from data that includes computer code, not just ordinary text. Code clearly helps with programming tasks, but does it also change how well models handle everything else? Researchers tested this by training models on different mixes of natural language and code, then measuring performance on a variety of tasks, including math problems, sentence parsing, and questions about real-world knowledge. They found that more code helps with structured, math-like tasks, but hurts performance on tasks that require understanding language structure or everyday knowledge.

Keywords

» Artificial intelligence  » Large language model  » Parsing  » Pretraining  » Semantic parsing  » Syntax