Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
by Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
First submitted to arXiv on: 3 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This research paper presents a novel approach to generating synthetic data for code synthesis, addressing the scarcity of sequential edit data. The proposed algorithm, LintSeq, refactors programs into sequences of synthetic edits by procedurally sampling across interdependent lines of source code using a linter. To evaluate the algorithm, the authors fine-tune language models ranging from 2.6B to 14B parameters on both the refactored and original versions of a dataset. Models fine-tuned to synthesize code iteratively match or outperform baselines on pass@1 (the rate at which a single generated program passes all tests) and scale better across higher pass@k as a function of total test-time FLOPs. The authors also pretrain tiny LMs for code understanding and show that fine-tuning them to synthesize code edit-by-edit yields strong performance on HumanEval and MBPP(+). Finally, the paper compares the proposed approach to existing code language models such as CodeT5+, AlphaCode, and Codex. |
Low | GrooveSquid.com (original content) | This research is about helping computers write better code by generating more training data. Human programmers mostly improve software by editing existing code, but records of those edits are rare, so it is hard for computers to learn to write code the same way. The researchers developed a new method called LintSeq that automatically generates synthetic edits from existing programs. When they fine-tuned language models on this generated data, the models wrote code as well as or better than models trained the usual way. This could lead to more accurate code writing in the future. |
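The medium summary describes LintSeq only at a high level. As a rough illustration, and not the authors' actual implementation, the sketch below shows one way such a procedure could look in Python: repeatedly delete small chunks of lines, keep only deletions that leave a "lint-clean" program, then reverse the resulting file states into a sequence of insertion edits expressed as unified diffs. The function names (`lints_clean`, `sample_edit_sequence`) and the use of `compile()` as a stand-in linter are illustrative assumptions; the paper uses a real linter and its own sampling scheme over interdependent lines.

```python
import difflib
import random

def lints_clean(code: str) -> bool:
    """Stand-in 'linter': accept a program iff it parses as valid Python.
    (A cheap proxy here; the paper runs an actual linter.)"""
    try:
        compile(code, "<synthetic>", "exec")
        return True
    except SyntaxError:
        return False

def sample_edit_sequence(source: str, rng: random.Random) -> list[str]:
    """Decompose `source` into a sequence of insertion edits (unified diffs).

    Backward pass: repeatedly delete a randomly chosen chunk of lines,
    keeping only deletions that leave a lint-clean program, until the file
    is empty. Reversing the saved program states and diffing consecutive
    states yields edits that build the file up chunk by chunk.
    """
    states = [source.splitlines()]
    while states[-1]:
        lines = states[-1]
        # Try random deletions until one survives the lint check.
        for _ in range(100):
            i = rng.randrange(len(lines))
            j = min(len(lines), i + rng.randint(1, 3))  # drop 1-3 lines
            candidate = lines[:i] + lines[j:]
            if lints_clean("\n".join(candidate)):
                states.append(candidate)
                break
        else:
            states.append([])  # fall back: wipe the rest in one edit
    states.reverse()  # now runs from empty file to full program
    edits = []
    for before, after in zip(states, states[1:]):
        diff = difflib.unified_diff(before, after, lineterm="")
        edits.append("\n".join(diff))
    return edits
```

Applying the returned diffs in order to an empty file reconstructs the original program, so a training example can pair a prompt with this edit sequence and a fine-tuned model learns to emit one edit at a time rather than a whole file in a single pass.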
Keywords
* Artificial intelligence
* Fine-tuning
* Synthetic data