Summary of NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts, by Shudan Zhang et al.
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
by Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang
First submitted to arXiv on: 7 May 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG); Software Engineering (cs.SE)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The proposed NaturalCodeBench (NCB) benchmark aims to address the limitations of current code synthesis benchmarks such as HumanEval, MBPP, and DS-1000. These existing benchmarks focus on introductory algorithmic and data-science tasks, which are not challenging enough to reflect real-world coding scenarios. To fill this gap, NCB consists of 402 high-quality Python and Java problems, carefully selected from natural user queries on online coding services and spanning six domains. The benchmark also includes a semi-automated pipeline for test case construction, making it more than four times as efficient as fully manual construction (an illustrative evaluation sketch follows this table). Systematic experiments on 39 large language models (LLMs) reveal significant performance gaps between models with similar HumanEval scores, indicating either a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. Even the best-performing model, GPT-4, remains far from satisfactory on NCB. |
| Low | GrooveSquid.com (original content) | NCB is a new code benchmark designed to test large language models (LLMs) on real-world coding tasks. Current benchmarks like HumanEval and MBPP are too easy, so NCB offers 402 Python and Java problems that are harder and more varied. The developers also built a semi-automated tool that speeds up writing test cases for the benchmark. They evaluated 39 different LLMs and found that even the best one, GPT-4, still falls short on these coding tasks. |
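To make the summaries above more concrete, here is a minimal, hypothetical sketch of how a test-case-based benchmark such as NCB typically scores a model-generated solution: each problem ships with hand-written test cases, and the candidate is scored by how many of them it passes. The `Problem` class, `evaluate` function, and the toy problem below are illustrative assumptions, not code from the NaturalCodeBench release.

```python
# Minimal sketch of functional-correctness scoring in the spirit of
# test-case-based benchmarks such as NCB or HumanEval. The problem,
# candidate solution, and tests below are hypothetical illustrations,
# not items from the actual NaturalCodeBench release.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    prompt: str                              # natural-language task description
    tests: List[Callable[[Callable], None]]  # each test raises on failure


def evaluate(candidate: Callable, problem: Problem) -> float:
    """Return the fraction of test cases the candidate solution passes."""
    passed = 0
    for test in problem.tests:
        try:
            test(candidate)
            passed += 1
        except Exception:
            pass  # a raised assertion or error counts as a failed test
    return passed / len(problem.tests)


# Hypothetical model-generated solution for a toy prompt.
def model_solution(items):
    return sorted(set(items))


def test_basic(f):
    assert f([3, 1, 3, 2]) == [1, 2, 3]


def test_empty(f):
    assert f([]) == []


toy_problem = Problem(
    prompt="Return the unique elements of a list in ascending order.",
    tests=[test_basic, test_empty],
)

print(f"pass rate: {evaluate(model_solution, toy_problem):.2f}")  # prints 1.00
```

In practice, benchmark harnesses of this kind run untrusted model output in a sandboxed subprocess with a timeout; the sketch omits that for brevity.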
Keywords
» Artificial intelligence » GPT » Optimization