Summary of NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts, by Shudan Zhang et al.
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
by Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang
First submitted to arXiv on: 7 May 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG); Software Engineering (cs.SE)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The proposed NaturalCodeBench (NCB) benchmark aims to address the limitations of current code synthesis benchmarks such as HumanEval, MBPP, and DS-1000. These existing benchmarks focus on introductory algorithmic and data-science tasks, which are not challenging enough to reflect real-world coding scenarios. To fill this gap, NCB consists of 402 high-quality Python and Java problems, carefully selected from natural user queries on online coding services and spanning six domains. The benchmark also includes a semi-automated pipeline for test case construction, making it more than four times as efficient as fully manual construction (an illustrative evaluation sketch follows this table). Systematic experiments on 39 large language models (LLMs) reveal significant performance gaps between models with similar HumanEval scores, indicating either a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. Even the best-performing model, GPT-4, remains far from satisfactory on NCB. |
| Low | GrooveSquid.com (original content) | NCB is a new code benchmark designed to test large language models (LLMs) on real-world coding tasks. Current benchmarks like HumanEval and MBPP are too easy, so NCB offers 402 Python and Java problems that are harder and more varied. The developers also built a semi-automated tool that speeds up writing test cases for the benchmark. They evaluated 39 different LLMs and found that even the best one, GPT-4, still falls short on these coding tasks. |
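To make the summaries above more concrete, here is a minimal, hypothetical sketch of how a test-case-based benchmark such as NCB typically scores a model-generated solution: each problem ships with hand-written test cases, and the candidate is scored by how many of them it passes. The `Problem` class, `evaluate` function, and the toy problem below are illustrative assumptions, not code from the NaturalCodeBench release.

```python
# Minimal sketch of functional-correctness scoring in the spirit of
# test-case-based benchmarks such as NCB or HumanEval. The problem,
# candidate solution, and tests below are hypothetical illustrations,
# not items from the actual NaturalCodeBench release.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    prompt: str                              # natural-language task description
    tests: List[Callable[[Callable], None]]  # each test raises on failure


def evaluate(candidate: Callable, problem: Problem) -> float:
    """Return the fraction of test cases the candidate solution passes."""
    passed = 0
    for test in problem.tests:
        try:
            test(candidate)
            passed += 1
        except Exception:
            pass  # a raised assertion or error counts as a failed test
    return passed / len(problem.tests)


# Hypothetical model-generated solution for a toy prompt.
def model_solution(items):
    return sorted(set(items))


def test_basic(f):
    assert f([3, 1, 3, 2]) == [1, 2, 3]


def test_empty(f):
    assert f([]) == []


toy_problem = Problem(
    prompt="Return the unique elements of a list in ascending order.",
    tests=[test_basic, test_empty],
)

print(f"pass rate: {evaluate(model_solution, toy_problem):.2f}")  # prints 1.00
```

In practice, benchmark harnesses of this kind run untrusted model output in a sandboxed subprocess with a timeout; the sketch omits that for brevity.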
Keywords
» Artificial intelligence » GPT » Optimization