Summary of CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution, by Ruiyang Xu et al.
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution
by Ruiyang Xu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, Le Sun
First submitted to arXiv on: 23 Aug 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The CRUXEval-X benchmark aims to address the language bias and task bias that affect evaluations of Large Language Models’ (LLMs) coding capabilities. The authors propose a multilingual code reasoning benchmark covering 19 programming languages, with at least 600 subjects per language and about 19K content-consistent tests in total. The construction pipeline is fully automated and test-guided, iteratively generating and repairing candidates based on execution feedback (a rough sketch of such a loop appears after this table). The authors also formulate transition rules between language pairs to facilitate translation. Evaluating 24 representative LLMs reveals correlations between performance across language pairs: TypeScript and JavaScript are strongly positively correlated, while Racket correlates only weakly with the other languages. Furthermore, even a model trained solely on Python can reach at most 34.4% Pass@1 in other languages, revealing the extent of LLMs’ cross-language generalization. |
Low | GrooveSquid.com (original content) | The paper proposes a new code reasoning benchmark called CRUXEval-X to evaluate Large Language Models’ (LLMs) coding abilities across different programming languages. The benchmark covers 19 languages and roughly 19,000 tests. Its construction is fully automated, which makes it cheaper to build and extend than hand-written benchmarks. The authors also found that LLMs trained mostly on one language can still solve some tasks in other languages, which matters for cross-language coding. |
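The pipeline described in the medium summary (translate a subject, run its tests, repair from the error feedback) can be pictured with a short sketch. This is not the authors’ implementation: the `translate`, `repair`, and `run_tests` hooks are hypothetical placeholders for the model call and test harness, and only the control flow of a test-guided generate-and-repair loop is shown.

```python
from typing import Callable, Optional

def build_subject(
    python_source: str,
    tests: str,
    target_lang: str,
    translate: Callable[[str, str], str],                    # (source, target_lang) -> candidate code
    repair: Callable[[str, str, str], str],                  # (candidate, error_log, target_lang) -> repaired code
    run_tests: Callable[[str, str, str], tuple[bool, str]],  # (candidate, tests, target_lang) -> (passed, error_log)
    max_rounds: int = 3,
) -> Optional[str]:
    """Translate one Python subject into `target_lang`, then iteratively
    repair it using execution feedback until its content-consistent tests
    pass or the round budget is exhausted."""
    candidate = translate(python_source, target_lang)
    for _ in range(max_rounds):
        passed, error_log = run_tests(candidate, tests, target_lang)
        if passed:
            return candidate  # candidate is accepted into the benchmark
        candidate = repair(candidate, error_log, target_lang)
    return None  # drop subjects whose tests never pass
```

In such a scheme, a subject that never passes its tests within the round budget is simply discarded, which is one way a pipeline like this could stay fully automated without manual review.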
Keywords
» Artificial intelligence » Generalization » Translation