Summary of CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution, by Ruiyang Xu et al.
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution
by Ruiyang Xu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, Le Sun
First submitted to arXiv on: 23 Aug 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The CRUXEval-X benchmark aims to address the language bias and task bias that affect evaluations of Large Language Models’ (LLMs) coding capabilities. The authors propose a multilingual code reasoning benchmark covering 19 programming languages, with at least 600 subjects per language and about 19K content-consistent tests in total. The construction pipeline is fully automated and test-guided, iteratively generating and repairing candidates based on execution feedback (a rough sketch of such a loop appears after this table). The authors also formulate transition rules between language pairs to facilitate translation. Evaluating 24 representative LLMs reveals correlations between performance across language pairs: TypeScript and JavaScript are strongly positively correlated, while Racket correlates only weakly with the other languages. Furthermore, even a model trained solely on Python can reach at most 34.4% Pass@1 in other languages, revealing the extent of LLMs’ cross-language generalization. |
Low | GrooveSquid.com (original content) | The paper proposes a new code reasoning benchmark called CRUXEval-X to evaluate Large Language Models’ (LLMs) coding abilities across different programming languages. The benchmark covers 19 languages and roughly 19,000 tests. Its construction is fully automated, which makes it cheaper to build and extend than hand-written benchmarks. The authors also found that LLMs trained mostly on one language can still solve some tasks in other languages, which matters for cross-language coding. |
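The pipeline described in the medium summary (translate a subject, run its tests, repair from the error feedback) can be pictured with a short sketch. This is not the authors’ implementation: the `translate`, `repair`, and `run_tests` hooks are hypothetical placeholders for the model call and test harness, and only the control flow of a test-guided generate-and-repair loop is shown.

```python
from typing import Callable, Optional

def build_subject(
    python_source: str,
    tests: str,
    target_lang: str,
    translate: Callable[[str, str], str],                    # (source, target_lang) -> candidate code
    repair: Callable[[str, str, str], str],                  # (candidate, error_log, target_lang) -> repaired code
    run_tests: Callable[[str, str, str], tuple[bool, str]],  # (candidate, tests, target_lang) -> (passed, error_log)
    max_rounds: int = 3,
) -> Optional[str]:
    """Translate one Python subject into `target_lang`, then iteratively
    repair it using execution feedback until its content-consistent tests
    pass or the round budget is exhausted."""
    candidate = translate(python_source, target_lang)
    for _ in range(max_rounds):
        passed, error_log = run_tests(candidate, tests, target_lang)
        if passed:
            return candidate  # candidate is accepted into the benchmark
        candidate = repair(candidate, error_log, target_lang)
    return None  # drop subjects whose tests never pass
```

In such a scheme, a subject that never passes its tests within the round budget is simply discarded, which is one way a pipeline like this could stay fully automated without manual review.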
Keywords
» Artificial intelligence » Generalization » Translation