Summary of CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models, by Yizhi Li et al.


CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

by Yizhi LI, Ge Zhang, Xingwei Qu, Jiali Li, Zhaoqun Li, Zekun Wang, Hao Li, Ruibin Yuan, Yinghao Ma, Kai Zhang, Wangchunshu Zhou, Yiming Liang, Lei Zhang, Lei Ma, Jiajun Zhang, Zuowen Li, Stephen W. Huang, Chenghua Lin, Jie Fu

First submitted to arXiv on: 20 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract on the paper's arXiv page.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper introduces the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of large language models (LLMs) to the Chinese language. The benchmark comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, only half of the dataset is released publicly, with the remainder kept private. In addition, each instruction is diversified into multiple variants to minimize score variance, bringing the dataset to 45,000 instances in total. An evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts.
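
To make the evaluation setup concrete, here is a minimal sketch of what a zero-shot run over a benchmark like this might look like. All identifiers below (load_cif_bench, query_model, exact_match) are hypothetical placeholders for illustration, not the authors' actual pipeline, and the paper's real scoring is more involved than simple exact match.

    # Minimal sketch of a zero-shot benchmark evaluation loop.
    # load_cif_bench and query_model are hypothetical stand-ins,
    # not the authors' released code.
    from statistics import mean

    def load_cif_bench():
        """Stand-in loader: yields (instruction, input_text, reference) triples."""
        # In practice the public split would be loaded from the released dataset.
        return [
            ("判断下列句子的情感倾向（积极/消极）。", "这部电影太精彩了！", "积极"),
        ]

    def query_model(prompt: str) -> str:
        """Stand-in for a call to the LLM under evaluation (API or local)."""
        return "积极"  # placeholder response

    def exact_match(prediction: str, reference: str) -> float:
        """Toy metric: 1.0 on an exact string match, else 0.0."""
        return 1.0 if prediction.strip() == reference.strip() else 0.0

    scores = []
    for instruction, input_text, reference in load_cif_bench():
        # Zero-shot: the model sees only the instruction and the input,
        # with no in-context examples.
        prompt = f"{instruction}\n{input_text}"
        scores.append(exact_match(query_model(prompt), reference))

    print(f"Average score: {mean(scores):.1%}")

Running each instruction in several diversified phrasings, as the paper does, would simply mean looping this evaluation over the variants and averaging, which reduces the sensitivity of the final score to any single wording.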
Low Difficulty Summary (original content by GrooveSquid.com)
The paper helps us understand how well big language models handle tasks they have not been explicitly trained on. These models are already good at things like answering questions or translating text from one language to another, but they tend to struggle in languages such as Chinese, which are less well represented in their training data than English. To test how well the models do in Chinese, the researchers created a special benchmark with 150 tasks and thousands of examples. They then tested 28 different models and found that none did very well: the best one got only about 53% correct. This shows that big language models still have a lot to learn when it comes to working with languages like Chinese.

Keywords

» Artificial intelligence  » Zero shot