Summary of An Empirical Study on Large Language Models in Accuracy and Robustness Under Chinese Industrial Scenarios, by Zongjie Li et al.
An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios
by Zongjie Li, Wenying Qiu, Pingchuan Ma, Yichen Li, You Li, Sijia He, Baozheng Jiang, Shuai Wang, Weixi Gu
First submitted to arXiv on: 27 Jan 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary | 
|---|---|---|
| High | Paper authors | High Difficulty Summary: The paper's original abstract, available on its arXiv page. | 
| Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: The paper presents a comprehensive empirical study of the accuracy and robustness of large language models (LLMs) in Chinese industrial production. It evaluates 9 LLMs developed by Chinese vendors and 4 developed by global vendors, using domain-specific problems to measure accuracy and a metamorphic testing framework to measure robustness. The results show that current LLMs exhibit low accuracy (less than 0.6) in Chinese industrial contexts, with local LLMs performing worse than global ones overall. Robustness scores vary across industrial sectors and abilities, highlighting the need for further research and tooling support. | 
| Low | GrooveSquid.com (original content) | Low Difficulty Summary: The paper looks at how good big language models are at understanding problems from different industries in China. It tests 13 models (9 from Chinese vendors and 4 from global ones) on many different types of questions to see how well they do. The results show that these models aren't very accurate (all score below 60%) and have trouble with certain types of questions. The study helps us understand what we can expect from these models in real-world situations and where we might need to improve them. | 
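
To make the two measurements in the summaries concrete, here is a minimal Python sketch of how accuracy and a metamorphic-style robustness score could be computed. It is an illustration only, not the authors' framework: `toy_model`, the sample questions, and the two variation functions are all hypothetical stand-ins, and the paper's own variations and scoring may differ. The idea is that a meaning-preserving rewrite of a question should not change the model's answer.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]
Variation = Callable[[str], str]


def accuracy(model: Model, problems: Dict[str, str]) -> float:
    """Fraction of problems whose answer matches the reference answer."""
    correct = sum(model(q).strip() == a for q, a in problems.items())
    return correct / len(problems)


def robustness(model: Model, questions: List[str],
               variations: List[Variation]) -> float:
    """Fraction of (question, variation) pairs whose answer is unchanged."""
    stable = 0
    total = 0
    for q in questions:
        baseline = model(q).strip()
        for vary in variations:
            total += 1
            if model(vary(q)).strip() == baseline:
                stable += 1
    return stable / total


# Hypothetical meaning-preserving variations (not taken from the paper).
def add_polite_prefix(q: str) -> str:
    return "Please answer: " + q


def add_irrelevant_suffix(q: str) -> str:
    return q + " (Note: the factory was built in 1998.)"


# Toy stand-in for an LLM: a keyword lookup that is deliberately
# confused by the irrelevant year in the suffix variation.
def toy_model(question: str) -> str:
    if "1998" in question:
        return "unknown"
    if "torque" in question:
        return "newton-metre"
    if "pressure" in question:
        return "pascal"
    return "unknown"


# Hypothetical domain questions with reference answers.
problems = {
    "What is the SI unit of torque?": "newton-metre",
    "What is the SI unit of pressure?": "pascal",
    "What is the SI unit of viscosity?": "pascal-second",
}

acc = accuracy(toy_model, problems)
rob = robustness(toy_model, list(problems),
                 [add_polite_prefix, add_irrelevant_suffix])
print(f"accuracy={acc:.2f} robustness={rob:.2f}")
```

In this sketch robustness is measured against the model's own baseline answer rather than the reference answer, so a model can be consistently wrong and still count as stable. That separation is a design choice of the example, which keeps the two scores independent.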




