NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
by Miao Li, Ming-Bin Chen, Bo Tang, Shengbin Hou, Pengyu Wang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Keming Mao, Peng Cheng, Yi Luo
First submitted to arXiv on: 29 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the paper's original abstract on arXiv.
Medium | GrooveSquid.com (original content) | NewsBench is a novel evaluation framework for assessing the capabilities of Large Language Models (LLMs) on editorial tasks in Chinese journalism. The benchmark dataset comprises 1,267 test samples in multiple-choice and short-answer formats, covering writing proficiency and safety adherence across five editorial tasks in 24 news domains. To measure performance, the authors propose GPT-4-based automatic evaluation protocols that score LLM generations for writing proficiency and safety adherence, validated by high correlations with human judgments (a hedged sketch of such a judge-style call follows this table). Applying the framework to ten popular LLMs capable of handling Chinese reveals top performers such as GPT-4 and ERNIE Bot, but also a consistent deficiency in journalistic safety adherence on creative writing tasks.
Low | GrooveSquid.com (original content) | Large language models are being trained to help journalists write articles. But how well do they really do? To find out, researchers created a special test set of multiple-choice and short-answer questions. The questions cover areas like writing skill and making sure the content is safe. The researchers used a computer program built on a popular language model, GPT-4, to grade the answers automatically. They tested ten different models and found that some were better than others at writing articles. Some models, like GPT-4 and ERNIE Bot, were really good! But even the best models struggled to keep content safe when the task involved creative writing. This means we need to guide these language models in ways that help them write responsibly.
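
To make the automatic evaluation concrete, here is a minimal Python sketch of a GPT-4-based "LLM-as-judge" scoring call in the spirit of the protocol the summary describes. The prompt wording, the 1-5 scale, and the helper name `judge_writing_proficiency` are illustrative assumptions, not NewsBench's actual prompts or rubric.

```python
# Hypothetical sketch of a GPT-4-based "LLM-as-judge" scoring call.
# The rubric text, the 1-5 scale, and the function name are illustrative
# assumptions -- they are NOT NewsBench's actual evaluation protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading a Chinese journalistic editing task.

Task instruction:
{instruction}

Candidate answer:
{answer}

Rate the answer's writing proficiency from 1 to 5
(5 = publishable journalistic quality). Reply with the number only."""


def judge_writing_proficiency(instruction: str, answer: str) -> int:
    """Ask GPT-4 to score one generation; returns an integer in [1, 5]."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(instruction=instruction,
                                           answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())


# Example: score a candidate headline (both strings are made-up samples).
score = judge_writing_proficiency(
    "Write a concise Chinese headline for a story about a city park reopening.",
    "城市公园焕新开放，市民共享绿色空间",
)
print(score)  # e.g. 4
```

In practice such a judge would be run over every test sample and its scores checked against human ratings, which is how the paper reports validating its protocols.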
Keywords
» Artificial intelligence » GPT » Language model