NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
by Miao Li, Ming-Bin Chen, Bo Tang, Shengbin Hou, Pengyu Wang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Keming Mao, Peng Cheng, Yi Luo
First submitted to arXiv on: 29 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the paper's original abstract on arXiv.
Medium | GrooveSquid.com (original content) | NewsBench is a novel evaluation framework for assessing the capabilities of Large Language Models (LLMs) on editorial tasks in Chinese journalism. The benchmark dataset comprises 1,267 test samples in multiple-choice and short-answer formats, covering writing proficiency and safety adherence across five editorial tasks in 24 news domains. To measure performance, the authors propose GPT-4-based automatic evaluation protocols that score LLM generations for writing proficiency and safety adherence, validated by high correlations with human judgments (a hedged sketch of such a judge-style call follows this table). Applying the framework to ten popular LLMs capable of handling Chinese reveals top performers such as GPT-4 and ERNIE Bot, but also a consistent deficiency in journalistic safety adherence on creative writing tasks.
Low | GrooveSquid.com (original content) | Large language models are being trained to help journalists write articles. But how well do they really do? To find out, researchers created a special test set of multiple-choice and short-answer questions. The questions cover areas like writing skill and making sure the content is safe. The researchers used a computer program built on a popular language model, GPT-4, to grade the answers automatically. They tested ten different models and found that some were better than others at writing articles. Some models, like GPT-4 and ERNIE Bot, were really good! But even the best models struggled to keep content safe when the task involved creative writing. This means we need to guide these language models in ways that help them write responsibly.
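
To make the automatic evaluation concrete, here is a minimal Python sketch of a GPT-4-based "LLM-as-judge" scoring call in the spirit of the protocol the summary describes. The prompt wording, the 1-5 scale, and the helper name `judge_writing_proficiency` are illustrative assumptions, not NewsBench's actual prompts or rubric.

```python
# Hypothetical sketch of a GPT-4-based "LLM-as-judge" scoring call.
# The rubric text, the 1-5 scale, and the function name are illustrative
# assumptions -- they are NOT NewsBench's actual evaluation protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading a Chinese journalistic editing task.

Task instruction:
{instruction}

Candidate answer:
{answer}

Rate the answer's writing proficiency from 1 to 5
(5 = publishable journalistic quality). Reply with the number only."""


def judge_writing_proficiency(instruction: str, answer: str) -> int:
    """Ask GPT-4 to score one generation; returns an integer in [1, 5]."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(instruction=instruction,
                                           answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())


# Example: score a candidate headline (both strings are made-up samples).
score = judge_writing_proficiency(
    "Write a concise Chinese headline for a story about a city park reopening.",
    "城市公园焕新开放，市民共享绿色空间",
)
print(score)  # e.g. 4
```

In practice such a judge would be run over every test sample and its scores checked against human ratings, which is how the paper reports validating its protocols.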
Keywords
» Artificial intelligence » GPT » Language model