
Summary of ChineseWebText 2.0: Large-Scale High-Quality Chinese Web Text with Multi-Dimensional and Fine-Grained Information, by Wanyue Zhang et al.


ChineseWebText 2.0: Large-Scale High-Quality Chinese Web Text with Multi-Dimensional and Fine-Grained Information

by Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang

First submitted to arXiv on: 29 Nov 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
A novel tool-chain, MDFG-tool, is proposed to construct large-scale, high-quality Chinese datasets with multi-dimensional and fine-grained information for training powerful and reliable language models. The approach first discards noisy texts using manually crafted rules, then performs quality evaluation, domain classification, and toxicity assessment. The result is the release of ChineseWebText 2.0, a 3.8 TB dataset in which each text is associated with a quality score, domain labels, a toxicity label, and a toxicity score, enabling language model researchers to select data according to these fine-grained criteria (see the sketch after these summaries).

Low Difficulty Summary (written by GrooveSquid.com, original content)
Building powerful and reliable language models requires large-scale, high-quality datasets with multi-dimensional and fine-grained information. To address this challenge, a new tool-chain called MDFG-tool is proposed for constructing such Chinese datasets. The approach cleans raw content with manually crafted rules and then applies quality evaluation, domain classification, and toxicity assessment. The result is the release of ChineseWebText 2.0, a large-scale dataset that can facilitate language model research.

Keywords

» Artificial intelligence  » Classification  » Language model