
Summary of ChineseWebText 2.0: Large-Scale High-Quality Chinese Web Text with Multi-Dimensional and Fine-Grained Information, by Wanyue Zhang et al.


ChineseWebText 2.0: Large-Scale High-Quality Chinese Web Text with Multi-Dimensional and Fine-Grained Information

by Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang

First submitted to arXiv on: 29 Nov 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
A novel tool-chain, MDFG-tool, is proposed to construct large-scale, high-quality Chinese datasets with multi-dimensional and fine-grained information for training powerful and reliable language models. The approach first discards noisy texts using manually crafted rules, then performs quality evaluation, domain classification, and toxicity assessment. The result is the release of ChineseWebText 2.0, a 3.8 TB dataset in which each text is associated with a quality score, domain labels, a toxicity label, and a toxicity score, enabling language model researchers to select data according to these fine-grained criteria (see the sketch after these summaries).

Low Difficulty Summary (written by GrooveSquid.com, original content)
Building powerful and reliable language models requires large-scale, high-quality datasets with multi-dimensional and fine-grained information. To address this challenge, a new tool-chain called MDFG-tool is proposed for constructing such Chinese datasets. The approach cleans raw content with manually crafted rules and then applies quality evaluation, domain classification, and toxicity assessment. The result is the release of ChineseWebText 2.0, a large-scale dataset that can facilitate language model research.

Keywords

» Artificial intelligence  » Classification  » Language model