Summary of ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information, by Wanyue Zhang et al.
ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
by Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang
First submitted to arXiv on: 29 Nov 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract (see the arXiv listing) |
Medium | GrooveSquid.com (original content) | The paper proposes MDFG-tool, a tool-chain for constructing large-scale, high-quality Chinese datasets annotated with multi-dimensional, fine-grained information for training powerful and reliable language models. The pipeline first applies manually crafted rules to discard noisy texts, then performs quality evaluation, domain classification, and toxicity assessment. The result is ChineseWebText2.0, a 3.8TB dataset in which texts are associated with quality scores, domain labels, toxicity labels, and toxicity scores, so language model researchers can select data according to these fine-grained attributes. |
Low | GrooveSquid.com (original content) | Building powerful and reliable language models requires large, high-quality datasets annotated with multi-dimensional, fine-grained information. To meet this need, the paper proposes a new tool-chain called MDFG-tool for constructing Chinese datasets. It cleans raw content with manually crafted rules, then applies quality evaluation, domain classification, and toxicity assessment. The result is ChineseWebText2.0, a large-scale dataset intended to support language model research. |
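
The summaries above say the dataset's annotations let researchers select data by quality, domain, and toxicity. A minimal sketch of what such a selection could look like follows. The `Record` class, its field names, the score directions (higher quality score is better, higher toxicity score is worse), and the `select_records` helper are all assumptions made for illustration; they are not the dataset's actual schema or part of MDFG-tool.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass
class Record:
    # Hypothetical schema: these field names are illustrative only.
    text: str
    quality_score: float    # assumed: higher means better quality
    domain_labels: Set[str]
    is_toxic: bool          # toxicity label
    toxicity_score: float   # assumed: higher means more toxic


def select_records(
    records: Iterable[Record],
    min_quality: float = 0.0,
    domains: Optional[Set[str]] = None,
    max_toxicity: float = 1.0,
) -> Iterator[Record]:
    """Yield records that pass the quality, domain, and toxicity filters."""
    for rec in records:
        # Drop texts below the requested quality threshold.
        if rec.quality_score < min_quality:
            continue
        # If domains are given, keep only texts sharing at least one label.
        if domains is not None and not (rec.domain_labels & domains):
            continue
        # Drop texts labeled toxic or scoring above the toxicity cap.
        if rec.is_toxic or rec.toxicity_score > max_toxicity:
            continue
        yield rec


if __name__ == "__main__":
    sample = [
        Record("text A", 0.90, {"medicine"}, False, 0.01),
        Record("text B", 0.40, {"medicine"}, False, 0.01),
        Record("text C", 0.95, {"law"}, False, 0.02),
    ]
    for rec in select_records(sample, min_quality=0.8, domains={"medicine"}):
        print(rec.text)
```

Writing the filter as a generator keeps memory use flat, which matters for a corpus measured in terabytes; the thresholds themselves would need to be chosen per use case.
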
Keywords
» Artificial intelligence » Classification » Language model