Summary of Building a Large Japanese Web Corpus for Large Language Models, by Naoaki Okazaki et al.
Building a Large Japanese Web Corpus for Large Language Models
by Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki
First submitted to arXiv on: 27 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This study builds a large-scale Japanese corpus for training language models by extracting and refining text from the Common Crawl archive. The resulting corpus contains approximately 312.1 billion characters, making it larger than existing corpora such as CC-100 and mC4. To validate its quality, the authors performed continual pre-training on several base LLMs and observed consistent improvements on Japanese benchmark datasets. Notably, among the corpora compared, the presented corpus yielded the largest improvement for Llama 2 13B. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This study creates a huge collection of Japanese text from the internet to train language models. It’s like a big library that helps machines understand Japanese better. The new corpus is much bigger than earlier ones, and language models trained on it get better at understanding Japanese text. This can help with things like chatbots, language translation, and more. |
Keywords
» Artificial intelligence » Language model » Llama » Translation