Summary of Building a Large Japanese Web Corpus For Large Language Models, by Naoaki Okazaki et al.


Building a Large Japanese Web Corpus for Large Language Models

by Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki

First submitted to arXiv on: 27 Apr 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty Written by Summary
High Paper authors High Difficulty Summary
Read the original abstract here
Medium GrooveSquid.com (original content) Medium Difficulty Summary
This study builds a large-scale Japanese training corpus for language models by extracting and refining text from the Common Crawl archive. The resulting corpus contains approximately 312.1 billion characters, making it larger than existing corpora such as CC-100 and mC4. To validate its quality, the authors performed continual pre-training on several base LLMs and observed consistent improvements on Japanese benchmark datasets. Notably, among the corpora compared, the presented corpus yielded the largest improvement for Llama 2 13B.
Low GrooveSquid.com (original content) Low Difficulty Summary
This study creates a huge collection of Japanese text from the internet to train language models. It’s like a big library that helps machines understand Japanese better. The new corpus is much bigger than what existed before and makes language models smarter when it comes to understanding Japanese texts. This can help with things like chatbots, language translation, and more.

Keywords

» Artificial intelligence  » Language model  » Llama  » Translation