Summary of A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism, by Brian Thompson et al.
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
by Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
First submitted to arXiv on: 11 Jan 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper investigates how much web content is machine-translated and what that means for the quality of multilingual data. The authors find that web content is often translated into many languages at once, and that the low quality of these multi-way translations indicates they were likely produced by Machine Translation (MT). In lower-resource languages, such machine-generated content not only dominates the translations but also makes up a large fraction of all web content in those languages. The authors also find evidence of a selection bias in which content gets translated into many languages, consistent with low-quality English content being mass-translated into lower-resource languages via MT. This raises serious concerns about training models, such as multilingual large language models, on data scraped from the web. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper looks at how much of the internet consists of translations made by machines instead of people. These machine-made translations are often low quality, and they dominate online content in languages that have fewer resources. The research shows that they make up a big part of everything available online in those languages. It also finds that the English content that gets translated into many languages tends to be low quality to begin with. This means we might be training AI models on bad data. |
Keywords
» Artificial intelligence » Translation