Summary of Bridging the Data Provenance Gap Across Text, Speech and Video, by Shayne Longpre et al.
Bridging the Data Provenance Gap Across Text, Speech and Video
by Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Naana Obeng-Marnu, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
First submitted to arXiv on: 19 Dec 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Multimedia (cs.MM)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper conducts the largest-ever longitudinal audit of popular text, speech, and video datasets from 1990 to 2024, analyzing their sourcing trends, use restrictions, and geographical and linguistic representation. The analysis covers nearly 4,000 public datasets spanning 608 languages, 798 sources, 659 organizations, and 67 countries. It finds that since 2019, multimodal machine learning applications have relied overwhelmingly on web-crawled, synthetic, and social media sources such as YouTube. It also reveals that while only about a third of datasets are restrictively licensed, over 80% of the source content in widely used datasets carries non-commercial restrictions. Despite the increasing representation of languages and geographies, measures of relative geographical and multilingual representation have not improved significantly since 2013. The study empirically examines trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, providing essential insights for responsible AI development. |
| Low | GrooveSquid.com (original content) | This paper looks at how AI has been trained using datasets from the past few decades. Researchers analyzed nearly 4,000 public datasets across text, speech, and video formats. They found that many AI models rely on internet data and social media platforms like YouTube. The study also shows that most dataset sources carry restrictions on commercial use, which could impact AI development. Additionally, although more languages and countries are represented in datasets, their relative representation has not improved significantly since 2013. |
Keywords
- Artificial intelligence
- Machine learning