Summary of Bridging the Data Provenance Gap Across Text, Speech and Video, by Shayne Longpre et al.
Bridging the Data Provenance Gap Across Text, Speech and Video
by Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Naana Obeng-Marnu, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
First submitted to arXiv on: 19 Dec 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Multimedia (cs.MM)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper conducts the largest-ever longitudinal audit of popular text, speech, and video datasets from 1990 to 2024, analyzing their sourcing trends, use restrictions, and geographical and linguistic representation. The analysis covers nearly 4,000 public datasets spanning 608 languages, 798 sources, 659 organizations, and 67 countries. It finds that since 2019, multimodal machine learning applications have relied overwhelmingly on web-crawled, synthetic, and social media sources such as YouTube. It also reveals that while only about a third of datasets are restrictively licensed, over 80% of the source content in widely used datasets carries non-commercial restrictions. Despite the increasing representation of languages and geographies, measures of relative geographical and multilingual representation have not improved significantly since 2013. The study empirically examines trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, providing essential insights for responsible AI development. |
| Low | GrooveSquid.com (original content) | This paper looks at how AI has been trained using datasets from the past few decades. Researchers analyzed nearly 4,000 public datasets across text, speech, and video formats. They found that many AI models rely on internet data and social media platforms like YouTube. The study also shows that most dataset sources carry restrictions on commercial use, which could impact AI development. Additionally, although more languages and countries are represented in datasets, their relative representation has not improved significantly since 2013. |
Keywords
- Artificial intelligence
- Machine learning