Summary of Consent in Crisis: The Rapid Decline of the AI Data Commons, by Shayne Longpre et al.
Consent in Crisis: The Rapid Decline of the AI Data Commons
by Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
First submitted to arXiv on: 20 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The paper presents a large-scale, longitudinal audit of the consent protocols for the web domains underlying artificial intelligence (AI) training corpora. The authors analyze 14,000 web domains to track how codified data-use preferences change over time. They find a proliferation of AI-specific clauses limiting data use, as well as inconsistencies between the intentions websites express in their Terms of Service and the restrictions they set in robots.txt. Within a single year, data restrictions from web sources rose rapidly, rendering large portions of the C4 corpus (a popular AI training dataset) fully restricted from use. If these restrictions are respected or enforced, they could affect general-purpose AI systems, commercial AI applications, non-commercial AI, and academic research. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary AI researchers are building big artificial intelligence systems using massive amounts of data from the internet. But did you know that most of this data is taken without people’s consent? The paper looks at how websites allow or don’t allow their data to be used in AI training. The authors found that many websites have changed their rules over time, and some are even restricting what AI developers can do with their data. This could make it harder for researchers to get the data they need to train AI systems. The authors think this is a big problem that needs to be solved. |
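
To make the idea of a "consent protocol" concrete: one common way a website signals whether crawlers may collect its pages is a robots.txt file. The sketch below is an illustration only, not the authors' audit code. It uses Python's standard `urllib.robotparser` to check a made-up robots.txt that blocks one AI crawler while allowing everyone else. The sample file contents, the `ExampleSearchBot` agent name, the `example.com` URL, and the `crawl_permissions` helper are all assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the GPTBot crawler, allows all other agents.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = crawl_permissions(
    SAMPLE_ROBOTS_TXT,
    ["GPTBot", "ExampleSearchBot"],  # ExampleSearchBot is a made-up agent name
    "https://example.com/article",
)
print(result)
```

Here the AI-specific rule applies only to the named crawler, while the wildcard rule covers every other agent. Rules of this kind express a site's preference but do not technically prevent crawling; they only have an effect if crawlers choose to honor them.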