The Atlantic Reveals 21 Million Songs Circulating in AI Datasets

An investigation published by The Atlantic, led by reporter Alex Reisner, found that four publicly circulating music datasets together contain roughly 21.2 million copyrighted recordings. Per The Atlantic, the largest collections are LAION-DISCO-12M (about 12.6 million tracks) and Sleeping-DISCO-9M (about 9 million tracks); the two smaller datasets include one drawn from the Free Music Archive (around 100,000 tracks). The Atlantic also launched a free searchable tool, the AI Watchdog, that lets artists check whether their work appears in the collections. Reporting by DJ Mag and MusicTech notes that Google and Stability AI were identified as having drawn on the Free Music Archive dataset. The investigation raises questions about consent, licensing, and how large-scale scraping is used in generative audio development.
What happened
An investigation published by The Atlantic, led by reporter Alex Reisner, found that four music datasets circulating in the AI development ecosystem together contain approximately 21.2 million recordings. Per The Atlantic, the two largest datasets are LAION-DISCO-12M, containing about 12.6 million tracks, and Sleeping-DISCO-9M, containing about 9 million tracks. The remaining two collections include the Free Music Archive dataset at roughly 100,000 tracks and a second smaller compilation described in the reporting. The Atlantic made the collections searchable through a free tool called the AI Watchdog so artists, labels, and others can query whether specific works appear in those lists, according to the published report.
Technical details
Per The Atlantic's reporting, as summarized by MusicTech and WeRaveYou, LAION-DISCO-12M was assembled by LAION using an automated recursive crawl that matched seed artist lists to streaming URLs, and it was released under an Apache 2.0 licence. The Sleeping-DISCO-9M dataset was compiled by the Sleeping AI Research Collective and has been hosted on platforms such as Hugging Face. The Atlantic's documentation and MusicTech note that many collections are distributed as lists of links rather than bundled audio files, and that developers commonly use automated download tools to retrieve audio at scale. MusicTech attributes to Reisner the observation that such download methods can bypass mechanisms that generate revenue for creators and may violate platform terms of service.
Industry context
Editorial analysis: Public reporting frames these findings as exposing a gap between how some developers describe training data and the practical realities of large-scale audio ingestion. The datasets mix commercially released music, independent releases, and Creative Commons material, which complicates questions of consent and compensation when the lists are repurposed for model training. Accounts quoted across outlets describe artists and producers reacting with concern after using the AI Watchdog tool and finding their own work included.
Context and significance
Editorial analysis: For practitioners building or auditing generative audio systems, the investigation highlights two persistent operational risks: dataset provenance and licensing ambiguity. Large link-based collections reduce friction for experimentation but also lower the barrier to ingesting commercially released content without explicit rights clearance. That pattern raises legal exposure for downstream users, and it increases the importance of provenance tracking, auditable licences, and supplier due diligence in dataset pipelines.
What to watch
Editorial analysis: Observers and rights holders will likely monitor three indicators:
- whether platform operators or major AI developers disclose more granular provenance for their audio training data
- legal or regulatory actions prompted by evidence surfaced through the AI Watchdog tool
- adoption of dataset filtering or provenance tooling by research groups and vendors
Reporting to date identifies Google and Stability AI as having drawn on the Free Music Archive dataset, per DJ Mag and MusicTech; however, The Atlantic notes that pinpointing which commercial systems used the larger link-based collections is difficult because training data disclosures remain sparse.
Implications for practitioners
Editorial analysis: Teams building generative audio models should treat large, publicly shared link lists as high-risk inputs until licence and provenance are validated. Structured approaches include maintaining link-level provenance metadata, prioritising openly licensed corpora, and integrating legal review into data ingestion workflows. The Atlantic's AI Watchdog provides an empirical starting point for rights holders seeking visibility, but it does not by itself resolve licensing or entitlement questions.
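As an illustration of the link-level provenance metadata and licence prioritisation described above, the following Python sketch shows one possible shape for an ingestion gate. It is not drawn from The Atlantic's reporting or from any dataset's tooling: the `LinkRecord` fields, the `triage` function, and the contents of `ALLOWED_LICENCES` are all hypothetical, and which licences belong on an allowlist is a question for legal review rather than for code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Hypothetical allowlist for illustration only; the real list is a legal decision.
ALLOWED_LICENCES = frozenset({"CC0-1.0", "CC-BY-4.0"})


@dataclass(frozen=True)
class LinkRecord:
    """Provenance metadata kept for one link in a link-based dataset."""
    url: str
    source_dataset: str             # name of the list the link came from
    licence: Optional[str] = None   # per-track licence identifier, if known
    rights_cleared: bool = False    # set only after explicit legal review


def triage(
    records: Iterable[LinkRecord],
    allowed: frozenset = ALLOWED_LICENCES,
) -> Tuple[List[LinkRecord], List[LinkRecord]]:
    """Split records into (usable, needs_review).

    A record is usable only if its per-track licence is on the allowlist
    or legal review has explicitly cleared it. A missing or unrecognised
    licence is never treated as permissive.
    """
    usable: List[LinkRecord] = []
    needs_review: List[LinkRecord] = []
    for rec in records:
        if rec.rights_cleared or rec.licence in allowed:
            usable.append(rec)
        else:
            needs_review.append(rec)
    return usable, needs_review


if __name__ == "__main__":
    # Placeholder records; the URLs and dataset name are invented.
    sample = [
        LinkRecord("https://example.com/a", "demo-list", licence="CC0-1.0"),
        LinkRecord("https://example.com/b", "demo-list"),
    ]
    ok, review = triage(sample)
    print(len(ok), "usable;", len(review), "held for review")
```

The sketch keys on a per-track licence rather than on the licence attached to the link list itself, on the assumption that a licence covering a list of links says nothing about the rights in the recordings those links point to. Holding records with unknown licences for review keeps the default conservative.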
Overall, the reporting consolidates multiple public datasets and a searchable tool that together make the scale and composition of audio training material more visible, raising operational, legal, and ethical questions for researchers, vendors, and rights holders.
Scoring Rationale
The story materially raises dataset-provenance and licensing risks that affect generative audio projects and compliance processes. It is notable for scale and visibility but stops short of announcing regulatory or platform-wide changes.