
The Atlantic Reveals 21 Million Songs Circulating in AI Datasets

June 24, 2026 | By LDS Team

Relevance Score: 7.2


An investigation published by The Atlantic, led by reporter Alex Reisner, found that four publicly circulating music datasets together contain roughly 21.2 million copyrighted recordings. Per The Atlantic, the largest collections are LAION-DISCO-12M (about 12.6 million tracks) and Sleeping-DISCO-9M (about 9 million tracks); the two smaller datasets include the Free Music Archive dataset (around 100,000 tracks). The Atlantic also launched a free searchable tool, the AI Watchdog, that lets artists check whether their work appears in the collections. Reporting by DJ Mag and MusicTech notes that Google and Stability AI were identified as having drawn on the Free Music Archive dataset. The investigation raises questions about consent, licensing, and how large-scale scraping is being used in generative audio development.

What happened

An investigation published by The Atlantic, led by reporter Alex Reisner, found that four music datasets circulating in the AI development ecosystem together contain approximately 21.2 million recordings. Per The Atlantic, the two largest datasets are LAION-DISCO-12M, containing about 12.6 million tracks, and Sleeping-DISCO-9M, containing about 9 million tracks. The remaining two collections include the Free Music Archive dataset at roughly 100,000 tracks and a second smaller compilation described in the reporting. The Atlantic made the collections searchable through a free tool called the AI Watchdog so artists, labels, and others can query whether specific works appear in those lists, according to the published report.

Technical details

Per The Atlantic's reporting, as summarized by MusicTech and WeRaveYou, LAION-DISCO-12M was assembled by LAION using an automated recursive crawl that matched seed artist lists to streaming URLs, and it was released under an Apache 2.0 licence. The Sleeping-DISCO-9M dataset was compiled by the Sleeping AI Research Collective and has been hosted on platforms such as Hugging Face. The Atlantic's documentation and MusicTech note that many collections are distributed as lists of links rather than bundled audio files, and that developers commonly use automated download tools to retrieve audio at scale. MusicTech attributes to Reisner the observation that such download methods can bypass mechanisms that generate revenue for creators and may violate platform terms of service.
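Because these collections are distributed as lists of links, a first audit step can run on the list itself without retrieving any audio. The sketch below tallies entries by host name. The function name `count_hosts` and the sample URLs are hypothetical and are not drawn from any of the datasets named above.

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(links):
    """Tally link-list entries by host name, without fetching anything."""
    hosts = Counter()
    for link in links:
        # hostname is None when the entry is not a full URL
        host = urlparse(link.strip()).hostname
        hosts[host or "<unparseable>"] += 1
    return hosts


# Hypothetical entries, for illustration only.
sample_links = [
    "https://music.example.com/track/1",
    "https://music.example.com/track/2",
    "https://archive.example.org/audio/abc",
    "not a url",
]

for host, n in count_hosts(sample_links).most_common():
    print(host, n)
```

A per-host tally of this kind only shows where a list points. It says nothing about the licence status of any individual recording.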

Industry context

Public reporting frames these findings as exposing a gap between how some developers describe their training data and the practical realities of large-scale audio ingestion. The datasets mix commercially released music, independent releases, and Creative Commons material, which complicates questions of consent and compensation when the lists are repurposed for model training. Coverage across outlets quotes artists and producers reacting with concern after using the AI Watchdog tool to find specific works listed.

Context and significance

For practitioners building or auditing generative audio systems, the investigation highlights two persistent operational risks: dataset provenance and licensing ambiguity. Large link-based collections reduce friction for experimentation but also lower the barrier to ingesting commercially released content without explicit rights clearance. That pattern raises legal exposure for downstream users, and it increases the importance of provenance tracking, auditable licences, and supplier due diligence in dataset pipelines.

What to watch

Observers and rights holders will likely monitor three indicators:

• whether platform operators or major AI developers disclose more granular provenance for their audio training data
• legal or regulatory actions prompted by evidence surfaced through the AI Watchdog tool
• adoption of dataset filtering or provenance tooling by research groups and vendors

Reporting to date identifies Google and Stability AI as having drawn on the Free Music Archive dataset, per DJ Mag and MusicTech. However, The Atlantic notes that pinpointing which commercial systems used the larger link-based collections is difficult because training data disclosures remain sparse.

Implications for practitioners

Teams building generative audio models should treat large, publicly shared link lists as high-risk inputs until licence and provenance are validated. Structured approaches include maintaining link-level provenance metadata, prioritising openly licensed corpora, and integrating legal review into data ingestion workflows. The Atlantic's AI Watchdog provides an empirical starting point for rights holders seeking visibility, but it does not by itself resolve licensing or entitlement questions.

Editorial analysis

Overall, the reporting consolidates multiple public datasets and a searchable tool that together make the scale and composition of audio training material more visible, raising operational, legal, and ethical questions for researchers, vendors, and rights holders.
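The link-level provenance metadata recommended for practitioners can be sketched as a small ingestion gate. This is a minimal illustration under stated assumptions: the `LinkRecord` fields, the `ALLOWED_LICENCES` set, and the `admit` and `partition` functions are hypothetical names. Which licences are acceptable is a matter for legal review and is not specified in the reporting.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: an allow-list agreed with legal review; these identifiers are placeholders.
ALLOWED_LICENCES = {"CC0-1.0", "CC-BY-4.0"}


@dataclass(frozen=True)
class LinkRecord:
    url: str
    source_dataset: str                     # which link list the entry came from
    licence: Optional[str] = None           # None when no licence has been established
    licence_evidence: Optional[str] = None  # pointer to the rights statement that was reviewed


def admit(record: LinkRecord) -> bool:
    """Admit only entries with an allow-listed licence and recorded evidence."""
    return record.licence in ALLOWED_LICENCES and bool(record.licence_evidence)


def partition(records):
    """Split records into (admitted, held_for_review)."""
    admitted, held = [], []
    for record in records:
        (admitted if admit(record) else held).append(record)
    return admitted, held


# Hypothetical records, for illustration only.
records = [
    LinkRecord("https://archive.example.org/a", "list-A", "CC-BY-4.0", "review-ticket-1"),
    LinkRecord("https://music.example.com/b", "list-B"),
    LinkRecord("https://music.example.com/c", "list-B", "CC0-1.0"),
]
admitted, held = partition(records)
print(len(admitted), "admitted;", len(held), "held for review")
```

Holding an entry for review whenever the licence or its evidence is missing follows the advice above to treat link lists as high-risk until validated.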

Key Points

1. The Atlantic's investigation finds roughly 21.2 million tracks across four datasets, exposing large-scale access to copyrighted music.
2. Link-based datasets and automated downloading simplify ingestion but create provenance and licensing ambiguity for model builders.
3. The free AI Watchdog tool gives rights holders empirical visibility, increasing the likelihood of legal and compliance scrutiny.

Scoring Rationale

The Atlantic's investigation gives rights holders and model builders their first searchable, public accounting of the scale of copyrighted music in AI training datasets: over 21 million recordings. The combination of empirical scale, a public lookup tool, and the ongoing Suno/Udio litigation makes the findings materially actionable for compliance and legal teams, though the investigation does not itself change licensing law or force platform changes.


Sources

Primary source and supporting public references used for this report.

7 sources

Primary source: djmag.com, "Over 21 million copyrighted songs are circulating among AI developers, watchdog tool launched"
