
Researcher Launches Tool Tracking Music in AI Datasets

June 22, 2026 | By LDS Team

Relevance Score: 6.5


AI Watchdog, a searchable tool created by researcher Alex Reisner, has been updated to include four music datasets identified in an investigation by The Atlantic, according to reporting in The Verge and The Quietus. The two largest collections contain 12 million and 9 million tracks respectively, while the two smaller sets each contain over 100,000 tracks, per The Verge. The datasets have been downloaded thousands of times and include millions of copyrighted songs, according to reporting reproduced by Music Ally and The Verge. The Quietus reports that companies including Google and Stability have openly used these datasets in research papers, and quotes an Instagram post in which artist SZA claims that "music AI has trained off 238 of my songs." Industry observers should expect increased scrutiny of dataset provenance and artist recourse as these findings enter public view.

What happened

AI Watchdog, a searchable tool maintained by researcher Alex Reisner, has been updated to show which songs appear in four large music datasets identified in reporting by The Atlantic, as summarised by The Verge and reported by The Quietus. The Verge reports that two of the datasets contain 12 million and 9 million tracks respectively, and that the two smaller datasets each contain over 100,000 tracks. Music Ally and The Verge reproduce The Atlantic's finding that these datasets have been downloaded thousands of times, and The Quietus reports that companies including Google and Stability have been named in research papers as users of some of these collections. The Quietus also quotes an Instagram post from artist SZA claiming "music AI has trained off 238 of my songs."

Technical details

According to reporting from Alex Reisner relayed by The Verge, three of the datasets are distributed as public lists of links to tracks on streaming platforms such as YouTube and Spotify. Developers commonly use automated tools to fetch the audio from those links, a process that can bypass platform mechanisms. The Verge further notes that some source collections are streamable for personal use but require licensing for commercial use.

Editorial analysis - technical context

Datasets composed of link lists create a low-friction path from public streaming pages to model training corpora, which complicates provenance tracking for downstream models and increases legal and copyright risk vectors for data engineers and ML teams.

Context and significance

Reporting reproduced by Music Ally places these revelations amid ongoing litigation and rights disputes in the music-tech sector, noting similar questions at the core of lawsuits against companies such as Suno, where plaintiffs allege large-scale use of commercial recordings for model training. The public visibility of dataset composition via a tool like AI Watchdog raises the likelihood of greater scrutiny from rights holders, regulators, and plaintiffs' lawyers, according to the framing in the assembled coverage.

What to watch

Whether more rights holders cite specific dataset membership in ongoing or new lawsuits, and whether major platform or model developers disclose dataset provenance more fully following this reporting.

Updates to terms of service or enforcement by streaming platforms if automated download tools are shown to be widely used for large-scale scraping, as described in The Verge.

Community responses from artists and creators, following the statements quoted in The Quietus and the broader reporting.

For practitioners

Data scientists and ML engineers working with audio should treat public lists-of-links datasets as high-risk from a provenance and licensing perspective, and should prioritize logging, provenance metadata, and legal review when assembling training corpora.

Teams building detection, watermarking, or dataset-auditing tooling can use searchable indices such as AI Watchdog as an initial signal for artist outreach or internal audit, while recognising that presence in a dataset does not by itself indicate downstream model usage without further traceability.
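As a rough illustration of the provenance logging described above, the following Python sketch records one manifest row per dataset link and marks an entry as usable only once it is both license-cleared and legally reviewed. It is a minimal sketch under stated assumptions, not a description of how AI Watchdog or any named company works: the field names, the license status values ("cleared", "unknown", "restricted"), and the helper names are all hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class ProvenanceRecord:
    """One row of provenance metadata for a link taken from a link-list dataset."""
    source_url: str          # link exactly as listed in the upstream dataset
    dataset_name: str        # hypothetical label for the link-list dataset
    license_status: str      # illustrative values: "cleared", "unknown", "restricted"
    reviewed_by_legal: bool  # True only after a legal review has signed off
    recorded_at: str         # ISO 8601 timestamp in UTC

    def record_id(self) -> str:
        # Stable identifier derived from the dataset name and the link.
        key = f"{self.dataset_name}|{self.source_url}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()[:16]


def make_record(source_url, dataset_name,
                license_status="unknown", reviewed_by_legal=False):
    # Default to the most conservative state: unknown license, no review.
    return ProvenanceRecord(
        source_url=source_url,
        dataset_name=dataset_name,
        license_status=license_status,
        reviewed_by_legal=reviewed_by_legal,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def is_trainable(record):
    # An entry is admitted only when it is cleared AND legally reviewed.
    return record.license_status == "cleared" and record.reviewed_by_legal


def write_manifest(records, path):
    # Write one JSON object per line so the manifest can be appended to and audited.
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            row = asdict(record)
            row["record_id"] = record.record_id()
            row["trainable"] = is_trainable(record)
            fh.write(json.dumps(row) + "\n")
```

Defaulting every new entry to an unknown license and no legal review means a link is excluded unless someone has positively cleared it, which matches the high-risk framing above. A manifest of this kind records what a team chose to ingest; it does not establish whether any given track was used by a downstream model.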

Key Points

1. AI Watchdog updates let artists and researchers scan four public music datasets, revealing two sets of 12M and 9M tracks respectively.
2. Industry pattern: public lists of streaming links lower the technical barrier to creating large audio corpora, raising provenance and licensing risks.
3. Practical impact: searchable indices increase visibility for rights holders and will likely feed legal and platform-enforcement activity.

Scoring Rationale

A searchable tool for artists to audit AI training datasets addresses a concrete copyright transparency gap and will likely feed ongoing litigation against companies like Suno. The score is trimmed from 7.0 to 6.5 because this is a meaningful practical development but primarily a dataset transparency tool update rather than a major technical or regulatory milestone.


Sources

Primary source and supporting public references used for this report.

5 sources

Primary source: thequietus.com, "New Tool Launched to Help Artists Track Use of Their Music by AI Companies"
