Researcher Launches Tool Tracking Music in AI Datasets

AI Watchdog, a searchable tool created by researcher Alex Reisner, has been updated to include four music datasets identified in an investigation by The Atlantic, according to reporting in The Verge and The Quietus. The two largest collections contain 12 million and 9 million tracks respectively, while the two smaller sets each contain over 100,000 tracks, per The Verge. The datasets have been downloaded thousands of times and include millions of copyrighted songs, according to reporting reproduced by Music Ally and The Verge. The Quietus reports that companies including Google and Stability have openly used these datasets in research papers, and quotes an Instagram post in which artist SZA says that "music AI has trained off 238 of my songs." Editorial analysis: Industry observers should expect increased scrutiny of dataset provenance and artist recourse as these findings enter public view.
What happened
AI Watchdog, a searchable tool maintained by researcher Alex Reisner, has been updated to show which songs appear in four large music datasets identified in reporting by The Atlantic, as summarised by The Verge and reported by The Quietus. The Verge reports that two of the datasets contain 12 million and 9 million tracks respectively, and that the two smaller datasets each contain over 100,000 tracks. Music Ally and The Verge reproduce The Atlantic's finding that these datasets have been downloaded thousands of times, and The Quietus reports that companies including Google and Stability have been named in research papers as users of some of these collections. The Quietus also quotes an Instagram post in which artist SZA claims that "music AI has trained off 238 of my songs."
Technical details
The Verge includes reporting from Alex Reisner describing how three of the datasets are distributed as public lists of links to tracks on streaming platforms such as YouTube and Spotify, and how developers commonly use automated tools to fetch the audio, a process that can bypass platform mechanisms. The Verge further notes that some source collections are streamable for personal use but require licensing for commercial use.
Editorial analysis - technical context: Datasets composed of link lists create a low-friction path from public streaming pages to model training corpora, which complicates provenance tracking for downstream models and increases legal and copyright risk vectors for data engineers and ML teams.
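Editorial analysis - illustrative sketch: One low-cost first step in auditing a links-only dataset is to tally which hosts its entries point at, without fetching any audio. The snippet below is a minimal sketch under stated assumptions: the function name `summarize_link_sources` and the sample URLs are hypothetical placeholders, not entries from any dataset named in the reporting.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize_link_sources(urls):
    """Count how many entries in a link list point at each host.

    Only the URL strings are inspected; nothing is downloaded.
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url.strip()).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same host.
        if host.startswith("www."):
            host = host[4:]
        # Entries without a recognisable host are grouped together.
        counts[host or "(unparseable)"] += 1
    return counts


# Hypothetical sample entries, for illustration only.
sample_links = [
    "https://www.youtube.com/watch?v=EXAMPLE_ID_1",
    "https://open.spotify.com/track/EXAMPLE_ID_2",
    "https://www.youtube.com/watch?v=EXAMPLE_ID_3",
    "not a url",
]
source_counts = summarize_link_sources(sample_links)
```

A tally like this says where a corpus points, not whether any given use of it is licensed; it is a starting point for provenance review rather than a conclusion.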
Context and significance
Reporting reproduced by Music Ally places these revelations amid ongoing litigation and rights disputes in the music-tech sector, noting similar questions at the core of lawsuits against companies such as Suno, where plaintiffs allege large-scale use of commercial recordings for model training. The public visibility of dataset composition via a tool like AI Watchdog raises the likelihood of greater scrutiny from rights holders, regulators, and plaintiffs' lawyers, according to the framing in the assembled coverage.
What to watch
Three things to watch. First, whether more rights holders cite specific dataset membership in ongoing or new lawsuits, and whether major platform or model developers disclose dataset provenance more fully following this reporting. Second, whether streaming platforms update their terms of service or step up enforcement if automated download tools are shown to be widely used for large-scale scraping, as described in The Verge. Third, how artists and creators respond, following the vocal statements quoted in The Quietus and the broader reporting.
For practitioners: Data scientists and ML engineers working with audio should treat public lists-of-links datasets as high-risk from a provenance and licensing perspective, and should prioritize logging, provenance metadata, and legal review processes when assembling training corpora. Teams building detection, watermarking, or dataset-auditing tooling can use searchable indices such as AI Watchdog as an initial signal for artist outreach or internal audit, while recognising that presence in a dataset does not by itself indicate downstream model usage without further traceability.
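For practitioners, an illustrative sketch: the logging and provenance-metadata advice above could take the form of one record per audio item, written at ingestion time. Everything here is an assumption made for illustration: the field names, the `license_status` vocabulary, and the helpers `make_record`, `needs_legal_review`, and `to_jsonl_line` are hypothetical and do not come from the reporting or from any particular tool.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    source_url: str       # where the item was listed
    dataset_name: str     # which link list or collection referenced it
    license_status: str   # assumed vocabulary: "unknown", "licensed", "public-domain"
    sha256: str           # content hash, so the exact bytes can be traced later
    recorded_at: str      # UTC timestamp in ISO 8601 format


def make_record(source_url, dataset_name, audio_bytes, license_status="unknown"):
    """Build a provenance record for one audio item."""
    return ProvenanceRecord(
        source_url=source_url,
        dataset_name=dataset_name,
        license_status=license_status,
        sha256=hashlib.sha256(audio_bytes).hexdigest(),
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def needs_legal_review(record):
    """Flag any item whose licence has not been positively established."""
    return record.license_status not in {"licensed", "public-domain"}


def to_jsonl_line(record):
    """Serialise a record as one line of an append-only JSON Lines log."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical usage with placeholder values.
record = make_record("https://example.com/track/1", "hypothetical-link-list", b"abc")
log_line = to_jsonl_line(record)
```

Defaulting `license_status` to "unknown" means an item is routed to review unless someone has recorded a licence for it, which matches the high-risk framing above.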
Scoring Rationale
A searchable tool for artists to audit AI training datasets addresses a concrete copyright transparency gap and will likely feed ongoing litigation against companies like Suno. The score is trimmed from 7.0 to 6.5: it is a meaningful practical development, but primarily a dataset transparency tool update rather than a major technical or regulatory milestone.
