Authors sue Anthropic seeking more than $75M
More than 100 authors and rights holders sued Anthropic in the Northern District of California on June 17, 2026, seeking up to $150,000 per work over the alleged copying and BitTorrent distribution of copyrighted books in connection with Claude training. The complaint and plaintiff counsel frame the case as an opt-out challenge following the earlier Bartz settlement, while the New York Post reported the damages demand as more than $75 million. For ML teams, the practical issue is not only fair-use doctrine; it is whether records of dataset acquisition, retention, and redistribution can survive discovery. Treat book, web, and media corpora as governed assets with provable licenses, chain-of-custody logs, and deletion controls.
The practitioner lesson is that copyright risk around model training is becoming an evidence-management problem. Courts and litigants are now scrutinizing how datasets were acquired, where copies were stored, whether files were redistributed, and whether a company can prove which works entered a training or retention pipeline. That makes provenance controls part of ML infrastructure, not a legal afterthought.
What happened
On June 17, 2026, more than 100 authors and rights holders filed Shakespeare et al. v. Anthropic in the US District Court for the Northern District of California. The complaint alleges that Anthropic used BitTorrent to download works from Library Genesis and Pirate Library Mirror, stored books in a central library, and uploaded copies to other BitTorrent users during the process. The plaintiffs include Nolan Bushnell, Laura Esquivel, Tiffany Aliche, Donna Barba Higuera, and other writers or rights holders listed in the filing.
Policy context
The New York Post reported the damages demand as more than $75 million, based on plaintiffs seeking statutory damages of up to $150,000 per work. Plaintiff counsel describes the case as an opt-out action following the earlier Bartz v. Anthropic settlement. Prior reporting on Bartz centered on a split legal question: courts have treated the use of books for training differently from the allegedly pirated acquisition of those books, so the factual record around acquisition, storage, and redistribution matters.
For practitioners
Model builders should assume that large text corpora may need auditable chain-of-custody evidence. The practical controls are mundane but important: source-level license records, crawler and torrent exclusion policies, corpus manifests, retention rules, deletion logs, and review checkpoints before data moves into training or evaluation stores. If a corpus is rebuilt, deduplicated, or sampled, teams should preserve enough metadata to explain what changed and why.
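As a rough illustration of the controls listed above, the sketch below shows one way a corpus manifest, a pre-training review checkpoint, and a tamper-evident deletion log might fit together. It is a minimal sketch under stated assumptions, not a description of any company's pipeline: the names ManifestEntry, review_gate, and CustodyLog, the field names, and the sample paths and license IDs are all hypothetical, and a production log would also record timestamps and operator identity.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Optional


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, used as a stable content identifier."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ManifestEntry:
    """One file in a corpus manifest (field names are hypothetical)."""
    path: str
    sha256: str
    source: str                 # where the file was obtained
    license_id: Optional[str]   # pointer to a license record, if one exists


def make_entry(path: str, content: bytes, source: str,
               license_id: Optional[str] = None) -> ManifestEntry:
    return ManifestEntry(path, sha256_hex(content), source, license_id)


def review_gate(manifest: List[ManifestEntry]) -> List[ManifestEntry]:
    """Return entries with no license record; these should block promotion."""
    return [entry for entry in manifest if not entry.license_id]


class CustodyLog:
    """Append-only log in which each record hashes the previous record."""
    FIELDS = ("action", "sha256", "reason", "prev")

    def __init__(self) -> None:
        self.records: List[dict] = []

    def append(self, action: str, sha256: str, reason: str) -> None:
        prev = self.records[-1]["record_hash"] if self.records else "0" * 64
        body = {"action": action, "sha256": sha256, "reason": reason, "prev": prev}
        digest = sha256_hex(json.dumps(body, sort_keys=True).encode())
        self.records.append({**body, "record_hash": digest})

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered record breaks it."""
        prev = "0" * 64
        for record in self.records:
            body = {key: record[key] for key in self.FIELDS}
            digest = sha256_hex(json.dumps(body, sort_keys=True).encode())
            if record["prev"] != prev or digest != record["record_hash"]:
                return False
            prev = record["record_hash"]
        return True


# Usage with made-up sample data: one licensed file, one of unknown origin.
manifest = [
    make_entry("books/a.txt", b"licensed text", "publisher-feed", "LIC-001"),
    make_entry("books/b.txt", b"unknown text", "unknown"),
]
blocked = review_gate(manifest)
log = CustodyLog()
for entry in blocked:
    log.append("delete", entry.sha256, "no license record")
```

Keying records by content hash rather than by file name means the manifest still identifies a work after a corpus is rebuilt, deduplicated, or sampled, and chaining the log records makes after-the-fact edits detectable.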
What to watch
The next signal is whether the defendants challenge the complaint on pleadings, settlement scope, or fair-use grounds, and whether the court treats the alleged BitTorrent distribution theory separately from training-use arguments. For AI teams, the operational question is whether copyright compliance moves from policy documents into required dataset observability and data-governance tooling.
Key Points
- The new Anthropic complaint shifts attention from model outputs to dataset acquisition, retention, and alleged BitTorrent distribution.
- Statutory damages claims make copyright provenance a board-level risk, even before courts resolve fair-use questions.
- ML teams should preserve license records, crawler logs, corpus snapshots, and deletion evidence for every high-value training dataset.
Scoring Rationale
A new multi-plaintiff copyright complaint against Anthropic is notable because it raises concrete data-provenance, storage, and alleged redistribution issues for model builders. The case is not industry-shaking by itself, but it reinforces a major compliance risk around training corpora and justifies a modest lift from the prior score.
