Databricks Faces Authors' Copyright Claims Over DBRX

Federal judge Charles Breyer denied the defendants' motion to dismiss a class action alleging that Databricks' DBRX series of LLMs was trained on copyrighted books drawn from shadow-library datasets, according to reporting by The Register and legal coverage in VitalLaw and Mealeys. The named plaintiffs include authors Stewart O'Nan, Abdi Nazemian, Brian Keene, Rebecca Makkai, and Jason Reynolds, per Publishers Marketplace. The suit traces the training data to the RedPajama/Books3 corpus and alleges that roughly 196,000 titles were implicated, according to The Register and the Saveri Law Firm. The claims survived the motion to dismiss after plaintiffs filed an amended complaint tying the alleged copying to the DBRX models, according to court filings summarized by VitalLaw.
What happened
According to reporting in The Register and legal summaries at VitalLaw and Mealeys, U.S. District Judge Charles Breyer denied a motion to dismiss direct copyright claims brought by a group of authors against Databricks and MosaicML (case No. 3:24-cv-01451-CRB), allowing the claims tied to the DBRX series of large language models to proceed. Per Publishers Marketplace, the named plaintiffs include Stewart O'Nan, Abdi Nazemian, Brian Keene, Rebecca Makkai, and Jason Reynolds. The complaints allege that model training incorporated material from so-called "shadow libraries," and specifically from the RedPajama dataset component known as Books3, which multiple sources report was removed from Hugging Face in October 2023 amid copyright concerns.
Technical details
Per the Saveri Law Firm materials and the plaintiffs' amended complaint summarized in VitalLaw, the authors allege MosaicML used the RedPajama-Books (Books3) corpus when training early MPT models such as MPT-7B and MPT-30B, and they note that Databricks acquired MosaicML in July 2023. The Register reports that early versions of DBRX were developed by the Mosaic team and that plaintiffs contend copying occurred during early development and training stages. The Saveri summary states that Books3 derives from a copy of the Bibliotik collection, which the complaint describes as a shadow-library source of unlicensed copyrighted works.
Editorial analysis - technical context
Industry-pattern observations: Litigation over training data commonly focuses on whether copyrighted text can be traced into model training and whether model outputs constitute copying under existing copyright law. Courts assessing similar claims typically examine pleadings, deposition testimony, and discovery about dataset provenance, preprocessing, and training steps to determine whether plaintiffs have plausibly alleged direct infringement. The current decision to let infringement claims tied to DBRX proceed aligns with other recent rulings that require fact development before dispositive dismissal on complex machine-learning training chains.
Context and significance
Industry context
The court's decision does not resolve liability; it permits discovery that could surface factual links between specific copyrighted works and model weights or outputs. The Register quotes Judge Breyer: "They directly tie their infringed works to DBRX, and the employee statements provide supporting inferences when read in context, particularly when viewed alongside other more direct statements." That phrasing indicates the judge found the amended complaint sufficient at the pleading stage. Observers and counsel cited in reporting note that the court wants additional factual development before concluding whether defendants engaged in infringing copying, per The Register and Mealeys.
For practitioners
Industry-pattern observations: Machine-learning teams and legal counsel working with third-party datasets should expect heightened scrutiny of dataset provenance and retention of evidence about data sources and preprocessing. Recent filings and the court's order illustrate that discovery over dataset lineage, ingestion logs, and model-development histories can become central in copyright litigation involving LLMs.
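As one illustration of what retaining evidence about data sources can look like in practice, the sketch below appends a provenance record to a manifest each time a dataset file is ingested. It is a minimal example under stated assumptions, not a description of any party's actual pipeline: the `ProvenanceRecord` fields, the `record_provenance` helper, and the JSON Lines manifest format are all hypothetical choices made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class ProvenanceRecord:
    """One manifest entry per ingested file (hypothetical schema)."""
    path: str
    sha256: str
    size_bytes: int
    source: str        # where the file was obtained, e.g. a URL or vendor name
    license_name: str  # license as understood at ingestion time
    recorded_at: str   # UTC timestamp in ISO 8601 format


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dataset shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_provenance(path, source, license_name, manifest_path):
    """Append a provenance record for `path` to a JSON Lines manifest."""
    path = Path(path)
    record = ProvenanceRecord(
        path=str(path),
        sha256=sha256_of_file(path),
        size_bytes=path.stat().st_size,
        source=source,
        license_name=license_name,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
    # Append-only writes keep earlier entries intact as the dataset evolves.
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        manifest.write(json.dumps(asdict(record)) + "\n")
    return record
```

A team might call `record_provenance("shard-0001.jsonl", "vendor-x", "CC-BY-4.0", "manifest.jsonl")` at ingestion time (all argument values here are placeholders). The content hash ties each manifest entry to the exact bytes that were ingested, which is the kind of lineage detail that discovery requests tend to ask about.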
What to watch
Reporting by VitalLaw and The Register indicates the case will move into discovery, where plaintiffs and defendants can seek depositions and internal documents; The Register notes that Databricks has produced depositions and documents, but the judge has asked for more information. Observers will watch whether discovery uncovers explicit links between Books3 content and model artifacts that can be produced in discovery, whether defendants succeed on later dispositive motions, and whether any rulings set precedents on when model training on copyrighted corpora crosses the line into direct infringement.
Additional reported procedural detail
Multiple legal reports note that defendants earlier sought dismissal on the ground that plaintiffs did not sufficiently tie the alleged copying to the finalized DBRX models; the court allowed plaintiffs to amend and then denied the motion to dismiss the amended claim, per VitalLaw. The Saveri Law Firm summary provides background on MosaicML's public disclosures about training data for MPT-series models and on how Books3 and RedPajama are alleged to have been used.
Scoring Rationale
The ruling is a notable legal development that keeps direct copyright claims against major AI model developers alive, increasing discovery-driven risk around dataset provenance. This matters to ML teams and legal counsel but is not yet a precedent-setting final judgment.
