News Publishers Block Internet Archive Over AI Scraping

Nieman Lab, Hackaday and others report that a growing number of publishers have blocked the Internet Archive's Wayback Machine from crawling and archiving their sites, citing fears that AI companies could scrape their content. Named blockers include The New York Times, The Guardian and The Atlantic; some, like The Baltimore Banner, say they mainly worry AI chatbots would improperly cite them. By some counts the blocks now affect more than 340 news outlets. The Wayback Machine's director, Mark Graham, called the concerns 'understandable, but unfounded' and the archive 'collateral damage' in the publisher-AI dispute, arguing the risk comes through the Archive's own controllable interfaces, not from crawling. Reporting notes paid vendors like ProQuest and LexisNexis remain allowed, and the EFF warns that blocking the Archive will not stop AI training but will erase part of the web's historical record - a real concern for dataset provenance.
What happened
A growing number of news publishers have blocked the Internet Archive's Wayback Machine from crawling and archiving their sites, citing fears that AI companies could scrape their content. Reporting names blockers including The New York Times, The Guardian and The Atlantic; some outlets, like The Baltimore Banner, say they are mainly worried that AI chatbots would improperly cite their content. By some counts the blocks now affect more than 340 news outlets (Straight Arrow News).
The Archive's pushback
The Wayback Machine's director, Mark Graham, called the publishers' concerns 'understandable, but unfounded,' and described the archive as 'collateral damage' in the dispute between publishers and AI companies (Internet Archive blog, February and May 2026). Graham argues the risk comes through the Archive's own interfaces - which it controls and rate-limits - not from crawling and preserving pages. Nieman Lab's Andrew Deck reported that publishers have acted preemptively, with none citing specific evidence of AI scraping via the Wayback Machine.
Why it matters
Web archiving relies on crawlers obeying robots.txt and site allowlists, so a publisher-side change can immediately halt snapshot collection. For researchers and ML teams, that is a material risk to dataset provenance and reproducibility: corpora and ground-truth snapshots built from public web archives can develop gaps as major domains opt out. Commentators note that paid archival vendors such as ProQuest and LexisNexis remain allowed, which shifts historical access toward licensed, paywalled services and raises costs for independent research. The EFF argues that blocking the Archive will not stop AI training but will erase part of the public record used by researchers, historians and fact-checkers.
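To make the robots.txt mechanism concrete, here is a minimal sketch using Python's standard `urllib.robotparser` to test whether a given robots.txt shuts out an archive crawler. The user-agent tokens `archive.org_bot` and `ia_archiver`, the `news.example` URL and the helper name `blocked_agents` are assumptions for illustration, not details from the reporting; check the tokens against the Internet Archive's own documentation before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler tokens (not confirmed by the article); verify before use.
ARCHIVE_AGENTS = ("archive.org_bot", "ia_archiver")


def blocked_agents(robots_txt, agents=ARCHIVE_AGENTS, url="https://news.example/"):
    """Return the agents that this robots.txt text forbids from fetching `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical publisher policy: block one archive crawler, allow everyone else.
SAMPLE = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(SAMPLE))
```

Passing the file contents as a string keeps the check reproducible: the same saved robots.txt always yields the same answer, which matters when the result is recorded alongside a dataset.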
What to watch
Track robots.txt and allowlist changes for major news domains used in datasets; whether publishers narrow blocks to specific crawlers or keep blanket restrictions; any licensing agreements between publishers and archival or AI companies; and how dataset maintainers document archive-introduced gaps.
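One way to track such changes is to compare two saved copies of a domain's robots.txt and report which crawlers gained or lost access. The sketch below assumes the snapshots are already on hand as strings; the function names, the `ExampleBot` token and the `news.example` URL are hypothetical, and `archive.org_bot` is an assumed token rather than one named in the reporting.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def access_changes(old_txt, new_txt, agents, url="https://news.example/"):
    """Map each agent whose access changed to (allowed_before, allowed_after)."""
    changes = {}
    for agent in agents:
        before = allowed(old_txt, agent, url)
        after = allowed(new_txt, agent, url)
        if before != after:
            changes[agent] = (before, after)
    return changes


# Hypothetical snapshots of one domain's robots.txt at two points in time.
OLD = "User-agent: *\nAllow: /\n"
NEW = "User-agent: archive.org_bot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(access_changes(OLD, NEW, ["archive.org_bot", "ExampleBot"]))
```

Logging the returned pairs with a date gives dataset maintainers a simple record of when a domain's archive coverage is likely to have stopped.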
Key Points
- A growing number of publishers (The New York Times, The Guardian, The Atlantic and others) block the Wayback Machine over AI-scraping fears; by some counts 340+ outlets are now affected.
- Wayback Machine director Mark Graham calls the concerns 'understandable, but unfounded' and the archive 'collateral damage'; Nieman Lab reports that no publisher has cited specific evidence of AI scraping via the Wayback Machine.
- For ML/data practitioners: publisher robots.txt changes are a material risk to dataset provenance and reproducibility, and they push historical access toward paid vendors like ProQuest and LexisNexis.
Scoring Rationale
Substantive, heavily sourced policy story directly relevant to ML/data practitioners: publishers (The New York Times, The Guardian, The Atlantic and, by some counts, 340+ outlets) blocking the Wayback Machine over AI-scraping fears threatens dataset provenance and reproducibility, with paid vendors remaining as the paywalled alternative. Notable and well corroborated, though it is an evolving, aggregated situation rather than a single hard event, so the score was trimmed from 6.9 to 6.6.
Sources
Public references used for this report.
- News Sites Are Blocking Internet Archive Over AI Scraping Fears (hackaday.com)
- The Internet's Most Powerful Archiving Tool Is in Peril (wired.com)
- More than 340 newspapers block the Wayback Machine. What that means for the future of the internet (san.com)
- News sites are blocking access to Internet Archive's Wayback Machine (marketplace.org)
- Blocking the Internet Archive Won't Stop AI, But It Will Erase the ... (eff.org)
- The Internet Archive is in danger - The Week (theweek.com)
- Tell New York Times, The Atlantic, and USA Today to keep the crucial work of journalists in the Wayback Machine! (savethearchive.com)
