Cloudflare Blocks Mixed-Use Crawlers on Monetized Pages

July 1, 2026 | By LDS Team

Relevance Score: 7.2

Cloudflare will block mixed-use crawlers (bots that gather data for both search indexing and AI training) from ad-supported customer pages by default starting September 15, 2026, per Cloudflare's blog and reporting by The Register and CJR. For practitioners, the change shifts web-crawl access from implicit to explicit and raises the operational cost of sourcing web-scale training data, since pipelines built on bulk crawling will need opt-in checks, credentials, or paid access. Cloudflare, which handles roughly 20 percent of web traffic per CJR, is also piloting a "pay-per-crawl" feature that lets publishers charge AI companies for access. Cloudflare cites 2025 crawl-to-referral ratios (Google 14:1, OpenAI 1,700:1, Anthropic 73,000:1) to argue that the historic crawl-for-traffic bargain has broken down for AI crawlers.

Cloudflare's move to change default crawler access matters for practitioners because it directly affects how reliable, lawful, and affordable web-scale training data will be in the near term. Machine-learning teams that source large quantities of HTML for pretraining or retrieval-augmented generation should treat crawl access and contractual terms as an explicit line item in data pipelines.
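One way to make crawl access an explicit line item is to carry it in the same budget as compute and storage. The sketch below is illustrative only: the `CrawlBudget` class, the gated fraction, and the per-crawl price are hypothetical placeholders, not figures published by Cloudflare or any publisher.

```python
from dataclasses import dataclass


@dataclass
class CrawlBudget:
    """Hypothetical cost inputs; none of these figures come from Cloudflare."""

    pages: int                  # pages the pipeline plans to fetch
    gated_fraction: float       # assumed share of pages behind a paid gate
    price_per_crawl_usd: float  # assumed per-request fee on gated pages
    compute_storage_usd: float  # existing compute and storage estimate

    def access_cost_usd(self) -> float:
        # Only gated pages incur a per-crawl charge in this simple model.
        return self.pages * self.gated_fraction * self.price_per_crawl_usd

    def total_usd(self) -> float:
        # Access fees are added on top of the existing cost estimate.
        return self.compute_storage_usd + self.access_cost_usd()


# Placeholder inputs chosen only to exercise the arithmetic.
budget = CrawlBudget(
    pages=10_000_000,
    gated_fraction=0.25,
    price_per_crawl_usd=0.001,
    compute_storage_usd=50_000.0,
)
print(f"access: ${budget.access_cost_usd():,.2f}")
print(f"total:  ${budget.total_usd():,.2f}")
```

The point of the model is structural rather than numeric: once access has a price, the gated fraction and the per-crawl fee become inputs that a team has to estimate and revisit, alongside compute and storage.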

What happened

In its blog post, Cloudflare said it will provide managed robots.txt controls and an option to prevent crawlers from accessing portions of sites that are monetized through ads. The Register (July 1, 2026) reports that Cloudflare will soon default to blocking "mixed-use" crawlers from ad-supported customer sites, with the rollout beginning September 15, 2026 for new customers and for new sites on existing accounts. In practice, crawlers that gather data for both search indexing and AI training will be blocked from monetized pages by default. CJR reports that Cloudflare manages traffic for about 20 percent of the web. Multiple outlets report Cloudflare is piloting a "pay-per-crawl" feature that would allow site owners to charge AI companies for crawl access. Cloudflare's earlier July 2025 analysis put crawl-to-referral ratios at approximately 14:1 for Google, 1,700:1 for OpenAI, and 73,000:1 for Anthropic. Those figures were used to justify the policy direction, and the 2026 announcement extends the default block to ad-supported pages. Cloudflare co-founder and CEO Matthew Prince said, per The Register: "Now that the majority of traffic on the Internet is non-human, we must go further and act faster so that a sustainable ecosystem can emerge."
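To put the crawl-to-referral ratios on a common scale, the short calculation below converts each one into the referrals a site could expect per one million crawled pages. The ratios are the ones reported above; the one-million crawl volume and the `referrals_per` helper are illustrative assumptions.

```python
# Crawl-to-referral ratios as reported in the article (crawls per referral).
ratios = {"Google": 14, "OpenAI": 1_700, "Anthropic": 73_000}

CRAWLS = 1_000_000  # illustrative crawl volume, not a reported figure


def referrals_per(crawls: int, ratio: int) -> float:
    """Expected referrals returned for a given number of crawled pages."""
    return crawls / ratio


for name, ratio in ratios.items():
    referrals = referrals_per(CRAWLS, ratio)
    print(f"{name}: {referrals:,.1f} referrals per {CRAWLS:,} crawls")
```

At these ratios, one million crawls correspond to roughly 71,000 referrals at 14:1 but only about 14 at 73,000:1, a gap of more than three orders of magnitude.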

Technical context

From a data engineering perspective, this is an operational inflection point for web-derived corpora. Practitioners assembling training or evaluation datasets from public web pages commonly rely on bulk crawling and subsequent deduplication, license filtering, and provenance tagging. When a large infrastructure provider changes default access rules, those same pipelines must incorporate explicit opt-in checks, crawl credentials, or paid access workflows, raising the operational cost of negotiating access, tracking provenance, and proving compliance with site-level restrictions. The emergence of pay-per-crawl as an offered mechanism makes the commercial cost of web harvesting explicit: teams that currently estimate training data costs as primarily compute and storage now have licensing fees or per-crawl charges as a new variable. Clearer crawler intent labels (search vs. training vs. inference) may improve auditability, though that depends on crawler operators honoring declared intents.
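As a minimal sketch of an explicit opt-in check with provenance tagging, the example below evaluates robots.txt rules using Python's standard `urllib.robotparser` and records each decision. The robots.txt content, the bot names, and the `ProvenanceRecord` fields are hypothetical. This covers only declared robots.txt rules; it does not reproduce Cloudflare's network-level blocking or any pay-per-crawl workflow.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve; the bot names are made up.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /premium/
"""


@dataclass
class ProvenanceRecord:
    url: str
    user_agent: str
    allowed: bool
    checked_at: str  # ISO 8601 timestamp of the robots.txt decision


def check_access(robots_txt: str, user_agent: str, url: str) -> ProvenanceRecord:
    """Evaluate robots.txt rules locally and record the decision."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network call
    return ProvenanceRecord(
        url=url,
        user_agent=user_agent,
        allowed=parser.can_fetch(user_agent, url),
        checked_at=datetime.now(timezone.utc).isoformat(),
    )


# The training bot is disallowed everywhere; other bots only under /premium/.
for agent, url in [
    ("ExampleTrainingBot", "https://example.com/news/story"),
    ("ExampleSearchBot", "https://example.com/news/story"),
    ("ExampleSearchBot", "https://example.com/premium/report"),
]:
    record = check_access(ROBOTS_TXT, agent, url)
    print(record.user_agent, record.url, "allowed" if record.allowed else "blocked")
```

Storing the decision next to each fetched document gives a pipeline something to point to later when it needs to show that site-level restrictions were checked at crawl time.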

Industry context

Reporting by CJR highlights publisher support for Cloudflare's change, naming organizations including The Associated Press, Time, The Atlantic, and Reddit as participants in publisher advocacy. Coverage frames Cloudflare as the first major infrastructure provider to change the default; Transparency Coalition coverage emphasizes the potential for this to curb unconsented scraping at scale. For model builders who depend on broad web snapshots, the policy increases legal and reputational risk if crawls ignore site-level controls, and creates a visible negotiation point for data access and compensation.

What to watch

Watch for adoption metrics (how many Cloudflare customers enable the default block), crawler operator compliance with the new gate, the rollout of pay-per-crawl pricing and enforcement, and whether alternative data suppliers or licensed content marketplaces gain momentum. Also watch for updated crawl-to-referral figures and any public agreements between publishers and major AI companies that clarify permitted uses.

Key Points

1. Cloudflare's default block and managed robots.txt shift web-crawl access from implicit to explicit, raising operational costs for dataset builders.
2. Published crawl-to-referral ratios (Google 14:1; OpenAI 1,700:1; Anthropic 73,000:1) illustrate how AI crawlers return far less referral traffic than search engines.
3. Pay-per-crawl pilots convert previously invisible bot traffic into a potential revenue stream, creating new negotiation points between publishers and AI firms.

Scoring Rationale

Cloudflare's move to block mixed-use crawlers from ad-supported pages by default directly affects web-scale data collection for AI training, raising pipeline costs and compliance complexity for ML teams. Cloudflare's 20-percent web coverage makes this operationally significant for anyone building training or RAG corpora from public web pages. Notable infrastructure policy shift; less than industry-shaking because the trend has been signposted since the July 2025 Content Independence Day announcement.

Sources

Primary source and supporting public references used for this report.

Primary source: theregister.com, "Cloudflare to block cynical search-and-scrape bots from ad-supported web pages"
