Why the 30‑Year‑Old robots.txt Is Crumbling in the AI Era
From a 1993 accidental DoS attack that sparked the creation of robots.txt to modern AI crawlers ignoring the protocol, this article traces the history, purpose, and challenges of the robots exclusion standard and explores new proposals to adapt it for AI-driven web scraping.
Origin of robots.txt
In 1993, the author of an early Perl web crawler unintentionally caused a denial‑of‑service attack on a small site connected over a 14.4 kbps line. Martijn Koster, who later created the Aliweb search engine, asked the crawler's author to stop. To prevent uncontrolled crawlers from overwhelming servers, Koster proposed the Robots Exclusion Protocol (robots.txt): a crawler fetches the robots.txt file from the site root, parses its directives, and avoids the listed paths.
How robots.txt works
A typical file:

```
User-agent: googlebot
Disallow: /private/
```

The `Disallow` directive tells Googlebot not to crawl any URL under `/private/`. The protocol relies on voluntary compliance; there is no technical enforcement.
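This fetch-and-check flow is exactly what Python's standard-library `urllib.robotparser` implements. A minimal sketch (parsing the rules directly instead of fetching them, so the example runs offline; the example.com URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In a real crawler: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we feed the rules in directly to keep the example self-contained.
rp.parse([
    "User-agent: googlebot",
    "Disallow: /private/",
])

# Googlebot is barred from /private/ but free elsewhere.
print(rp.can_fetch("googlebot", "https://example.com/private/data.html"))  # False
print(rp.can_fetch("googlebot", "https://example.com/public/page.html"))   # True

# The rule names googlebot only, so other agents are unaffected.
print(rp.can_fetch("otherbot", "https://example.com/private/data.html"))   # True
```

Note that `can_fetch` only reports what the file *asks*; nothing stops a crawler from ignoring the answer, which is the protocol's central weakness.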
Historical success and challenges
For three decades the protocol functioned because non‑compliant bots were publicly blacklisted, risking reputation and traffic loss. In the late 1990s, search engines and sites exchanged value: sites allowed crawlers, gaining exposure; crawlers respected the rules.
Some aggregators ignored the protocol. Example: Bidder's Edge scraped eBay despite eBay's Disallow rules, using proxy servers to bypass IP blocks. eBay sued, and in May 2000 a court granted an injunction barring Bidder's Edge from any automated data extraction from eBay's site.
AI‑driven crawling and the protocol’s weakness
Large language models require massive training data. Companies that obey robots.txt may fall behind competitors that stealthily change user‑agents or IPs to harvest more data. This creates a dilemma: strict adherence can hurt competitiveness, while ignoring the protocol can lead to loss of traffic, attribution, and revenue.
Originality.AI (2023) reported that among the top‑1000 sites, 306 block OpenAI’s GPTBot and 85 block Google‑Extended. Major news outlets (BBC, New York Times) and platforms (Medium) also block AI crawlers.
Proposed extension for the AI era
In 2025 the non‑profit RSL Collective introduced “Really Simple Licensing” (RSL), extending robots.txt with explicit AI directives. Example RSL file:
```
User-agent: *
Allow: /

# AI performing search/indexing is allowed
AI-Search: allowed

# Disallow using site content to train general models
AI-Training: disallowed

# Summarization is allowed with attribution
AI-Summarization: allowed-with-attribution

# Commercial use requires a license
AI-Commercial: license-required
```

This turns the protocol from a courtesy into a licensing framework, allowing site owners to specify how AI systems may interact with their content.
Conclusion
The reliance on voluntary compliance makes robots.txt vulnerable in the AI era. Whether major players will adopt RSL extensions remains open, highlighting the need for a modern, enforceable framework that balances open web access with content owners’ rights.