
Why the 30‑Year‑Old robots.txt Is Crumbling in the AI Era

From a 1993 accidental DoS attack that sparked the creation of robots.txt to modern AI crawlers ignoring the protocol, this article traces the history, purpose, and challenges of the robots exclusion standard and explores new proposals to adapt it for AI-driven web scraping.


Origin of robots.txt

In 1993, the author of an early Perl web crawler unintentionally caused a denial-of-service attack on a small site served over a 14.4 Kbps line. Martijn Koster, who later created the AliWeb search engine, asked him to stop the crawler. To keep uncontrolled crawlers from overwhelming servers, Koster proposed the Robots Exclusion Protocol (robots.txt): a crawler fetches the robots.txt file from the site root, parses its directives, and stays away from the listed paths.


How robots.txt works

A typical file:

User-agent: googlebot
Disallow: /private/

The directive tells Googlebot not to crawl any URL under /private/. The protocol relies on voluntary compliance; there is no technical enforcement.
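
In practice, compliance lives entirely in the crawler's own code. As a minimal sketch (example.com and the inline rule list are placeholders for the file above), Python's standard urllib.robotparser module can load such a policy and answer "may I fetch this URL?" before every request:

from urllib import robotparser

# The example policy above, supplied inline; a real crawler would instead
# download https://example.com/robots.txt before requesting anything else.
rules = [
    "User-agent: googlebot",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot must skip /private/, other paths remain crawlable,
# and agents the file never mentions are unaffected.
print(rp.can_fetch("googlebot", "https://example.com/private/report.html"))     # False
print(rp.can_fetch("googlebot", "https://example.com/index.html"))              # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/report.html"))  # True

Nothing forces a crawler to run this check before fetching a page, which is exactly the protocol's weak point.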


Historical success and challenges

For three decades the protocol largely worked because non-compliant bots could be publicly blacklisted, putting their operators' reputation and access at risk. From the late 1990s onward, search engines and sites also exchanged value: sites let crawlers in and gained search exposure, and in return the crawlers respected the rules.

Some aggregators ignored the protocol anyway. Bidder’s Edge, for example, scraped eBay despite eBay’s Disallow rules, rotating through proxy servers to get around IP blocks. eBay sued, and in May 2000 a court issued an injunction barring Bidder’s Edge from any automated data extraction from eBay’s site.

AI‑driven crawling and the protocol’s weakness

Large language models require massive amounts of training data. Companies that obey robots.txt risk falling behind competitors that quietly rotate user-agents or IP addresses to harvest more of it. The result is a two-sided dilemma: for AI companies, strict adherence can mean a competitive disadvantage; for site owners, widespread disregard of the protocol means lost traffic, attribution, and revenue.

Originality.AI (2023) reported that among the top‑1000 sites, 306 block OpenAI’s GPTBot and 85 block Google‑Extended. Major news outlets (BBC, New York Times) and platforms (Medium) also block AI crawlers.
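
For illustration, opting out of both crawlers takes only two extra groups in robots.txt; GPTBot and Google-Extended are the user-agent tokens that OpenAI and Google document for their AI crawlers:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

As with every other directive in the file, these lines bind only crawlers that choose to honour them.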

Proposed extension for the AI era

In 2025 the non‑profit RSL Collective introduced “Really Simple Licensing” (RSL), extending robots.txt with explicit AI directives. Example RSL file:

User-agent: *
Allow: /
# AI performing search/indexing is allowed
AI-Search: allowed
# Disallow using site content to train general models
AI-Training: disallowed
# Summarization is allowed with attribution
AI-Summarization: allowed-with-attribution
# Commercial use requires a license
AI-Commercial: license-required

This turns the protocol from a courtesy into a licensing framework, allowing site owners to specify how AI may interact with their content.
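
There is no off-the-shelf parser for these directives yet, so the following is a hypothetical sketch of how a compliant AI crawler might read them. The directive names come from the example above; parse_ai_policy and the decision logic are assumptions made here purely for illustration:

# Hypothetical sketch: interpret the AI-* directives from the example above.
# Nothing here is a standardized or shipped parser.
AI_DIRECTIVES = {"ai-search", "ai-training", "ai-summarization", "ai-commercial"}

def parse_ai_policy(robots_txt: str) -> dict:
    """Collect AI-* key/value pairs, ignoring comments and unrelated directives."""
    policy = {}
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip().lower()
        if key in AI_DIRECTIVES:
            policy[key] = value
    return policy

example = """
User-agent: *
Allow: /
AI-Search: allowed
AI-Training: disallowed
AI-Summarization: allowed-with-attribution
AI-Commercial: license-required
"""

policy = parse_ai_policy(example)

# A training crawler would check its own use case before collecting content.
if policy.get("ai-training", "allowed") == "disallowed":
    print("Site opts out of model training; exclude it from training corpora.")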


Conclusion

Robots.txt’s reliance on voluntary compliance leaves it vulnerable in the AI era. Whether the major AI players will adopt RSL-style extensions remains an open question, underscoring the need for a modern, enforceable framework that balances open web access with content owners’ rights.

Tags: search engine, Protocol, Web Crawling, AI ethics, robots.txt
Written by Java Tech Enthusiast

Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
