Why the Humble robots.txt Is Facing an Existential Crisis in the AI Era
The article recounts a personal experiment that unintentionally launched a DoS attack, explains how that incident spurred the creation of the robots.txt protocol, and examines how AI‑driven data scraping, legal battles, and new licensing proposals are challenging its relevance today.
In the early 1990s the author began receiving a daily "what's new on the web" newsletter and, out of boredom, taught himself Perl to write a web crawler for tasks such as site indexing and dead‑link checking.
Testing the crawler on a tiny company’s website—hosted on a 14.4 Kbps line—accidentally generated a denial‑of‑service attack, prompting the site owner, Martijn Koster, to demand an immediate stop.
Koster, who later invented AliWeb, the world's first web search engine, responded by proposing a voluntary standard called the Robots Exclusion Protocol (robots.txt). The idea was simple: before a crawler accesses a new site, it must first fetch the site's robots.txt file and obey the disallowed paths listed there.
How robots.txt works
A typical file looks like this:
User-agent: googlebot
Disallow: /private/

This tells the Googlebot crawler not to fetch any resources under /private/. Compliance is voluntary, like a sign saying "No entry", and it relies on the goodwill of crawler developers.
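How a crawler is expected to honour these rules can be shown in a few lines of Python. The sketch below uses the standard library's urllib.robotparser; example.com and the paths are placeholders, not anything from the original incident:

from urllib.robotparser import RobotFileParser

# The two rules from the example above, fed straight to the parser.
rules = [
    "User-agent: googlebot",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A polite crawler asks before every fetch. In production the file would be
# downloaded first from https://<site>/robots.txt (for example via
# parser.set_url(...) and parser.read()) before crawling anything else.
for url in ("https://example.com/private/report.html",
            "https://example.com/index.html"):
    verdict = "allowed" if parser.can_fetch("googlebot", url) else "disallowed"
    print(url, "->", verdict)

The first URL falls under /private/ and is refused; the second is permitted, exactly as the two-line file dictates.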
Three decades of success
For about 30 years the protocol functioned well because search engines needed content to index and sites needed the visibility and traffic that indexing brought. The mutual benefit kept both parties honest, and the protocol became a de facto standard without any formal committee.
Emerging challenges
With the explosive growth of the web in the late 1990s, many sites became invisible without search engine indexing. At the same time, aggressive aggregators such as Bidder’s Edge began scraping content for profit, often bypassing IP blocks with proxies. In 2000 a U.S. court ruled that such scraping constituted illegal intrusion.
More recently, AI companies have turned web content into training data. Non‑profit archives like the Internet Archive ignore robots.txt in order to preserve the web, while commercial AI crawlers (e.g., OpenAI’s GPTBot) publicly pledge to respect the file, though only after the massive models behind them had already been trained.
Statistics from Originality.AI in 2023 show that among the top‑1000 sites, 306 block OpenAI’s GPTBot and 85 block Google‑Extended, illustrating a growing backlash.
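For a site owner who wants to join that group, the opt-out is just two more rule blocks in the same file; GPTBot and Google-Extended are the user-agent tokens OpenAI and Google document for their AI crawlers:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /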
Why the old model no longer fits
Robots.txt was built on three assumptions: good faith, search‑driven traffic, and reciprocal benefit. AI changes all three: models can consume data without returning traffic, and they can monetize the knowledge without any direct link back to the source.
Proposed evolution: Really Simple Licensing (RSL)
In 2025 the non‑profit RSL Collective introduced an upgraded syntax that lets site owners specify granular AI permissions:
User-agent: *
Allow: /
# AI can perform search/indexing
AI-Search: allowed
# Disallow training large models
AI-Training: disallowed
# Summarization allowed with attribution
AI-Summarization: allowed-with-attribution
# Commercial use requires a license
AI-Commercial: license-required

This turns robots.txt from a courtesy guide into a licensing statement, allowing owners to grant or deny AI‑specific actions while still permitting traditional search crawlers.
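No mainstream parser understands these directives yet, so a crawler that wanted to honour them would have to read them itself. The Python sketch below is purely illustrative: it assumes the directive names shown above, and the published RSL specification may define a different syntax.

# Collect AI-specific directives from a robots.txt-style document.
# The field names (AI-Search, AI-Training, ...) follow the article's example.
def parse_ai_policy(robots_txt: str) -> dict:
    policy = {}
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower().startswith("ai-"):
            policy[field] = value
    return policy

example = """\
User-agent: *
Allow: /
AI-Search: allowed
AI-Training: disallowed
AI-Summarization: allowed-with-attribution
AI-Commercial: license-required
"""

policy = parse_ai_policy(example)
if policy.get("AI-Training", "").lower() == "disallowed":
    print("Skip this site when building a training corpus.")

A crawler operator could run such a check alongside the ordinary robots.txt rules, treating any missing AI-* directive as "ask the site owner" rather than as permission.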
Outlook
Whether major AI players will honor RSL remains uncertain, but the shift highlights that the simple “sign‑post” model is insufficient for the data‑intensive AI landscape. Site owners must now consider explicit licensing to protect their content and revenue streams.