How to Defend Your Website Against Web Crawlers: Techniques & Tools

This article explores why web content needs protection, explains common server‑side and client‑side anti‑crawling methods—including User‑Agent checks, token cookies, headless‑browser detection, fingerprinting, captchas, and robots.txt—and offers practical guidance for raising the cost of unauthorized scraping.

MaGe Linux Operations

Dec 5, 2017

The Web is an open platform, and that openness has driven its rapid growth since the early 1990s. The same openness makes content vulnerable to low-cost, low-skill crawling programs that can copy copyrighted material in bulk.

To protect original web content, anti‑crawling measures are essential.

Crawling from an attack-and-defense perspective

The simplest crawler sends an HTTP GET request to a page URL and receives the complete HTML in the response, a so-called “synchronous page.” A server can inspect the User-Agent header to decide whether to serve the real content, but a crawler can easily spoof this header, along with Referer, Cookie, and any other request field.
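A minimal sketch of such a User-Agent check follows. The deny-list pattern and the function name `looks_like_bot` are assumptions made for illustration, not taken from any particular site, and the check stops only crawlers that leave their default User-Agent in place.

```python
import re
from typing import Mapping

# Hypothetical deny-list: substrings that common HTTP tools put in their
# default User-Agent. A real deployment would maintain its own list.
SUSPICIOUS_UA = re.compile(
    r"(python-requests|curl|wget|scrapy|phantomjs)", re.IGNORECASE
)


def looks_like_bot(headers: Mapping[str, str]) -> bool:
    """Return True when the User-Agent is missing or names a known tool."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").strip()
    if not user_agent:
        return True
    return bool(SUSPICIOUS_UA.search(user_agent))
```

A crawler that copies a browser's User-Agent string passes this check unchanged, which is why it can only serve as a first, cheap filter.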

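More advanced server-side detection fingerprints the full set of request headers rather than a single field. For example, PhantomJS 1.x is built on the Qt network stack, so its requests carry recognizable Qt characteristics that a server can detect and block.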
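As a rough illustration of looking at the whole header set, the sketch below reports which headers a mainstream browser would normally send but a request lacks. The list `EXPECTED_BROWSER_HEADERS` and the function name are assumptions; this is not the PhantomJS or Qt signature itself, only a generic consistency check in the same spirit.

```python
from typing import List, Mapping

# Assumption: mainstream browsers send these headers on ordinary page loads.
EXPECTED_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")


def missing_browser_headers(headers: Mapping[str, str]) -> List[str]:
    """List the expected browser headers that are absent from a request."""
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_BROWSER_HEADERS if name not in present]
```

A request that claims a browser User-Agent yet returns a non-empty list here is a candidate for closer inspection rather than proof of automation.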

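Another technique sets a token in a cookie on the HTTP response for the page and requires subsequent Ajax calls to send that token back. A crawler that requests the Ajax endpoint directly, without first loading the page, has no valid token and is rejected. Sites like Amazon employ this method.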
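One possible shape for such a token is a timestamp signed with a server-side secret, sketched below with Python's standard `hmac` module. The names `SECRET_KEY`, `TOKEN_TTL_SECONDS`, `issue_token`, and `verify_token`, and the choice to bind the token to a client identifier, are assumptions for illustration; the source does not describe how any particular site builds its token.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(32)  # hypothetical per-process secret
TOKEN_TTL_SECONDS = 300               # hypothetical token lifetime


def _sign(client_id, timestamp):
    message = f"{client_id}|{timestamp}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_token(client_id, now=None):
    """Build the cookie value to set when the HTML page is served."""
    timestamp = str(int(time.time() if now is None else now))
    return f"{timestamp}.{_sign(client_id, timestamp)}"


def verify_token(client_id, token, now=None):
    """Check the token that an Ajax request sends back."""
    current = time.time() if now is None else now
    try:
        timestamp, signature = token.split(".", 1)
        issued_at = int(timestamp)
    except ValueError:
        return False  # malformed token
    expected = _sign(client_id, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    return 0 <= current - issued_at <= TOKEN_TTL_SECONDS
```

This only rejects clients that skip the page load. A crawler that fetches the page first and replays the cookie within the lifetime still gets through.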

Client‑side JavaScript runtime detection

Modern sites can load their core content through asynchronous Ajax requests instead of embedding it in the initial HTML, which raises the barrier for crawlers. Headless browsers such as PhantomJS, SlimerJS, trifleJS, and the newer headless Chrome let crawlers render such pages, but they can be identified through various browser-specific checks.

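Typical headless-browser detection looks for differences in plugin objects, language settings, WebGL capabilities, hairline rendering, and the handling of images with a broken src. The notes below describe behavior reported for headless Chrome around the time of writing and may not hold for later versions:

Based on plugin object: headless Chrome exposed an empty navigator.plugins list, whereas a desktop browser normally reports at least one plugin.

Based on language: navigator.languages was empty in headless Chrome, whereas a desktop browser reports the user's preferred languages.

Based on WebGL: the WebGL vendor and renderer strings identified a software renderer ("Brian Paul" and "Mesa OffScreen") instead of real graphics hardware.

Based on hairline features: a feature test for hairline (sub-pixel) borders, such as the one Modernizr provides, failed in headless Chrome.

Based on erroneous img src: an image with a broken src had a width and height of 0 in headless Chrome, whereas a desktop browser still gives the placeholder a non-zero size.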

These checks can defeat most headless browsers, forcing attackers to modify browser engine code.
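The checks themselves run as JavaScript in the page, but the server still has to judge what the page reports. The sketch below assumes a hypothetical JSON report whose field names (`plugins_count`, `languages`, `webgl_vendor`, `webgl_renderer`, `broken_image_size`) are invented for this example; the WebGL strings and the zero-size image are values reported for headless Chrome of that period and are treated here as assumptions, not guarantees.

```python
from typing import Any, List, Mapping


def headless_signals(report: Mapping[str, Any]) -> List[str]:
    """Return the names of headless-browser signals found in a client report."""
    signals = []
    # A missing field is treated the same as an empty or zero value.
    if report.get("plugins_count", 0) == 0:
        signals.append("no_plugins")
    if not report.get("languages"):
        signals.append("no_languages")
    if (report.get("webgl_vendor") == "Brian Paul"
            and report.get("webgl_renderer") == "Mesa OffScreen"):
        signals.append("software_webgl")
    # Width and height measured for an <img> whose src failed to load.
    if tuple(report.get("broken_image_size", ())) == (0, 0):
        signals.append("zero_size_broken_image")
    return signals
```

Counting several signals instead of blocking on a single one reduces false positives, since a real browser can legitimately trip one of them (for example, a browser with no plugins installed).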

Further client-side fingerprinting compares the User-Agent string with the properties of native DOM and BOM objects to verify that they are consistent with the browser the client claims to be. Attackers may respond by injecting JavaScript that fakes these properties.

One subtle evasion technique wraps native APIs in proxy functions and also overrides toString, so that the wrappers still report "[native code]" and slip past simple native-code detection.

Anti‑crawling silver bullet

CAPTCHAs remain the most reliable defense, especially behavior-based solutions like Google reCAPTCHA, which can verify mouse or touch interactions without demanding text entry.

Sites typically block offending IP addresses or apply aggressive CAPTCHA challenges, pushing attackers to purchase proxy pools, thereby raising the economic cost of scraping.
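Deciding which IP addresses are "offending" usually starts with counting requests. The class below is a minimal sliding-window counter; the class name, the default limits, and the in-memory storage are assumptions for illustration, and a production setup would more likely keep the counters in a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within the last window_seconds."""

    def __init__(self, max_requests=60, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        current = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and current - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: block or escalate to a CAPTCHA
        hits.append(current)
        return True
```

A `False` result does not have to mean an outright block. It can instead trigger the CAPTCHA challenge described above, which is less harmful when many legitimate users share one address.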

Robots protocol

Legitimate crawlers can be guided by a robots.txt file, which uses Allow and Disallow directives to mark the paths a crawler may or may not fetch (GitHub's robots.txt is one example). However, this is a voluntary convention: well-behaved crawlers such as commercial search engines respect it, while malicious scrapers simply ignore it.
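To show how a well-behaved crawler applies these directives, the sketch below parses a small, made-up robots.txt with Python's standard `urllib.robotparser`. The rules, the `ExampleBot` name, and the example.com URLs are hypothetical and are not GitHub's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: everything is crawlable except the /private/ tree.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """Return True if the parsed robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleBot", "https://example.com/index.html"))
print(may_fetch("ExampleBot", "https://example.com/private/data"))
```

Nothing forces a client to run this check before requesting a page, which is exactly why robots.txt cannot stop a scraper that chooses to ignore it.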

Conclusion

Web crawling and anti‑crawling are an ongoing cat‑and‑mouse game; no single technique can completely block crawlers, but layered defenses increase the effort required for unauthorized scraping.
