Information Security

Multiple Anti‑Crawling Measures and Best Practices for Web Scraping

The article outlines several anti‑crawling techniques—including IP restrictions, User‑Agent validation, CAPTCHAs, AJAX loading, noscript tags, and cookie checks—while also offering practical advice for writing ethical, efficient, and robust web crawlers.


1. IP Restriction – Detect unusually frequent requests from a single IP and temporarily block that IP to prevent automated scraping.
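A minimal sketch of how such per-IP detection might work on the server side, using a sliding-window request counter (the class name, limits, and window size are illustrative assumptions, not from the original article):

```python
import time
from collections import defaultdict, deque

class IPRateLimiter:
    """Sliding-window counter: refuse an IP that exceeds
    max_requests within window_seconds (hypothetical limits)."""

    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # unusually frequent requests: likely a crawler
        q.append(now)
        return True
```

From the crawler's side, the usual countermeasure is simply to stay under such limits: add delays between requests and spread traffic politely.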

2. User‑Agent Validation – Identify the browser or client via its User‑Agent string; legitimate browsers have known patterns, while unknown or suspicious agents can be blocked. Example User‑Agent strings include:

<code>User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50</code>
<code>User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1</code>
<code>User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1</code>

Legitimate User‑Agents vary across browsers and operating systems; any request lacking a normal User‑Agent is likely a crawler and can be restricted.
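To pass such a check, a crawler sends a browser-like User-Agent instead of the library default. A minimal sketch with the standard library (the UA string below is one example; any real browser string works):

```python
import urllib.request

# A desktop-browser User-Agent string (example value; substitute any real one).
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0.0.0 Safari/537.36")

def make_request(url):
    """Build a request whose User-Agent mimics a normal browser.
    Without this header, urllib identifies itself as 'Python-urllib/3.x',
    which a User-Agent filter flags immediately."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
```

Usage: `urllib.request.urlopen(make_request("https://example.com/"))` sends the browser-like header with the request.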

3. CAPTCHA Verification – Deploy CAPTCHAs to block automated programs; modern CAPTCHAs use complex noise and distortion that are difficult for bots to solve.

4. AJAX Asynchronous Loading – Load content dynamically via JavaScript, making it harder for simple crawlers that do not execute JS to retrieve data.
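One common workaround, sketched below, is to skip the HTML entirely and call the JSON endpoint that the page's AJAX requests hit (visible in the browser's devtools network tab). The endpoint URL and the `items`/`title` key layout here are hypothetical; inspect the real response to find the actual structure:

```python
import json
import urllib.request

def parse_items(payload):
    """Extract records from a JSON response body.  The 'items'/'title'
    keys are assumptions -- check the real endpoint's layout."""
    doc = json.loads(payload)
    return [item["title"] for item in doc.get("items", [])]

def fetch_items(api_url):
    """Request the JSON API directly instead of executing JavaScript."""
    with urllib.request.urlopen(api_url, timeout=10) as resp:
        return parse_items(resp.read().decode("utf-8"))
```

When no clean API exists, a JavaScript-capable tool (e.g., a headless browser) is the fallback.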

5. &lt;noscript&gt; Tag Usage – Serve alternative content when JavaScript is disabled; many low‑level crawlers lack JS support, so combining &lt;noscript&gt; with AJAX can protect sensitive information.

6. Cookie Restrictions – Monitor cookie presence across multiple visits; if a client repeatedly lacks expected cookies, it may be a crawler and can be blocked.
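A server-side sketch of this idea, under the assumption that the site sets a `session_id` cookie on first contact (all names and thresholds here are illustrative): a real browser returns the cookie on its next request, while a client that repeatedly arrives without it gets blocked.

```python
from collections import defaultdict

class CookieGate:
    """Count how often each client arrives without the session cookie
    we previously set.  Real browsers return cookies on the second
    request; a client that never does is likely a crawler."""

    def __init__(self, max_missing=3):
        self.max_missing = max_missing
        self.missing_count = defaultdict(int)

    def check(self, client_ip, cookies):
        """cookies: dict of cookie names the client sent."""
        if "session_id" in cookies:
            self.missing_count[client_ip] = 0
            return "ok"
        self.missing_count[client_ip] += 1
        if self.missing_count[client_ip] > self.max_missing:
            return "block"
        return "set_cookie"  # respond with Set-Cookie and observe
```

On the crawler side, the counter-countermeasure is simply a cookie-aware session (e.g., `http.cookiejar` or a persistent session object) so cookies round-trip like a browser's.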

Crawler Development Tips

Respect ethical considerations and obey the robots.txt protocol.

Avoid infinite crawl loops by parsing and normalizing URLs with tools such as urllib.parse.urlparse.
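One way to apply this (a sketch; the normalization rules are a minimal illustrative set): canonicalize each URL before comparing it against the set of pages already queued, so trivially different spellings of the same page don't send the crawler in circles.

```python
from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    """Normalize a URL so equivalent spellings compare equal:
    lowercase scheme/host, strip trailing slash, drop #fragment."""
    p = urlparse(url)
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path,
                       p.params, p.query, ""))  # "" drops the fragment

seen = set()

def should_visit(url):
    """True only the first time a canonically-equal URL is seen."""
    key = canonicalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```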

Set explicit, reasonably short request timeouts rather than relying on library defaults (which can be very long or unlimited; the original article cites 200 seconds) so slow servers do not stall worker threads.

Implement efficient duplicate‑detection to avoid revisiting the same pages.
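Beyond URL-level deduplication, duplicate *content* (mirrors, alias URLs serving identical HTML) can be skipped by fingerprinting page bodies. A minimal sketch:

```python
import hashlib

seen_hashes = set()

def is_new_page(body):
    """Detect duplicate pages by content fingerprint, so identical HTML
    reached via different URLs is processed only once.  MD5 is fine
    here: we need a fast fingerprint, not cryptographic security."""
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

At large scale, an in-memory set grows unbounded; a Bloom filter or an on-disk store is the usual next step.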

Consider a download‑first, analysis‑later workflow to speed up crawling.

Guard against resource deadlocks in asynchronous or multithreaded code (e.g., acquire locks in a consistent order and always release them).

Use precise element selectors (e.g., XPath) to reduce dirty data.
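A sketch of the selector idea using the standard library (the HTML fragment and class names are invented for illustration). ElementTree supports only a small XPath subset; for full XPath on messy real-world HTML, lxml is the usual choice:

```python
import xml.etree.ElementTree as ET

HTML_FRAGMENT = """
<div>
  <span class="price">19.99</span>
  <span class="note">free shipping</span>
  <span class="price">4.50</span>
</div>
"""

def extract_prices(fragment):
    """Select exactly the elements wanted with an XPath-style path,
    instead of regex-matching the whole page -- this is what keeps
    dirty data out of the results."""
    root = ET.fromstring(fragment)
    return [el.text for el in root.findall(".//span[@class='price']")]
```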

Images illustrating User‑Agent examples and CAPTCHA samples are included in the original article.

Disclaimer: The content is compiled from online sources; copyright belongs to the original author. Please contact us for removal or authorization if any rights are infringed.

Tags: CAPTCHA, web security, anti-crawling, User-Agent, IP blocking, scraping
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
