
How to Bypass Anti‑Scraping Measures: Delays, Headers, Proxies & Distributed Crawling

This guide explains practical techniques to avoid IP bans and 403 errors when web‑scraping, covering explicit and implicit waiting, User‑Agent spoofing, proxy usage, IP pools, and distributed crawling architectures.

Python Programming Learning Circle

Method 1: Set a Waiting Time

Some websites flag bots by the pace of their requests: images downloaded too quickly, rapid-fire login attempts, or pages fetched faster than a human could read them.

Two kinds of waiting are common: explicit waiting (pausing for a fixed number of seconds) and implicit waiting (pausing until a condition is met).

1. Explicit waiting

<code>import time  # standard-library timing module
time.sleep(3)  # pause for 3 seconds between requests</code>

Crawling at off-peak hours (e.g., at night) and keeping the request rate modest both reduce the chance of detection.
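A fixed three-second pause is itself a detectable pattern. A randomized delay spreads requests unevenly, more like a human reader; here is a minimal Python 3 sketch (the helper name and the default bounds are my own choices):

```python
import random
import time

def polite_sleep(min_s=2.0, max_s=5.0):
    """Sleep a random interval so requests are not evenly spaced."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# call between requests; each pause differs within the bounds
polite_sleep(0.01, 0.02)  # short bounds here only for demonstration
```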

2. Implicit waiting

In Selenium, a WebDriverWait object's until() method pauses until the target page element has loaded before the script proceeds.

<code>from selenium.webdriver.support.ui import WebDriverWait

wait1 = WebDriverWait(driver, 10)  # poll for up to 10 seconds
wait1.until(lambda driver: driver.find_element_by_xpath("//div[@id='link-report']/span"))</code>

This prevents errors caused by missing elements when the crawler runs too fast.
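Outside Selenium, the same wait-until-ready idea is just a polling loop. A generic Python 3 sketch (the function name, timeout, and poll interval are assumptions of mine):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within timeout')
```

The crawler would pass in a check such as "the element I need is present", and either gets the element back or a clear timeout error instead of a crash on a missing element.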

Method 2: Modify Request Headers

The User‑Agent header is a key signal sites use to distinguish real browsers from scripts. Below is an example of setting it with Python 2's urllib2 (in Python 3 the equivalent module is urllib.request).

<code>import urllib2
req = urllib2.Request(url)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
response = urllib2.urlopen(req)</code>
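Hard-coding one User-Agent still produces an identical fingerprint on every request, so a common refinement is rotating among several browser strings. A Python 3 sketch (the list below mixes the article's Chrome string with two further illustrative examples; the helper name is mine):

```python
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.7 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.7',
    'Mozilla/5.0 (X11; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0',
]

def random_headers():
    """Build request headers with a User-Agent drawn at random."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```

In Python 3 these headers plug straight into `urllib.request.Request(url, headers=random_headers())`.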

Method 3: Use Proxy IP

When your IP is blocked, switch to a proxy.

Simple proxy example:

<code># -*- coding: utf-8 -*-
import urllib2
url = "http://www.ip181.com/"
proxy_support = urllib2.ProxyHandler({'http':'121.40.108.76'})
opener = urllib2.build_opener(proxy_support)
opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')]  # attach headers to the opener
urllib2.install_opener(opener)
response = urllib2.urlopen(url)
print response.read().decode('gbk')</code>

The test site http://www.ip181.com shows the detected IP, confirming the proxy works.

To avoid a single proxy failure, build an IP pool.

<code># -*- coding: utf-8 -*-
import urllib2
import random
ip_list=['119.6.136.122','114.106.77.14']
url = "http://www.ip181.com/"
proxy_support = urllib2.ProxyHandler({'http':random.choice(ip_list)})
opener = urllib2.build_opener(proxy_support)
opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')]  # attach headers to the opener
urllib2.install_opener(opener)
response = urllib2.urlopen(url)
print response.read().decode('gbk')</code>

Building an IP pool means collecting anonymous proxies, validating them (e.g., testing each against a simple status page), keeping the usable ones in a list, and pruning dead entries over time.
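The validate-and-prune cycle just described can be sketched as a small pool class. The health check is injected as a callable so it can be any test, such as a request to a status page; the class name and structure are my own illustration, not a standard library:

```python
import random

class ProxyPool:
    """Keep a list of candidate proxies, dropping the ones that fail a check."""

    def __init__(self, proxies, checker):
        self.proxies = list(proxies)
        self.checker = checker  # callable: proxy -> bool (True means usable)

    def get(self):
        """Return a random working proxy, pruning dead entries as they are found."""
        while self.proxies:
            proxy = random.choice(self.proxies)
            if self.checker(proxy):
                return proxy
            self.proxies.remove(proxy)  # dead: drop it from the pool
        raise RuntimeError('proxy pool exhausted')

# demonstration with a fake checker that only accepts one address
alive = {'114.106.77.14'}
pool = ProxyPool(['119.6.136.122', '114.106.77.14'], lambda p: p in alive)
```

In a real crawler the checker would attempt a short request through the proxy and return whether it succeeded.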

Method 4: Distributed Crawling

For large‑scale crawling systems, follow these steps:

1. Fetch pages with a basic HTTP framework such as Scrapy.

2. Avoid duplicate fetching with a Bloom filter.

3. Maintain a distributed queue shared across the cluster's machines.

4. Integrate the distributed queue with Scrapy.

5. Post-process the results: content extraction (e.g., python-goose) and storage (e.g., MongoDB).
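The de-duplication step above can be illustrated with a toy Bloom filter. Production crawlers use a tuned library, but the mechanics fit in a few lines; the sizes and hash construction here are arbitrary choices of mine:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, size=8192, hashes=3):
        self.size = size            # number of bits in the filter
        self.hashes = hashes        # number of bit positions per item
        self.bits = bytearray(size // 8)

    def _positions(self, url):
        # derive k bit positions from salted md5 digests of the URL
        for i in range(self.hashes):
            digest = hashlib.md5(('%d:%s' % (i, url)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

Before enqueuing a URL, the crawler tests `url in seen`; a hit means the URL was (almost certainly) fetched already, and a miss is always trustworthy.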

Conclusion

1. The proxy examples only work while the listed proxies are still alive; substitute fresh proxies when reproducing the experiments.

2. The author mainly combines header spoofing with proxy IPs; for JavaScript‑heavy sites, Selenium driving PhantomJS or Firefox is preferred.

Tags: proxy, Python, web scraping, Selenium, anti-scraping
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
