How to Bypass Common Anti‑Scraping Measures with Scrapy
This guide explains why websites employ anti‑scraping defenses, outlines the most common header checks such as User‑Agent, Referer, and Cookies, and provides practical Scrapy code snippets for rotating user agents, managing proxies, handling X‑Forwarded‑For, limiting request rates, and dealing with dynamic AJAX content using Selenium or PhantomJS.
Why Anti‑Scraping Matters
In the era of big data, many companies protect their sites with anti‑scraping mechanisms to prevent data theft, but overly strict defenses can also block legitimate users. Balancing strong protection with low false‑positive rates raises development costs.
Header Validation
Simple anti‑scraping checks examine HTTP request headers such as User‑Agent, Referer, and Cookies.
User‑Agent Rotation
Scrapy can randomize the User‑Agent header in a downloader middleware. Example middleware:
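    import random

    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            # USER_AGENT_LIST is a custom setting: a list of real browser User-Agent strings.
            ua = random.choice(spider.settings['USER_AGENT_LIST'])
            if ua:
                # setdefault only fills the header when nothing has set it yet, so this
                # middleware must run before (or instead of) Scrapy's built-in one.
                request.headers.setdefault('User-Agent', ua)

This picks a real browser string at random for each request.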
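A middleware like this only takes effect once it is registered in the project settings. The fragment below is a minimal settings.py sketch, assuming the class lives in a hypothetical myproject/middlewares.py; the module path, the priority value, and the two User-Agent strings are illustrative assumptions, not values from the original article.

```python
# settings.py (sketch). The "myproject.middlewares" path is an assumption.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware so it cannot set the header first.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Hypothetical project path; 400 is an illustrative priority.
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}
```

Mapping the built-in middleware to None is what lets a setdefault-based middleware actually control the header.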
Referer Handling
The Referer header indicates the source page. Scrapy automatically sets it when a URL is extracted from a previously crawled page, but it can also be overridden manually.
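Requests that do not come from a previously crawled page, such as start URLs, carry no Referer. One possible way to fill the gap is a small downloader middleware; the sketch below is an assumption about how that could look (the class name DefaultRefererMiddleware is hypothetical), and it uses the site root as the fallback value.

```python
from urllib.parse import urlsplit


class DefaultRefererMiddleware:
    """Downloader middleware sketch: fill in Referer when a request has none."""

    def process_request(self, request, spider):
        parts = urlsplit(request.url)
        site_root = "%s://%s/" % (parts.scheme, parts.netloc)
        # setdefault leaves any Referer that Scrapy or the spider already set untouched.
        request.headers.setdefault("Referer", site_root)
```

For a single request, passing headers={'Referer': ...} when building the Request achieves the same override without a middleware.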
Cookies Management
Some sites limit requests based on the session_id cookie. Setting COOKIES_ENABLED = False disables Scrapy's cookie handling globally, so no cookies are stored or sent. If a site requires cookies, capture the Set‑Cookie values from responses and resend them in subsequent requests.
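Capturing and resending cookies by hand can be sketched with the standard library alone. The helper names below (cookies_from_set_cookie, cookie_header) are hypothetical, and the sketch assumes COOKIES_ENABLED = False, so the Cookie header is built manually rather than left to Scrapy.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(header_values):
    """Turn raw Set-Cookie header values into a {name: value} dict."""
    jar = {}
    for raw in header_values:
        # Scrapy exposes header values as bytes; accept str as well.
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        parsed = SimpleCookie()
        parsed.load(text)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar


def cookie_header(jar):
    """Build the value of a Cookie request header from a {name: value} dict."""
    return "; ".join("%s=%s" % (name, value) for name, value in jar.items())


# Inside a spider callback this could be used roughly as:
#   jar = cookies_from_set_cookie(response.headers.getlist("Set-Cookie"))
#   yield scrapy.Request(next_url, headers={"Cookie": cookie_header(jar)})
```

Attributes such as Path or HttpOnly are dropped on purpose, since only the name and value pairs are sent back to the server.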
IP Rate Limiting and Proxies
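When a single IP sends requests too quickly, anti‑scraping defenses trigger. You can slow the crawl down (for example with Scrapy's DOWNLOAD_DELAY setting or the AutoThrottle extension) or rotate proxies. To route a request through a proxy in Scrapy, set its proxy meta key: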
    request.meta['proxy'] = 'http://' + proxy_host + ':' + proxy_port

Acquiring a large pool of proxies often requires a custom scraper that periodically fetches free proxy lists, validates them, and maintains a dynamic proxy pool.
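The pool itself can be quite small. The following is a sketch under stated assumptions, not a production design: ProxyPool and RotatingProxyMiddleware are hypothetical names, PROXY_LIST is a hypothetical custom setting, and the periodic fetching and validation of free proxy lists is left out.

```python
import random


class ProxyPool:
    """In-memory pool that hands out proxies and drops ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures

    def pick(self):
        if not self._failures:
            raise RuntimeError("proxy pool is empty")
        return random.choice(list(self._failures))

    def report_failure(self, proxy):
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]

    def report_success(self, proxy):
        if proxy in self._failures:
            self._failures[proxy] = 0

    def __len__(self):
        return len(self._failures)


class RotatingProxyMiddleware:
    """Downloader middleware sketch: assigns a pooled proxy to each request."""

    def __init__(self, pool):
        self.pool = pool

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, e.g. ["http://10.0.0.1:8080"].
        return cls(ProxyPool(crawler.settings.getlist("PROXY_LIST")))

    def process_request(self, request, spider):
        request.meta["proxy"] = self.pool.pick()

    def process_exception(self, request, exception, spider):
        # Count the failure; returning None lets other middlewares handle the error.
        self.pool.report_failure(request.meta.get("proxy"))
```

Counting failures per proxy instead of removing a proxy on its first error keeps the pool from draining when a target site is briefly unreachable.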
For authenticated proxies, encode credentials in Base64 and set the Proxy-Authorization header:
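    import base64
    from random import choice

    # Inside a downloader middleware's process_request(self, request, spider).
    # Each line of proxies.txt has the form user:pass@ip:port
    proxy_string = choice(self._get_proxies_from_file('proxies.txt'))
    # Split on the last '@' so a password containing '@' stays intact.
    user_pass, host_port = proxy_string.rsplit('@', 1)
    request.meta['proxy'] = "http://%s" % host_port
    # b64encode works on bytes and, unlike the removed encodestring, adds no trailing newline.
    encoded = base64.b64encode(user_pass.encode('utf-8')).decode('ascii')
    request.headers['Proxy-Authorization'] = 'Basic ' + encoded

Handling Dynamic AJAX Content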
Many modern sites load data via AJAX, returning JSON that can be fetched directly if the API endpoint is known. When AJAX calls are protected by backend authentication, use PhantomJS or Selenium to render the page and capture the generated content.
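When the endpoint is known, the JSON can be parsed directly without rendering anything. The sketch below assumes a hypothetical payload shaped like {"items": [...]} with id and title fields; real endpoints will differ, so the field names are placeholders.

```python
import json


def parse_api_response(body):
    """Yield one dict per record from a JSON payload shaped like {"items": [...]}."""
    payload = json.loads(body)
    for record in payload.get("items", []):
        yield {"id": record.get("id"), "title": record.get("title")}


# In a spider callback this could be called as:
#   yield from parse_api_response(response.text)
sample = '{"items": [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]}'
print(list(parse_api_response(sample)))
```

Requesting the endpoint directly is usually much cheaper than driving a browser, which is why it is worth checking for one first.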
When switching to Selenium, you must re‑apply any custom headers because Scrapy’s downloader no longer handles them:
    from selenium import webdriver

    headers = {...}
    # dict.items() replaces the Python 2-only iteritems()
    for key, value in headers.items():
        webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.%s' % key] = value

Specify the PhantomJS executable path explicitly to avoid environment‑variable issues in scheduled jobs:
    driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')

Note that webdriver.PhantomJS is only available in Selenium 3.x and earlier. Selenium 4 removed it, so newer setups need a headless Chrome or Firefox driver instead.
ITPUB
