How to Bypass Common Anti‑Scraping Measures with Scrapy

This guide explains why websites employ anti‑scraping defenses, outlines the most common header checks such as User‑Agent, Referer, and Cookies, and provides practical Scrapy code snippets for rotating user agents, managing proxies, handling X‑Forwarded‑For, limiting request rates, and dealing with dynamic AJAX content using Selenium or PhantomJS.

ITPUB

May 2, 2017

Why Anti‑Scraping Matters

In the era of big data, many companies protect their sites with anti‑scraping mechanisms to prevent data theft, but overly strict defenses can also block legitimate users. Balancing strong protection with low false‑positive rates raises development costs.

Header Validation

Simple anti‑scraping checks examine HTTP request headers such as User‑Agent, Referer, and Cookie.

User‑Agent Rotation

Scrapy can randomize the User‑Agent header in a downloader middleware. Example middleware:

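import random

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # USER_AGENT_LIST is a custom setting holding browser User-Agent strings
        user_agents = spider.settings.getlist('USER_AGENT_LIST')
        if user_agents:
            # Assign directly so an already-set User-Agent header is overwritten
            request.headers['User-Agent'] = random.choice(user_agents)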

Each request then carries a User‑Agent string picked at random from USER_AGENT_LIST, which should contain real browser strings.
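For the middleware to take effect it has to be registered in the project settings. A minimal sketch of `settings.py` follows. The module path `myproject.middlewares`, the order value, and the two sample strings are assumptions for illustration, not part of the original article:

```python
# settings.py (sketch; module path, order value, and list contents are assumptions)

# Custom setting read by RandomUserAgentMiddleware
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0',
]

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Hypothetical project path; the order value is chosen for illustration
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}
```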

Referer Handling

The Referer header indicates the source page. Scrapy automatically sets it when a URL is extracted from a previously crawled page, but it can also be overridden manually.
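One way to set the header manually is a small downloader middleware that fills in a Referer whenever a request does not already carry one. This is a sketch under assumptions rather than the article's own code: the class name `DefaultRefererMiddleware` is hypothetical, and using the target site's root URL as the Referer is just one plausible choice.

```python
from urllib.parse import urlsplit


class DefaultRefererMiddleware(object):
    """Hypothetical downloader middleware: supply a Referer if none is set."""

    def process_request(self, request, spider):
        parts = urlsplit(request.url)
        site_root = '%s://%s/' % (parts.scheme, parts.netloc)
        # setdefault leaves an existing Referer untouched
        request.headers.setdefault('Referer', site_root)
        return None  # let Scrapy continue processing the request
```

For a single request, passing `headers={'Referer': ...}` when constructing `scrapy.Request` achieves the same thing without a middleware.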

Cookies Management

Some sites rate‑limit by session, keyed on a cookie such as session_id. Setting COOKIES_ENABLED = False disables cookie handling globally, so no cookies are sent. If a site requires cookies, you can capture the Set‑Cookie headers from responses and resend their values in subsequent requests.
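The capture‑and‑resend step can be sketched with the standard library alone. The helper name `cookie_header_from_set_cookie` is hypothetical. In a Scrapy callback the input would come from `response.headers.getlist('Set-Cookie')`, decoded to text.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build a Cookie request-header value from Set-Cookie response values."""
    pairs = []
    for raw in set_cookie_values:
        jar = SimpleCookie()
        jar.load(raw)  # parses strings such as "name=value; Path=/"
        for name, morsel in jar.items():
            # Keep only name=value; attributes such as Path are not sent back
            pairs.append('%s=%s' % (name, morsel.value))
    return '; '.join(pairs)
```

With cookie handling disabled, the returned string can be sent as the `Cookie` header of the next request.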

IP Rate Limiting and Proxies

When an IP address sends requests too quickly, rate‑limiting defenses trigger. You can slow the crawl down or rotate proxies. To route a request through a proxy in Scrapy:

request.meta['proxy'] = 'http://' + proxy_host + ':' + proxy_port
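For the other option, slowing the crawl down, Scrapy exposes throttling through its settings. A sketch with illustrative values follows. The numbers are assumptions to tune per site, not recommendations from the original article:

```python
# settings.py (sketch; all values are illustrative assumptions)

DOWNLOAD_DELAY = 2                  # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait 0.5x to 1.5x DOWNLOAD_DELAY, not a fixed interval
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # at most one request in flight per domain

AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed response latency
AUTOTHROTTLE_START_DELAY = 2        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30         # upper bound on the delay when latency is high
```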

Acquiring a large pool of proxies often requires a custom scraper that periodically fetches free proxy lists, validates them, and maintains a dynamic proxy pool.
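Such a pool can be sketched as a small class that counts failures and evicts proxies that keep failing. Everything here is an assumption made for illustration: the class name `ProxyPool`, the `max_failures` threshold, and the injected `check` callable, which stands in for whatever validation request a real pool would make.

```python
import random


class ProxyPool(object):
    """Hypothetical in-memory pool of 'host:port' proxies."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self._failures = {}  # proxy -> consecutive failure count

    def refresh(self, candidates, check):
        """Add candidates that pass check(proxy) -> bool; skip known ones."""
        for proxy in candidates:
            if proxy not in self._failures and check(proxy):
                self._failures[proxy] = 0

    def get(self):
        """Return a random live proxy, or None if the pool is empty."""
        if not self._failures:
            return None
        return random.choice(list(self._failures))

    def report(self, proxy, ok):
        """Record a request outcome and evict after repeated failures."""
        if proxy not in self._failures:
            return
        if ok:
            self._failures[proxy] = 0
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self.max_failures:
            del self._failures[proxy]
```

A downloader middleware could then set `request.meta['proxy']` from `pool.get()` and call `report` with the outcome of each request.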

For authenticated proxies, encode credentials in Base64 and set the Proxy-Authorization header:

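import base64
import random

# Each line of proxies.txt has the form user:pass@ip:port
proxy_string = random.choice(self._get_proxies_from_file('proxies.txt'))
# Split on the last '@' so passwords containing '@' still parse
user_pass, host_port = proxy_string.rsplit('@', 1)
request.meta['proxy'] = 'http://%s' % host_port
# b64encode adds no trailing newline, unlike the deprecated encodestring
encoded = base64.b64encode(user_pass.encode('utf-8'))
request.headers['Proxy-Authorization'] = b'Basic ' + encoded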

Handling Dynamic AJAX Content

Many modern sites load data via AJAX, returning JSON that can be fetched directly if the API endpoint is known. When AJAX calls are protected by backend authentication, use PhantomJS or Selenium to render the page and capture the generated content.
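When the endpoint returns JSON, the spider callback only needs to decode the body, with no HTML parsing involved. A sketch, assuming a hypothetical payload shape with an `items` list whose entries carry `id` and `title` fields (real endpoints will differ):

```python
import json


def parse_items(body_text):
    """Yield (id, title) pairs from a JSON body shaped like {"items": [...]}."""
    data = json.loads(body_text)
    for item in data.get('items', []):
        yield item['id'], item['title']


# In a Scrapy callback this would be called as parse_items(response.text)
sample = '{"items": [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]}'
print(list(parse_items(sample)))  # [(1, 'first'), (2, 'second')]
```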

When switching to Selenium, you must re‑apply any custom headers because Scrapy’s downloader no longer handles them:

from selenium import webdriver

headers = {...}
for key, value in headers.items():
    # PhantomJS reads each custom header from its own capability entry
    webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.%s' % key] = value

Specify the PhantomJS executable path explicitly to avoid environment‑variable issues in scheduled jobs:

driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
