How to Bypass Common Anti‑Scraping Measures: Headers, Behavior, and Dynamic Pages

This guide outlines the main anti‑scraping techniques used by websites—including header validation, user‑behavior monitoring, and dynamic content loading—and provides practical methods such as header spoofing, IP proxy rotation, request throttling, and Selenium/PhantomJS automation to overcome them.

ITPUB

Common Anti‑Scraping Techniques

Websites typically employ three categories of anti‑scraping defenses: inspection of request headers, monitoring of user behavior (e.g., rapid repeated accesses from the same IP or account), and protection of data loaded via AJAX or other dynamic mechanisms.

Bypassing Header Checks

The most frequent defense is checking the User-Agent and sometimes the Referer header. To evade this, simply copy a real browser’s User‑Agent string into your crawler’s request headers and set the Referer to the target domain when required.
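A minimal sketch of the header setup described above, using only the standard library so it has no dependencies. The User‑Agent string, the Referer value, and the example.com URL are placeholder assumptions; copy the real values from your own browser and target site.

```python
import urllib.request

# Assumed values: take the User-Agent from your own browser and set the
# Referer to the site you are requesting. example.com is a placeholder.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Referer": "https://example.com/",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a Request carrying browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=dict(headers))


def fetch(url, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

With the requests library the same idea is a single call: requests.get(url, headers=BROWSER_HEADERS).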

Bypassing Behavior‑Based Checks

Two typical behavior checks are:

Multiple rapid requests to the same page from a single IP address.

Multiple identical actions from the same account within a short time frame.

For the first case, use a pool of proxy IPs: fetch public proxy lists, validate them, store them, and rotate to a new proxy after every few requests. This is easy to do in Python with requests, or with urllib2 on Python 2 (urllib.request on Python 3).
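The rotation step could be sketched roughly as follows. The ProxyRotator class, its parameter names, and the proxy addresses in the usage note are illustrative assumptions, not part of the original article; fetching and validating the proxy list is left out.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hand out proxies in round-robin order, switching every few requests."""

    def __init__(self, proxies, requests_per_proxy=3):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._limit = requests_per_proxy
        self._used = 0
        self._current = next(self._cycle)

    def next_proxy(self):
        """Return the proxy for the next request, advancing once the limit is hit."""
        if self._used >= self._limit:
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        return self._current


def open_via_proxy(url, proxy, timeout=10):
    """Fetch url through the given proxy and return the response body as bytes."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()
```

A typical loop would create rotator = ProxyRotator(["http://203.0.113.10:8080", "http://203.0.113.11:8080"]) and call open_via_proxy(url, rotator.next_proxy()) for each URL; those addresses come from a documentation-only range and stand in for validated proxies.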

For the second case, introduce random delays of a few seconds between requests, or log out and log back in to reset per‑account limits before continuing.
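The random delay can be as small as the sketch below. The function names and the 2 to 5 second default range are assumptions chosen to match "a few seconds"; the fetch argument is whatever download function you already use.

```python
import random
import time


def polite_delay(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random interval in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def crawl(urls, fetch, min_seconds=2.0, max_seconds=5.0):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            polite_delay(min_seconds, max_seconds)
        results.append(fetch(url))
    return results
```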

Handling Dynamic Pages

When data is generated via JavaScript or encrypted AJAX calls, static request simulation often fails. First, analyze network traffic with tools like Firebug or HttpFox (both now discontinued; the Network panel in a current browser's built‑in developer tools does the same job) to locate AJAX endpoints and their parameters. If the parameters are not encrypted, you can replicate the AJAX request with requests and parse the JSON response.
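As a rough illustration of replaying an unencrypted AJAX call, the sketch below builds the query string and parses the JSON body using the standard library. The endpoint, the page and size parameters, and the {"data": {"items": [...]}} response shape are hypothetical; substitute whatever the network inspector shows for the real site.

```python
import json
import urllib.parse
import urllib.request


def build_ajax_url(endpoint, params):
    """Append URL-encoded query parameters to the endpoint."""
    return endpoint + "?" + urllib.parse.urlencode(params)


def parse_items(body):
    """Extract the item list from a JSON body.

    Assumes a hypothetical shape of {"data": {"items": [...]}}.
    """
    payload = json.loads(body)
    return payload.get("data", {}).get("items", [])


def fetch_items(endpoint, params, timeout=10):
    """Replay the AJAX request and return the parsed items."""
    request = urllib.request.Request(
        build_ajax_url(endpoint, params),
        headers={
            # Many sites set this header on their own AJAX calls.
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_items(resp.read().decode("utf-8"))
```

With requests, the equivalent is requests.get(endpoint, params=params).json() followed by the same dictionary lookups.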

If the AJAX parameters are encrypted or the site heavily obfuscates its API, fall back to a real browser automation framework. Driving a headless browser with selenium lets you execute JavaScript, fill forms, click buttons, and scroll pages just as a human would, which gets past most of the checks described above. The original setup paired selenium with PhantomJS; PhantomJS is no longer maintained, so headless Chrome or Firefox is the usual substitute today.

This approach can also solve challenges such as click‑based or sliding CAPTCHAs, form brute‑forcing, and other interactive protections, making it a versatile solution for complex sites.
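The page-rendering part of the browser automation approach might look like the sketch below. It swaps in headless Chrome for PhantomJS and assumes Selenium 4 with a local Chrome installation; the function name, the CSS selector, and the 10 second timeout are illustrative choices.

```python
def fetch_rendered_html(url, wait_selector, timeout=10):
    """Load url in headless Chrome, wait for an element, return the final HTML.

    wait_selector is a CSS selector for an element that only appears once
    the page's JavaScript has finished loading the data.
    """
    # Imported lazily so this module can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript-generated element is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned HTML can then be parsed like any static page, for example fetch_rendered_html("https://example.com/list", "div.item") with a placeholder URL and selector.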

Source: zhihu (original article linked at http://www.36dsj.com/archives/44191).
