How to Bypass Common Anti‑Scraping Measures: Headers, Behavior, and Dynamic Pages
This guide outlines the main anti-scraping techniques used by websites (header validation, user-behavior monitoring, and dynamic content loading) and describes practical ways to work around each: header spoofing, IP proxy rotation, request throttling, and Selenium/PhantomJS browser automation.
Common Anti‑Scraping Techniques
Websites typically employ three categories of anti-scraping defenses: inspection of request headers, monitoring of user behavior (e.g., rapid repeated accesses from the same IP or account), and protection of data loaded via AJAX or other dynamic mechanisms.
Bypassing Header Checks
The most frequent defense is checking the User-Agent and sometimes the Referer header. To evade this, simply copy a real browser’s User‑Agent string into your crawler’s request headers and set the Referer to the target domain when required.
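As a minimal sketch of the header approach, the snippet below builds a request that carries a browser-like User-Agent and an optional Referer. It uses Python's standard-library urllib.request; the User-Agent string, the helper name build_request, and the example.com URLs are illustrative assumptions, not values from the original article.

```python
import urllib.request

# Example desktop-browser User-Agent; copy a current one from your own browser.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, referer=None):
    """Return a urllib Request carrying browser-like headers (hypothetical helper)."""
    headers = {"User-Agent": BROWSER_USER_AGENT}
    if referer is not None:
        # Some sites only check that the Referer points at their own domain.
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


# Placeholder URLs; nothing is sent until urllib.request.urlopen(req) is called.
req = build_request("https://example.com/list", referer="https://example.com/")
```

Passing req to urllib.request.urlopen would send the request; that call is left out so the sketch runs without network access.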
Bypassing Behavior‑Based Checks
Two typical behavior checks are:
Multiple rapid requests to the same page from a single IP address.
Multiple identical actions from the same account within a short time frame.
For the first case, use a pool of proxy IPs: fetch public proxy lists, validate each proxy, store the working ones, and switch to a different proxy every few requests. In Python this is easily done with requests or urllib2 (urllib.request in Python 3).
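The rotation step could look roughly like the sketch below, which switches to the next proxy after a fixed number of requests. It assumes the proxy list has already been fetched and validated; the class name ProxyRotator, the helper build_opener_for, and the proxy addresses (taken from a documentation-only IP range) are hypothetical, and the standard-library urllib.request stands in for requests or urllib2.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Cycle through validated proxies, switching after a fixed number of requests."""

    def __init__(self, proxies, requests_per_proxy=3):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._limit = requests_per_proxy
        self._used = 0
        self._current = next(self._cycle)

    def next_proxy(self):
        """Return the proxy for the next request, rotating once the limit is reached."""
        if self._used >= self._limit:
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        return self._current


def build_opener_for(proxy):
    """Build a urllib opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


# Placeholder proxy addresses, not real servers.
rotator = ProxyRotator(
    ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
    requests_per_proxy=2,
)
sequence = [rotator.next_proxy() for _ in range(5)]
```

For each request, build_opener_for(rotator.next_proxy()).open(url) would send it through the current proxy. A fuller version would also drop proxies that start failing, which this sketch does not attempt.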
For the second case, introduce random delays of a few seconds between requests, or log out and log back in to reset per‑account limits before continuing.
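The random-delay idea can be sketched as follows. The 2 to 5 second range is an assumed reading of "a few seconds", and the function names polite_pause and crawl, along with the injectable fetch and sleep parameters, are hypothetical conveniences rather than anything prescribed by the original article.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random interval in [min_seconds, max_seconds]; return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing randomly between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, one pause before each later one.
            polite_pause(sleep=sleep)
        results.append(fetch(url))
    return results
```

Taking sleep as a parameter keeps the pacing logic easy to exercise without actually waiting; in real use the default time.sleep applies.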
Handling Dynamic Pages
When data is generated via JavaScript or fetched through encrypted AJAX calls, static request simulation often fails. First, analyze the network traffic with tools like Firebug or HttpFox (the Network panel in current browser developer tools serves the same purpose) to locate the AJAX endpoints and their parameters. If the parameters are not encrypted, you can replicate the AJAX request with requests and parse the JSON response.
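A minimal sketch of replicating an unencrypted AJAX call is shown below. The endpoint, the page and size parameters, and the response shape are all hypothetical stand-ins for whatever the network inspector reveals on a real site. The standard-library urllib.request is used here in place of requests so the example stays self-contained.

```python
import json
import urllib.parse
import urllib.request


def build_ajax_request(endpoint, params):
    """Build a GET request that mimics the page's own AJAX call."""
    url = endpoint + "?" + urllib.parse.urlencode(params)
    headers = {
        "User-Agent": "Mozilla/5.0",
        # Many JavaScript libraries send this header with AJAX calls.
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def parse_items(body):
    """Extract titles from a JSON body shaped like {"items": [{"title": ...}]} (assumed shape)."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]


# Hypothetical endpoint and parameters.
req = build_ajax_request("https://example.com/api/items", {"page": 2, "size": 20})
# body = urllib.request.urlopen(req).read().decode("utf-8")  # network call, not run here
sample_body = '{"items": [{"title": "first"}, {"title": "second"}]}'
titles = parse_items(sample_body)
```

On a real site, the parameter names and the JSON structure should be copied from the captured request and response rather than guessed.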
If the AJAX parameters are encrypted or the site heavily obfuscates its API, resort to a real browser automation framework. Using selenium together with PhantomJS (a headless browser) lets you execute JavaScript, fill forms, click buttons, and scroll pages just as a human would, which gets past many anti-scraping measures. Note that PhantomJS is no longer maintained and Selenium 4 dropped support for it; headless Chrome or Firefox fills the same role today.
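The browser-automation route might be sketched as below. Because PhantomJS is no longer maintained, this example assumes Selenium 4 driving headless Chrome instead, which is a substitution and not the setup the original article describes. The function name fetch_rendered_html, the default selector, and the timeout are hypothetical, and running it requires Selenium and a local Chrome installation.

```python
def fetch_rendered_html(url, css_selector="body", wait_seconds=10):
    """Load `url` in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element that holds the dynamic data is present.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        # Scroll to the bottom to trigger content that loads lazily on scroll.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

A call such as fetch_rendered_html("https://example.com/", css_selector="#results") would return the rendered markup, which can then be parsed like any static page. The selector "#results" is a placeholder for whichever element the target page fills in.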
This approach can also be applied to challenges such as click-based or sliding CAPTCHAs, form brute-forcing, and other interactive protections, which makes it a versatile option for complex sites.
Source: zhihu (original article linked at http://www.36dsj.com/archives/44191).
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
