How to Bypass Common Anti‑Scraping Measures: Headers, Behavior, and Dynamic Pages

This article summarizes common anti‑scraping techniques—including header checks, user‑behavior detection, and dynamic page defenses—and provides practical ways to circumvent them using custom headers, IP proxies, request timing, and tools like Selenium with PhantomJS to simulate real browsers.

21CTO

0x01 Common Anti‑Scraping Techniques

I recently crawled a website with many anti‑scraping measures, which made the crawl difficult; it took some time to get past them. Below I summarize the anti‑scraping strategies I have encountered and the countermeasure for each.

Crawlers generally consist of data acquisition, processing, and storage; this article focuses only on the data acquisition part.

Websites typically defend against scraping from three angles: request headers, user behavior, and site structure or data loading methods (e.g., AJAX). Most sites rely on the first two; the third is used by AJAX‑heavy sites and makes crawling harder.

0x02 Bypassing Header‑Based Anti‑Scraping

Header‑based anti‑scraping is the most common. Many sites check the User‑Agent header, and some also verify the Referer (for example, hotlink protection on static resources). To get past these checks, add the headers to your crawler directly: copy a real browser's User‑Agent string, or set the Referer to the target site's domain.
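The header fix above can be sketched as follows. This is a minimal illustration using only the standard library; the User‑Agent string, the `build_request` helper, and the example.com URLs are assumptions for illustration, not values from the original article.

```python
import urllib.request

# Assumption: any User-Agent string copied from a real browser will do;
# this one is only a placeholder.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, referer=None):
    """Return a Request that carries browser-like headers (hypothetical helper)."""
    headers = {"User-Agent": BROWSER_UA}
    if referer is not None:
        # Sites with hotlink protection usually expect their own domain here.
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/list?page=1",
                    referer="https://example.com/")
# To actually send it:
# html = urllib.request.urlopen(req, timeout=10).read()
```

The same dictionary can be passed as the `headers` argument of `requests.get` if you prefer the requests library.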

0x03 User‑Behavior Based Anti‑Scraping

Some sites monitor user behavior, such as multiple rapid requests from the same IP to the same page or repeated actions from the same account within a short period.

For the first case, IP proxies usually solve the problem. You can write a small crawler that collects public proxy IPs, validates them, and stores the working ones, then switch to a different proxy every few requests. This is easy to implement with requests or urllib2 (urllib.request in Python 3).
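A rough sketch of rotating proxies every few requests is shown here. The `ProxyRotator` class, the `opener_for` helper, and the proxy addresses (taken from a documentation‑only IP range) are hypothetical; it assumes you already have a list of validated proxies.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hand out the same proxy for a few requests, then move to the next one."""

    def __init__(self, proxies, requests_per_proxy=3):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = itertools.cycle(proxies)
        self._limit = requests_per_proxy
        self._used = 0
        self._current = next(self._cycle)

    def next_proxy(self):
        # Switch once the current proxy has served its quota of requests.
        if self._used >= self._limit:
            self._current = next(self._cycle)
            self._used = 0
        self._used += 1
        return self._current


def opener_for(proxy):
    """Build a urllib opener that routes HTTP and HTTPS through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


# Placeholder addresses; replace with proxies you have validated yourself.
rotator = ProxyRotator(
    ["http://192.0.2.10:8080", "http://192.0.2.11:8080"],
    requests_per_proxy=2,
)
# For each page:
# html = opener_for(rotator.next_proxy()).open(url, timeout=10).read()
```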

For the second case, wait a random few seconds between requests. On sites with flawed rate‑limiting logic, you can also log out and back in after every few requests to get around the per‑account limit.
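The random delay can be sketched like this. The 1 to 5 second range and the `crawl` and `random_delay` helpers are assumptions; tune the range to the site. The `sleep` parameter exists only so the loop can be exercised without actually waiting.

```python
import random
import time


def random_delay(min_seconds=1.0, max_seconds=5.0):
    """Pick a random pause length in seconds (range is an assumed default)."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a random one before the rest.
            sleep(random_delay())
        results.append(fetch(url))
    return results
```

Here `fetch` is whatever function downloads a page in your crawler, for example one built on requests or urllib.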

0x04 Dynamic Page Anti‑Scraping

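The cases above mostly concern static pages. On many sites, however, the data is fetched via AJAX or generated by JavaScript. Start by analyzing the network requests with a tool such as Firebug or HttpFox (or the Network panel of your browser's developer tools). If you can locate the AJAX call and work out its parameters and the meaning of its response, you can replicate it directly with requests or urllib2 and parse the JSON response.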
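As a sketch of replaying an AJAX call once you have found it: the endpoint, the query parameters, and the JSON shape (`items` with a `title` field) below are all invented for illustration, so substitute whatever the network analysis shows for your target.

```python
import json
import urllib.parse
import urllib.request


def build_ajax_request(endpoint, params, referer):
    """Recreate the request the page's JavaScript would send (hypothetical helper)."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{endpoint}?{query}",
        headers={
            # Many AJAX libraries send this header; some servers check for it.
            "X-Requested-With": "XMLHttpRequest",
            "Referer": referer,
            "Accept": "application/json",
        },
    )


def parse_items(body):
    """Extract titles from a JSON body shaped like {"items": [{"title": ...}]}."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]


req = build_ajax_request(
    "https://example.com/api/list",  # assumed endpoint
    {"page": 2, "size": 20},         # assumed parameters
    referer="https://example.com/",
)
# body = urllib.request.urlopen(req, timeout=10).read()
# titles = parse_items(body)
```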

However, some sites encrypt all AJAX parameters, so you cannot realistically construct the request yourself. In such cases I use Selenium with PhantomJS: the browser engine runs the page's JavaScript, and the script simulates human interactions (form filling, button clicking, scrolling) from start to finish to retrieve the data, without caring how the requests and responses work internally.
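A minimal sketch of the browser‑driven approach follows. The article uses PhantomJS; this example assumes headless Chrome instead, because newer Selenium releases have dropped the PhantomJS driver. The helper names, the scroll count, and the pause length are assumptions. The driver is passed in as an argument, so any Selenium WebDriver instance should work.

```python
import time


def make_headless_chrome():
    """Create a headless Chrome driver (assumes Selenium and Chrome are installed)."""
    # Imported lazily so the rest of this module loads without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver, scrolls=3, pause=1.0, sleep=time.sleep):
    """Load a page, scroll to trigger lazy loading, and return the rendered HTML."""
    driver.get(url)
    for _ in range(scrolls):
        # Scroll to the bottom so the page's own JavaScript requests more data.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # crude wait for the new content to arrive
    return driver.page_source


# Usage:
# driver = make_headless_chrome()
# try:
#     html = fetch_rendered_html("https://example.com/feed", driver)
# finally:
#     driver.quit()
```

A fixed pause is the simplest thing that can work; Selenium's explicit waits are a more reliable way to detect that new content has loaded.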

This approach gets past most anti‑scraping mechanisms because it does not merely pose as a browser by spoofing headers; it is a browser (PhantomJS is simply a browser without a user interface). Selenium + PhantomJS can do more besides, such as handling click‑type or slide‑type captchas and brute‑forcing page forms, and it is also useful in automated penetration testing.

Original source: JianShu URL: http://www.cnblogs.com/bsdr/p/5151891.html