
Request Header Spoofing and Anti‑Anti‑Scraping Techniques for Web Crawlers

This article explains how to disguise a web crawler's identity by customizing request headers, managing request frequency with sleep and proxy settings, and tackling common anti‑scraping mechanisms such as captchas, dynamic loading, and encrypted content using tools like Selenium.

Python Programming Learning Circle

Even small, little-known websites often check request headers to verify a visitor's identity, and large sites do so more rigorously. Forgetting to set headers can get a crawler blocked, so we need to teach crawlers to disguise themselves and behave like ordinary users.

Custom Request Headers

Modify User‑Agent to pretend to be a real browser.

Set Referer to indicate the page you came from, which some sites validate.

Include Cookie data; sometimes sending cookies can "bribe" the server into returning full information.

Inspect real request headers via browser developer tools (F12) and use them in your code.

```python
import requests

headers = {
    'Referer': 'https://accounts.pixiv.net/login?lang=zh&source=pc&view_type=page&ref=wwwtop_accounts_index',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
}
r = requests.get("https://segmentfault.com/a/1190000014383966", headers=headers)
```

Usually these two headers alone are sufficient, and it is strongly recommended to set a User-Agent on every request.
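To make a single crawler look less like one fixed fingerprint, a common refinement is to rotate the User-Agent per request. A minimal sketch; the User-Agent strings in the pool are illustrative values, and the actual request call is left as a comment:

```python
import random

# A small pool of real-browser User-Agent strings (illustrative values;
# collect current ones from your own browser's developer tools).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/16.1 Safari/605.1.15',
]

def build_headers(referer):
    """Assemble request headers with a randomly chosen User-Agent."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Referer': referer,
    }

# Then pass the result to requests as before:
# r = requests.get(url, headers=build_headers('https://example.com/'))
```

Each call picks a fresh User-Agent, so consecutive requests do not all carry the same browser signature.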

Reducing Main IP Access Frequency

Sleep: pause between requests for a while to lower server load and avoid detection.

IP proxy: route requests through different proxy servers so no single IP draws attention; reliable proxies often cost money.

```python
import time

time.sleep(60)  # pause for 60 seconds

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
r = requests.get(url, headers=headers, proxies=proxies)
```
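Putting the two ideas together, a crawl loop can sleep a randomized interval between requests (fixed intervals are easier to detect) and cycle through a proxy pool. A sketch under the assumption that you already have a list of working proxies; the addresses below are placeholders:

```python
import itertools
import random
import time

# Placeholder proxy pool; a real crawler would load verified proxies here.
PROXY_POOL = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
]
proxy_cycle = itertools.cycle(PROXY_POOL)  # round-robin over the pool

def polite_delay(min_s=1.0, max_s=5.0):
    """Pick a random pause length between min_s and max_s seconds."""
    return random.uniform(min_s, max_s)

# The crawl loop would then look roughly like:
# for url in urls:
#     time.sleep(polite_delay())
#     r = requests.get(url, headers=headers, proxies=next(proxy_cycle))
```

`itertools.cycle` hands out the proxies in order and wraps around, which spreads requests evenly across the pool.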

Anti‑Anti‑Scraping (Brief Analysis)

Even after perfect header spoofing, you may still fail to obtain the correct page because of advanced anti‑scraping mechanisms that require higher observation and analysis skills.

Random captcha: the page generates a random code that must be submitted with the request; often the code can be found in the page source and sent back.

Obfuscated URLs: URLs contain long, meaningless strings; the usual workaround is Selenium.

Encrypted or messy source: the data you need is present but obscured; you must work out how to extract it.

Dynamic loading: additional content loads only after user interaction; use Selenium, or capture packets manually to find the target links.

Ajax: asynchronous loading means a plain request retrieves only the initial HTML; again, Selenium or packet analysis is required.

Note: Selenium can simulate a real browser and is powerful but relatively slow.
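When packet capture reveals the Ajax endpoint behind a dynamically loaded page, you can often skip Selenium entirely and request that endpoint directly; it usually returns JSON, which is far easier to parse than HTML. A hedged sketch; the endpoint URL and the `items`/`title` field names are hypothetical, so inspect the real response in your browser's Network tab first:

```python
import json

def extract_titles(raw):
    """Parse the JSON body an Ajax endpoint returns and pull out one field.

    The 'items'/'title' structure here is hypothetical; the real layout
    comes from inspecting the endpoint's actual response.
    """
    payload = json.loads(raw)
    return [item['title'] for item in payload.get('items', [])]

# In practice you would fetch the endpoint with spoofed headers first:
# r = requests.get('https://example.com/api/list?page=1', headers=headers)
# titles = extract_titles(r.text)

sample = '{"items": [{"title": "first"}, {"title": "second"}]}'
print(extract_titles(sample))  # → ['first', 'second']
```

Hitting the JSON endpoint directly is also much faster than driving a full browser through Selenium.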

In summary, header spoofing follows clear patterns: add the appropriate headers and code snippets and you are done. Anti-anti-scraping, by contrast, is open-ended and requires time-consuming analysis to devise custom solutions.

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
