Bypass Anti‑Scraping Measures with Python’s requests_html

This article walks through a real‑world Python web‑scraping case. It explains why plain requests calls failed and how the requests_html library got past the site's anti‑scraping defenses, and it provides a complete code example with analysis.

Python Crawling & Data Mining

Jan 31, 2022

Idea

Many websites implement anti‑scraping mechanisms that block simple requests calls. When faced with such a site, you can either locate the underlying API that the page's JavaScript calls, or switch to a more capable tool such as requests_html, which can render JavaScript on demand and simplifies extraction.
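The second option can be sketched as follows. This is a minimal sketch rather than the author's code: the helper name fetch_html and its session parameter are hypothetical, and the import is deferred so the function can be exercised with a stand-in session. Note that requests_html only executes JavaScript when render() is called explicitly, and the first call downloads a Chromium build.

```python
def fetch_html(url, render=False, session=None, timeout=20):
    """Fetch a page with requests_html and optionally render its JavaScript.

    `fetch_html` and the `session` parameter are illustrative names,
    not part of the original article.
    """
    if session is None:
        # Third-party dependency, imported lazily: pip install requests-html
        from requests_html import HTMLSession
        session = HTMLSession()

    response = session.get(url)
    if render:
        # Runs the page's JavaScript in headless Chromium and
        # replaces response.html with the rendered document.
        response.html.render(timeout=timeout)
    return response.html.html


# Example call (requires network access and requests_html installed):
# html = fetch_html("https://example.com", render=True)
```

Skipping render() keeps the request cheap, so it is worth trying the plain session.get() first and rendering only when the data is missing from the response.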

Analysis

The initial attempt using plain requests returned HTML that differed drastically from the page source seen in a browser. Adding a custom User‑Agent string and other headers did not help, which suggests that the server detects and blocks non‑browser requests. Switching to requests_html returned a page that contains the desired data, although, as a comment in the code notes, its source still differs from what the browser shows.
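Before writing a parser, it can help to check whether a fetched response contains the data at all. The sketch below is an assumption about how one might do that: it counts occurrences of the titleName key that the article's regex relies on. The helper name summarize_response is hypothetical.

```python
import re

# The key that precedes each company name in the fetched source,
# taken from the regex used in the article's code.
MARKER = re.compile(r'titleName":"')


def summarize_response(html):
    """Return quick facts for judging whether a response holds the data."""
    return {
        "length": len(html),
        "marker_hits": len(MARKER.findall(html)),
    }


# A blocked or placeholder page would typically report zero hits.
print(summarize_response("<html><body>Access denied</body></html>"))
```

Comparing these numbers for the plain requests response and the requests_html response gives a quick signal of which fetch method is actually getting through.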

Code

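# Author: @有点意思
import json
import re

import requests_html


def 抓取源码(url):
    """Fetch the search page and return its HTML source."""
    user_agent = requests_html.user_agent()
    session = requests_html.HTMLSession()
    headers = {
        "cookie": "...",
        "User-Agent": user_agent
    }
    r = session.get(url, headers=headers)
    html = r.html.html
    # Note: the source fetched here differs from the page source
    # you see when opening the page manually.
    return html


def 解密(列表):
    """Convert Unicode-escaped string literals into Chinese text."""
    print(列表)
    # json.loads decodes the quoted, escaped strings without running
    # eval on content that came from a remote server.
    return [json.loads(i) for i in 列表]


def 解析页面(html):
    """Extract the company names from the fetched HTML."""
    # Note: write the regex against the HTML returned by 抓取源码,
    # not against the page source shown in the browser.
    # The capture group must include the surrounding quotes,
    # otherwise decoding in 解密 fails.
    公司列表 = re.findall(r'titleName":(".*?")', html, re.DOTALL)
    return 解密(公司列表)


if __name__ == "__main__":
    # No packet capture needed: the URL is the page a user sees when searching.
    url = "https://某某查网站/s?q=%E4%B8%8A%E6%B5%B7%E5%99%A8%E6%A2%B0%E5%8E%82&t=0"
    html = 抓取源码(url)
    print(html)
    公司列表 = 解析页面(html)
    print(公司列表)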

The script uses Chinese identifiers, as requested by the original author; Python 3 accepts non‑ASCII identifiers, so this does not affect execution.
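The extraction and decoding steps can be tried offline. The sketch below runs the article's regex against a made-up fragment (sample_html is an assumption about what the escaped source looks like, not captured output) and decodes the match with json.loads. It also shows that the q parameter in the search URL is simply the percent-encoded search term.

```python
import json
import re
from urllib.parse import quote, unquote

# Made-up fragment imitating the escaped JSON embedded in the fetched page.
sample_html = r'<script>{"titleName":"\u4e0a\u6d77\u5668\u68b0\u5382","id":1}</script>'


def extract_names(html):
    """Find quoted titleName values and decode their Unicode escapes."""
    quoted = re.findall(r'titleName":(".*?")', html, re.DOTALL)
    return [json.loads(item) for item in quoted]


names = extract_names(sample_html)
print(names)  # ['上海器械厂']

# The q parameter of the article's URL is the UTF-8 percent-encoded search term.
encoded = "%E4%B8%8A%E6%B5%B7%E5%99%A8%E6%A2%B0%E5%8E%82"
print(unquote(encoded))            # 上海器械厂
print(quote(names[0]) == encoded)  # True
```

One caveat: the non-greedy pattern stops at the first double quote, so a name containing an escaped quote would be cut short and fail to decode.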

Conclusion

The author demonstrates a practical solution for sites that block plain requests calls, recommends the requests_html approach, and notes that Selenium is a slower alternative. Readers are encouraged to try the code and adapt it to similar scraping challenges.
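For comparison, the Selenium alternative might look like the sketch below. It is not from the original article: the helper name fetch_with_selenium and its driver parameter are hypothetical, and it assumes Chrome is installed and a recent Selenium 4 release is available. Starting a full browser for each fetch is what makes this route slower.

```python
def fetch_with_selenium(url, driver=None):
    """Load a page in a real browser and return the rendered source.

    `fetch_with_selenium` and the `driver` parameter are illustrative names.
    """
    own_driver = driver is None
    if own_driver:
        # Third-party dependency, imported lazily: pip install selenium
        from selenium import webdriver

        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")  # run without a visible window
        driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Only close a browser this function started itself.
        if own_driver:
            driver.quit()


# Example call (requires Chrome and selenium installed):
# html = fetch_with_selenium("https://example.com")
```

The returned page source can be passed to the same parsing step, though the regex may need adjusting because the browser-rendered source differs from what requests_html fetches.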

Written by

Python Crawling & Data Mining
