Master Python Web Scraping with requests_html: A Step-by-Step Guide

Learn how to overcome anti‑scraping defenses by using Python's requests_html library to fetch and parse dynamic web pages, with a complete code example that extracts company names from a target site, plus tips on handling cookies, headers, and Unicode decoding.

Python Crawling & Data Mining

Sep 22, 2024

Hello, I'm a Python enthusiast. Recently a fan asked a detailed question about Python web crawling, which I will address.

1. Idea

Many websites block plain requests calls. The usual options are to find the backend interface that the page's JavaScript calls and query it directly, or to use a tool that can render JavaScript. Here we choose the latter, using the requests_html library.

2. Analysis

Fetching the page directly with requests returned HTML that did not match what the browser shows, which indicates anti‑scraping measures. Adding a custom User‑Agent and other headers did not help, so we switched to requests_html.
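A quick way to confirm the mismatch is to search the fetched HTML for a string that is visible in the browser. The sketch below is a hypothetical helper and not part of the original script: the name `looks_blocked`, the marker, and both sample pages are assumptions made up for illustration.

```python
def looks_blocked(html: str, expected_marker: str) -> bool:
    """Return True when text visible in the browser is missing from the fetched HTML.

    expected_marker is any string you can see on the real page,
    for example a field name or part of a company name.
    """
    return expected_marker not in html


# Hypothetical responses: a stub page served to a script versus the real page.
stub_page = "<html><body><script>window.location.reload()</script></body></html>"
real_page = '<html><body><script>{"titleName":"Example Co"}</script></body></html>'

print(looks_blocked(stub_page, "titleName"))  # True
print(looks_blocked(real_page, "titleName"))  # False
```

Running a check like this against each fetching method makes it easier to tell which one actually returns the data you need.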

3. Code

Below is the full scraping script.

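# Author: @有点意思
import json
import re

import requests_html


def 抓取源码(url):
    # Fetch the page source through an HTMLSession with browser-like headers
    user_agent = requests_html.user_agent()
    session = requests_html.HTMLSession()
    headers = {
        "cookie": "...",
        "User-Agent": user_agent
    }
    r = session.get(url, headers=headers)
    html = r.html.html
    return html  # Note: this source differs from what you see when opening the page manually


def 解密(列表):  # convert Unicode escapes into Chinese characters
    print(列表)
    # json.loads decodes each quoted "\uXXXX" literal; unlike eval,
    # it never executes scraped text as code
    return [json.loads(i) for i in 列表]


def 解析页面(html):
    公司列表 = re.findall(r'titleName":(".*?")', html, re.DOTALL)
    # Note: write the regex against the HTML returned by 抓取源码, not the browser's page source
    # The capture group must include the surrounding quotes, otherwise decoding raises an error
    return 解密(公司列表)


if __name__ == "__main__":
    # No packet capture needed: this url is simply the page a user sees when searching
    url = "https://某某查网站/s?q=%E4%B8%8A%E6%B5%B7%E5%99%A8%E6%A2%B0%E5%8E%82&t=0"
    html = 抓取源码(url)
    print(html)
    公司列表 = 解析页面(html)
    print(公司列表)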

The script uses Chinese identifiers as provided by the original author; this does not affect execution.
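The `q` parameter in the script's URL is the search keyword, percent-encoded as UTF-8. The following sketch shows how that value can be produced and decoded with the standard library. The variable names are illustrative assumptions, and because the site's host is elided in the article, only the path and query string are built here.

```python
from urllib.parse import quote, unquote

keyword = "上海器械厂"  # the search term encoded in the article's URL

# quote() percent-encodes the UTF-8 bytes of the keyword.
encoded = quote(keyword)
print(encoded)

# Only the path and query are assembled; the host is left out on purpose.
path_and_query = f"/s?q={encoded}&t=0"
print(path_and_query)

# unquote() reverses the encoding, which helps when reading a URL copied from the browser.
print(unquote("%E4%B8%8A%E6%B5%B7%E5%99%A8%E6%A2%B0%E5%8E%82"))
```

Swapping in a different keyword yields the query string for another search without editing the encoded text by hand.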

Running the program prints the extracted fields.
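The parsing and decoding steps can be tried offline on a small sample. The fragment below is a made-up stand-in for the real response, shaped to fit the script's regular expression; the sample data and variable names are assumptions. The sketch decodes with `json.loads`, which assumes each captured value is a JSON string literal.

```python
import json
import re

# Hypothetical fragment shaped like the pattern the script's regex targets.
sample_html = '<script>{"list":[{"titleName":"\\u4e0a\\u6d77\\u5668\\u68b0\\u5382"}]}</script>'

# Capture each value together with its surrounding quotes.
raw_names = re.findall(r'titleName":(".*?")', sample_html, re.DOTALL)

# json.loads turns the quoted "\uXXXX" literal into readable text.
names = [json.loads(item) for item in raw_names]
print(names)  # ['上海器械厂']
```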

4. Summary

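If requests cannot retrieve the desired page, try the requests_html approach shown above. The library can render JavaScript‑generated content when r.html.render() is called on the response, and it can get past simple anti‑scraping measures; the script above relies on the session, cookie, and User‑Agent alone. Selenium is another option, though slower.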
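For pages whose content only appears after JavaScript runs, a rendering step can be added. This is a minimal sketch rather than the original author's code: the function name `fetch_rendered`, the optional `session` parameter, and the timeout value are assumptions. It uses the `render()` method of requests_html, which downloads a headless Chromium build the first time it runs.

```python
def fetch_rendered(url, session=None, timeout=20):
    """Fetch url and return its HTML after JavaScript has run."""
    if session is None:
        # Imported lazily so the function can also be exercised with a stand-in session.
        from requests_html import HTMLSession
        session = HTMLSession()
    r = session.get(url)
    # render() loads the page in headless Chromium and replaces
    # r.html with the rendered markup.
    r.html.render(timeout=timeout)
    return r.html.html
```

Calling `fetch_rendered(url)` with a real URL needs network access and is noticeably slower than a plain request, so it is best kept for pages where the plain fetch is missing the data.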


Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
