Master Python Web Scraping with requests_html: A Step-by-Step Guide
Learn how to overcome anti‑scraping defenses by using Python's requests_html library to fetch and parse dynamic web pages, with a complete code example that extracts company names from a target site, plus tips on handling cookies, headers, and Unicode decoding.
Hello, I'm a Python enthusiast. Recently a fan asked a detailed question about Python web crawling, which I will address.
1. Idea
Many websites block simple requests calls. The usual options are to locate a JavaScript API or to use a tool that can render JavaScript. Here we choose the latter, using the requests_html library.
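As a rough illustration of the rendering route, the sketch below fetches a page and asks requests_html to execute its JavaScript before returning the HTML. The function name `fetch_rendered_html` and its `session` parameter are hypothetical and exist only for this example; `render()` is the library call that drives a headless Chromium, which it downloads the first time it runs.

```python
def fetch_rendered_html(url, session=None, timeout=20):
    """Fetch a page and return its HTML after JavaScript has run.

    `session` is a hypothetical injection point: pass any object with a
    compatible get() method, or leave it as None to use HTMLSession.
    """
    if session is None:
        # Imported lazily so the helper can be exercised with a stand-in session.
        from requests_html import HTMLSession
        session = HTMLSession()
    response = session.get(url)
    # render() loads the page in headless Chromium and replaces
    # response.html with the DOM as it looks after scripts have executed.
    response.html.render(timeout=timeout)
    return response.html.html
```

Calling `fetch_rendered_html(url)` with a real URL needs network access and a working Chromium download, so treat the timeout value as a starting point rather than a recommendation.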
2. Analysis
Plain requests calls returned HTML that did not match what the browser shows, which indicates anti‑scraping measures. Adding a custom User‑Agent and other headers did not help, so we switched to requests_html, which can render JavaScript.
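One simple way to confirm this kind of mismatch is to check the fetched HTML for strings you can see in the browser. The helper below is a minimal sketch; the name `missing_markers` and the sample block-page text are assumptions made for illustration, not output from the target site.

```python
def missing_markers(html, expected_markers):
    """Return the markers that do NOT appear in the fetched HTML."""
    return [marker for marker in expected_markers if marker not in html]


# Hypothetical sample: what an anti-scraping response might look like.
fetched = "<html><body>Please verify you are human</body></html>"
print(missing_markers(fetched, ["titleName", "Please verify"]))  # ['titleName']
```

An empty result suggests the response carries the data you expect; a non-empty one means the regex in the parsing step has nothing to match.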
3. Code
Below is the full scraping script.
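# Author: @有点意思
import re
import requests_html


def 抓取源码(url):
    """Fetch the page source for the given URL."""
    user_agent = requests_html.user_agent()
    session = requests_html.HTMLSession()
    headers = {
        "cookie": "...",
        "User-Agent": user_agent
    }
    r = session.get(url, headers=headers)
    html = r.html.html
    return html  # Note: this source differs from the source you see when opening the page manually


def 解密(列表):
    """Convert Unicode escape sequences into Chinese characters."""
    print(列表)
    # eval runs each matched string literal as Python code, so only use it on text you trust
    return [eval(i) for i in 列表]


def 解析页面(html):
    """Extract the company names from the fetched HTML."""
    # Note: write the regex against the HTML returned by 抓取源码, not the browser's page source
    # The capture group must include the surrounding quotes, otherwise eval raises an error
    公司列表 = re.findall(r'titleName":(".*?")', html, re.DOTALL)
    return 解密(公司列表)


if __name__ == "__main__":
    # No packet capture needed: this URL is simply the page a user gets when searching
    url = "https://某某查网站/s?q=%E4%B8%8A%E6%B5%B7%E5%99%A8%E6%A2%B0%E5%8E%82&t=0"
    html = 抓取源码(url)
    print(html)
    公司列表 = 解析页面(html)
    print(公司列表)
The script uses Chinese identifiers as provided by the original author (抓取源码: fetch source, 解密: decode, 解析页面: parse page, 公司列表: company list, 列表: list); this does not affect execution.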
Running the program prints the extracted fields.
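To make the parsing and decoding step concrete without hitting the network, here is a small sketch that runs the same regex on a made-up sample string. It swaps the script's eval for json.loads, on the assumption that the matched text is a JSON string literal; the sample payload, `TITLE_PATTERN`, and `extract_company_names` are hypothetical names for this example, not part of the original script.

```python
import json
import re

# Same pattern as in the script: the capture group keeps the surrounding quotes.
TITLE_PATTERN = re.compile(r'titleName":(".*?")', re.DOTALL)


def extract_company_names(html):
    """Find quoted titleName values and decode their \\uXXXX escapes."""
    raw_literals = TITLE_PATTERN.findall(html)
    # json.loads turns '"\\u4e0a\\u6d77"' into the actual characters
    # without executing the matched text as code.
    return [json.loads(literal) for literal in raw_literals]


# Hypothetical fragment in the shape the regex expects (assumed, not real site output).
sample = r'{"titleName":"\u4e0a\u6d77\u5668\u68b0\u5382","id":1}'
print(extract_company_names(sample))  # ['上海器械厂']
```

Because the pattern is non-greedy and stops at the first closing quote, a name that itself contains an escaped quote would be cut short; that limitation applies to the original regex as well.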
4. Summary
If requests cannot retrieve the desired page, try the requests_html approach shown above, which gets past simple anti‑scraping measures. The library can also render JavaScript‑generated content through r.html.render(), a call the script above does not make. Selenium is another option, though slower.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!