Master Web Scraping with Python requests‑html: Install, Basics & Advanced Tips
This tutorial introduces Python's requests‑html library, covering installation, basic page fetching, link extraction, element selection with CSS and XPath, rendering JavaScript, pagination, direct HTML usage, custom request options, form login, and practical crawling examples.
Python has a famous HTTP library called requests, and its author has released a new library named requests‑html, which combines HTTP requests with HTML parsing in a single, easy‑to‑use package.
Installation
Install requests‑html with a single command. It requires Python 3.6 or newer because it uses type annotations introduced in that version.
pip install requests-html
Basic Usage
Fetching a page
The library integrates HTTP fetching and HTML parsing, so you can retrieve a page and immediately work with its DOM.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.qiushibaike.com/text/')
print(r.html.html)  # raw HTML
Getting Links
The links and absolute_links attributes return all links and absolute links found in the page, excluding anchors.
# Get links
print(r.html.links)
print(r.html.absolute_links)
Finding Elements
You can select elements using CSS selectors via find or XPath via xpath. The find method accepts five parameters:
selector – CSS selector string
clean – boolean, removes style and script tags when true
containing – returns elements containing the given text
first – boolean, returns only the first match when true
_encoding – encoding format
# CSS selector example
print(r.html.find('div#menu', first=True).text)
print(r.html.find('div#menu a'))
print(list(map(lambda x: x.text, r.html.find('div.content span'))))
XPath works similarly, with parameters selector, clean, first, and _encoding:
print(r.html.xpath("//div[@id='menu']", first=True).text)
print(r.html.xpath("//div[@id='menu']/a"))
print(r.html.xpath("//div[@class='content']/span/text()"))
Element Content
For a specific element you can access its text, attributes, and raw HTML:
e = r.html.find('div#hd_logo', first=True)
print(e.text) # element text
print(e.attrs) # attribute dict
print(e.html) # raw HTML of the element
print(e.search('糗事{}科')[0])  # template search (parse-style pattern, not a regex)
print(e.absolute_links)
print(e.links)
Advanced Usage
JavaScript Support
Pages rendered by JavaScript can be processed by calling r.html.render(), which downloads a Chromium binary (via pyppeteer) on first use.
r = session.get('http://python-requests.org/')
r.html.render()
print(r.html.search('Python 2 will retire in only {months} months!')['months'])
The render method accepts parameters such as retries, script, wait, scrolldown, sleep, reload, and keep_page to control rendering behavior.
Smart Pagination
Iterating over paginated results is straightforward using the iterator protocol:
r = session.get('https://reddit.com')
for html in r.html:
    print(html)
# Get next page URL
print(r.html.next())
Direct HTML Usage
You can create an HTML object from a string without making a network request:
from requests_html import HTML
doc = "<a href='https://httpbin.org'>link</a>"
html = HTML(html=doc)
print(html.links)
Custom Requests
All session methods accept **kwargs to pass extra arguments to the underlying requests call, allowing custom headers, proxies, etc.
# Change User-Agent
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/62.0'
r = session.get('http://httpbin.org/get', headers={'user-agent': ua})
print(r.html.html)
Form Login Simulation
POST requests can be used to submit forms:
r = session.post('http://httpbin.org/post', data={'username':'yitian','passwd':123456})
print(r.html.html)
Crawling Examples
Below are concise scripts demonstrating real‑world scraping tasks.
Scrape Jianshu User Articles
r = session.get('https://www.jianshu.com/u/7753478e1554')
r.html.render(scrolldown=50, sleep=0.2)
titles = r.html.find('a.title')
for i, title in enumerate(titles):
    print(f"{i+1} [{title.text}](https://www.jianshu.com{title.attrs['href']})")
Scrape Tianya Forum Thread
# Get author name and total pages
url = 'http://bbs.tianya.cn/post-culture-488321-1.shtml'
r = session.get(url)
author = r.html.find('div.atl-info span a', first=True).text
pages_div = r.html.find('div.atl-pages', first=True)
links = pages_div.find('a')
# Pager links: the second-to-last one is read as the last page number
total_page = 1 if not links else int(links[-2].text)
title = r.html.find('span.s_title span', first=True).text
# Mode 'x' creates the file and fails if it already exists
with open(f"{title}.txt", 'x', encoding='utf-8') as f:
    for i in range(1, total_page + 1):
        page_url = f"{url.rsplit('-', 1)[0]}-{i}.shtml"
        r = session.get(page_url)
        # Keep only the posts whose _host attribute matches the thread author
        items = r.html.find(f"div.atl-item[_host={author}]")
        for item in items:
            content = item.find('div.bbs-content', first=True).text
            # Skip posts that start with '@'
            if not content.startswith('@'):
                f.write(content + "\n")
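The page-URL construction used in the forum-thread script can be checked in isolation, with no network access. A minimal sketch, where page_urls is a hypothetical helper name and the page count of 3 is an arbitrary value:

```python
def page_urls(first_url, total_pages):
    """Build the URL of every page in a thread from the first page's URL."""
    # Drop the trailing '-1.shtml' part, keeping everything before the last '-'
    prefix = first_url.rsplit('-', 1)[0]
    return [f"{prefix}-{i}.shtml" for i in range(1, total_pages + 1)]


urls = page_urls('http://bbs.tianya.cn/post-culture-488321-1.shtml', 3)
for u in urls:
    print(u)
```

Splitting on the last hyphen only works because the page number is the final hyphen-separated part of the URL, as it is in the thread URL above.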
MaGe Linux Operations
