
Master Web Scraping with Python requests‑html: Install, Basics & Advanced Tips

This tutorial introduces Python's requests‑html library, covering installation, basic page fetching, link extraction, element selection with CSS and XPath, rendering JavaScript, pagination, direct HTML usage, custom request options, form login, and practical crawling examples.

MaGe Linux Operations

Sep 18, 2020

Python has a famous HTTP library called requests, and its author has released a new library named requests‑html, which combines HTTP requests with HTML parsing in a single, easy‑to‑use package.

Installation

Install requests‑html with a single command. It requires Python 3.6 or newer because it uses type annotations introduced in that version.

pip install requests-html

Basic Usage

Fetching a page

The library integrates HTTP fetching and HTML parsing, so you can retrieve a page and immediately work with its DOM.

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.qiushibaike.com/text/')
print(r.html.html)  # raw HTML
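
Before parsing, it usually helps to fail fast on HTTP errors. The following is a minimal sketch, assuming a hypothetical helper named fetch_html and a 10-second timeout (neither appears in the original article); raise_for_status() and the timeout argument come from the underlying requests API.

```python
def fetch_html(session, url, timeout=10):
    """Fetch url with an HTMLSession-like object and return its parsed HTML.

    fetch_html is a hypothetical helper, not part of requests-html.
    """
    r = session.get(url, timeout=timeout)
    r.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return r.html
```

With a real session this would be called as fetch_html(HTMLSession(), 'https://www.qiushibaike.com/text/').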

Getting Links

The links attribute returns every link exactly as written in the page, while absolute_links returns the same links resolved against the page URL. Both are sets, and both skip anchor-only links such as #top.

# Get links
print(r.html.links)
print(r.html.absolute_links)
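
The difference between the two attributes can be illustrated with the standard library. This is only a sketch of the idea using urllib.parse.urljoin as a stand-in, not the library's own code; the base URL and hrefs are made-up placeholders.

```python
from urllib.parse import urljoin

base = "https://example.com/text/"  # placeholder page URL
hrefs = ["/about", "page/2/", "https://other.example/x", "#top"]

# Skip anchor-only links, then resolve the rest against the page URL.
absolute = {urljoin(base, h) for h in hrefs if not h.startswith("#")}
print(sorted(absolute))
```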

Finding Elements

You can select elements using CSS selectors via find or XPath via xpath. The find method accepts five parameters:

selector – CSS selector string

clean – boolean, removes style and script tags when true

containing – returns elements containing the given text

first – boolean, returns only the first match when true

_encoding – encoding format

# CSS selector example
print(r.html.find('div#menu', first=True).text)
print(r.html.find('div#menu a'))
print([span.text for span in r.html.find('div.content span')])
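
One detail worth guarding against: with first=True, find returns None when nothing matches, so chaining .text directly raises AttributeError on a page whose layout has changed. A small sketch, assuming a hypothetical helper named first_text:

```python
def first_text(html, selector, default=""):
    """Return the text of the first match, or default when nothing matches.

    first_text is a hypothetical helper; html is any object with a
    requests-html style find() method.
    """
    el = html.find(selector, first=True)  # None when the selector matches nothing
    return el.text if el is not None else default
```

For example, first_text(r.html, 'div#menu') returns an empty string instead of raising when the menu is missing.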

XPath works similarly, with parameters selector, clean, first, and _encoding:

print(r.html.xpath("//div[@id='menu']", first=True).text)
print(r.html.xpath("//div[@id='menu']/a"))
print(r.html.xpath("//div[@class='content']/span/text()"))
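
To see how the XPath expressions map onto a document without hitting the network, the sketch below uses the standard library's xml.etree.ElementTree on a made-up, well-formed snippet. This is a stand-in: ElementTree supports only a subset of XPath (no text(), for instance) and needs well-formed markup, whereas requests-html relies on lxml. Note also that in requests-html an expression ending in /text() yields plain strings rather than Element objects.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mirrors the selectors used above.
doc = """<html><body>
<div id="menu"><a href="/hot/">Hot</a><a href="/text/">Text</a></div>
<div class="content"><span>first joke</span></div>
<div class="content"><span>second joke</span></div>
</body></html>"""

root = ET.fromstring(doc)
# Same shape as //div[@id='menu']/a
menu_links = [a.get("href") for a in root.findall(".//div[@id='menu']/a")]
# Same shape as //div[@class='content']/span, reading .text instead of text()
jokes = [s.text for s in root.findall(".//div[@class='content']/span")]
print(menu_links)
print(jokes)
```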

Element Content

For a specific element you can access its text, attributes, and raw HTML:

e = r.html.find('div#hd_logo', first=True)
print(e.text)          # element text
print(e.attrs)         # attribute dict
print(e.html)          # raw HTML of the element
print(e.search('糗事{}科')[0])  # template search with a {} placeholder (parse-style), not a regex
print(e.absolute_links)
print(e.links)
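
Note that search takes a template with {} placeholders rather than a regular expression; requests-html delegates this to the parse library. The sketch below is a rough standard-library approximation of the idea, not the library's implementation, and template_search is a hypothetical name. Each placeholder is turned into a non-greedy capture group.

```python
import re

def template_search(template, text):
    """Approximate a parse-style search: each {} or {name} captures a short run of text."""
    parts, pos = [], 0
    for m in re.finditer(r"\{([A-Za-z_]\w*)?\}", template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text before the placeholder
        name = m.group(1)
        parts.append(f"(?P<{name}>.+?)" if name else "(.+?)")
        pos = m.end()
    parts.append(re.escape(template[pos:]))  # trailing literal text
    return re.search("".join(parts), text)

m = template_search("Python 2 will retire in only {months} months!",
                    "Python 2 will retire in only 7 months!")
print(m.group("months"))
```

Named placeholders are read back by name and bare {} placeholders by position, which mirrors the ['months'] and [0] lookups used in this article.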

Advanced Usage

JavaScript Support

Pages rendered by JavaScript can be processed by calling r.html.render(), which downloads a Chromium binary (via pyppeteer) on first use.

r = session.get('http://python-requests.org/')
r.html.render()
print(r.html.search('Python 2 will retire in only {months} months!')['months'])

The render method accepts parameters such as retries, script, wait, scrolldown, sleep, reload, and keep_page to control rendering behavior.
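
As a sketch of how these options combine, the hypothetical wrapper below (render_scrolled is not part of the library) pages down a fixed number of times with a pause after each step, which is the pattern infinite-scroll pages tend to need. The default values are assumptions chosen for illustration.

```python
def render_scrolled(html, scrolls=10, pause=0.5, timeout=20):
    """Render html with JavaScript, paging down `scrolls` times.

    html is an object with a requests-html style render() method.
    Expect roughly scrolls * pause seconds of waiting on top of page load.
    """
    html.render(scrolldown=scrolls, sleep=pause, timeout=timeout)
    return html
```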

Smart Pagination

Iterating over paginated results is straightforward using the iterator protocol:

r = session.get('https://reddit.com')
for html in r.html:
    print(html)
# Get next page URL
print(r.html.next())
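
Because iterating over r.html keeps following next-page links until none is found, a loop over a large site can issue many requests. One way to bound it, sketched here with itertools.islice and a hypothetical helper name first_pages:

```python
from itertools import islice

def first_pages(html, limit=3):
    """Return at most `limit` pages from an iterable HTML object.

    islice stops pulling from the iterator once the limit is reached,
    so no further pages are fetched.
    """
    return list(islice(html, limit))
```

With a real response this would be used as: for page in first_pages(r.html, limit=5): print(page.url).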

Direct HTML Usage

You can create an HTML object from a string without making a network request:

from requests_html import HTML
doc = "<a href='https://httpbin.org'>link</a>"
html = HTML(html=doc)
print(html.links)

Custom Requests

All session methods accept **kwargs to pass extra arguments to the underlying requests call, allowing custom headers, proxies, etc.

# Change User-Agent
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/62.0'
r = session.get('http://httpbin.org/get', headers={'user-agent': ua})
print(r.html.html)
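
When several options are reused across calls, it can be tidier to build them once. The helper below is a hypothetical sketch (request_options is not a library function); headers, proxies, and timeout are standard requests keyword arguments, while the default timeout value is an assumption.

```python
def request_options(user_agent, proxy=None, timeout=10):
    """Build keyword arguments for session.get / session.post."""
    opts = {"headers": {"User-Agent": user_agent}, "timeout": timeout}
    if proxy:
        # Route both plain and TLS traffic through the same proxy.
        opts["proxies"] = {"http": proxy, "https": proxy}
    return opts
```

Usage would look like session.get('http://httpbin.org/get', **request_options(ua)).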

Form Login Simulation

POST requests can be used to submit forms:

r = session.post('http://httpbin.org/post', data={'username': 'yitian', 'passwd': 123456})
print(r.html.html)
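
A dict passed as data is sent as a URL-encoded form body. The standard library's urlencode shows the equivalent encoding for the same fields, which is handy when comparing against what a browser's network tab displays for a real login form:

```python
from urllib.parse import urlencode

form = {'username': 'yitian', 'passwd': 123456}
body = urlencode(form)  # non-string values are converted with str()
print(body)
```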

Crawling Examples

Below are concise scripts demonstrating real‑world scraping tasks.

Scrape Jianshu User Articles

r = session.get('https://www.jianshu.com/u/7753478e1554')
r.html.render(scrolldown=50, sleep=0.2)
titles = r.html.find('a.title')
for i, title in enumerate(titles):
    print(f"{i+1} [{title.text}](https://www.jianshu.com{title.attrs['href']})")
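
The string concatenation above assumes every href is site-relative. A slightly more defensive sketch, assuming a hypothetical helper named markdown_index, resolves each href with urljoin so that absolute links pass through unchanged:

```python
from urllib.parse import urljoin

def markdown_index(entries, base="https://www.jianshu.com"):
    """Turn (text, href) pairs into numbered Markdown links."""
    lines = []
    for i, (text, href) in enumerate(entries, start=1):
        lines.append(f"{i} [{text}]({urljoin(base, href)})")
    return lines
```

It could be fed from the scrape with markdown_index((t.text, t.attrs['href']) for t in titles).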

Scrape Tianya Forum Thread

# Get author name and total pages
url = 'http://bbs.tianya.cn/post-culture-488321-1.shtml'
r = session.get(url)
author = r.html.find('div.atl-info span a', first=True).text
pages_div = r.html.find('div.atl-pages', first=True)
links = pages_div.find('a')
total_page = 1 if not links else int(links[-2].text)
title = r.html.find('span.s_title span', first=True).text
with open(f"{title}.txt", 'x', encoding='utf-8') as f:
    for i in range(1, total_page+1):
        page_url = f"{url.rsplit('-', 1)[0]}-{i}.shtml"
        r = session.get(page_url)
        items = r.html.find(f"div.atl-item[_host={author}]")
        for item in items:
            content = item.find('div.bbs-content', first=True).text
            if not content.startswith('@'):
                f.write(content + "\n")

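Two pieces of the script above are easy to factor out and check in isolation: building the per-page URLs and turning the thread title into a file name. The sketch below uses hypothetical helper names (page_urls, safe_filename); the set of characters replaced is an assumption based on what common file systems reject.

```python
import re

def page_urls(first_url, total_pages):
    """Build the URL of every page from the first page's URL."""
    prefix = first_url.rsplit("-", 1)[0]  # drop the trailing "-1.shtml"
    return [f"{prefix}-{i}.shtml" for i in range(1, total_pages + 1)]

def safe_filename(title, replacement="_"):
    """Replace characters that are invalid in file names on common systems."""
    return re.sub(r'[\\/:*?"<>|]', replacement, title).strip() + ".txt"
```

Note that opening the output file with mode 'x', as the script does, raises FileExistsError if the file already exists, so a second run needs the old file removed first.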