
Master Web Crawling with Python: From Requests to XPath Extraction

This guide walks you through the fundamentals of building a web crawler in Python: fetching pages with the Requests library, extracting data with regular expressions and XPath, and practical code examples for each step.


There are countless complex tutorials about web crawling, but the core idea is simple: download a page’s source code and extract the needed information.

The process consists of two parts: crawling and extraction.

How to Crawl?

The most popular Python library for HTTP requests is Requests, described as "HTTP for Humans". It supports connection pooling, cookies, file uploads, automatic encoding detection, and internationalized URLs.

Installing Requests

<code>pip install requests</code>

Most Python packages can be installed with pip. If the command is unavailable on Windows, add your Python installation's Scripts directory (e.g. C:\Python27\Scripts) to your PATH.

Using Requests

The basic calls are requests.get() and requests.post(). Additional methods such as head() and delete() exist but are used less often.

Responses expose .text (the body decoded to Unicode) and .content (the raw bytes).

<code>import requests

url = 'http://www.baidu.com'
html = requests.get(url)  # returns a Response object
print(html.text)          # body decoded to Unicode</code>
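requests.post() works the same way, typically with a data payload for form submissions. A minimal sketch that builds (but does not send) a POST request so you can inspect what Requests would transmit; the httpbin.org URL is just an illustrative stand-in:

```python
import requests

# Build a POST request without sending it, to see how the
# form data is encoded. httpbin.org is a placeholder URL here.
req = requests.Request('POST', 'http://httpbin.org/post',
                       data={'keyword': 'python'})
prepared = req.prepare()
print(prepared.method)  # POST
print(prepared.body)    # keyword=python
```

In everyday code you would simply call requests.post(url, data=...) and read the response's .text or .content.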

How to Extract?

Regular Expressions

The most common regex pattern in crawling is (.*?): the parentheses ( ) capture the desired content, . matches any single character, * repeats it zero or more times, and ? makes the repetition non-greedy, so it matches as little as possible.
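Why non-greedy matters is easiest to see side by side. A small demo (the `<b>` tags are just sample markup):

```python
import re

s = '<b>one</b><b>two</b>'
# Greedy .* runs to the LAST closing tag, swallowing everything between
print(re.findall('<b>(.*)</b>', s))   # ['one</b><b>two']
# Non-greedy .*? stops at the FIRST closing tag
print(re.findall('<b>(.*?)</b>', s))  # ['one', 'two']
```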

Typical usage in Python’s re module:

<code>import re

text = '<a href="www.baidu.com">...'
# Capture the URL inside the href attribute
urls = re.findall(r'<a href="(.*?)">', text, re.S)
for each in urls:
    print(each)

html = '''
<html>
<title>Basic Crawler Knowledge</title>
<body>...</body>
</html>
'''
# re.S lets . match newlines as well
print(re.search(r'<title>(.*?)</title>', html, re.S).group(1))

# Generate page URLs by substituting the pn= parameter
pages = 'http://tieba.baidu.com/p/4342201077?pn=1'
for i in range(10):
    print(re.sub(r'pn=\d+', f'pn={i}', pages))
</code>

XPath

XPath is a tree‑based query language for XML/HTML that lets you locate nodes precisely, making complex page structures easier to navigate than regex.

Basic syntax includes // to select matching nodes anywhere in the document, / to step down to child nodes, predicates such as [@id='test'], and functions such as text() or string(.).

<code>from lxml import etree
html = '''
<div id="test1">content1</div>
<div id="test2">content2</div>
<div id="test3">content3</div>
'''
selector = etree.HTML(html)
content = selector.xpath('//div[starts-with(@id,"test")]/text()')
for each in content:
    print(each)

html2 = '''
<div id="class">Hello,<font color=red>my</font> world!</div>
'''
selector = etree.HTML(html2)
info = selector.xpath('//div[@id="class"]')[0]
print(info.xpath('string(.)').replace('\n',''))
</code>
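XPath can also pull out attribute values directly with @name, which is handy for harvesting links. A short sketch using made-up example.com URLs:

```python
from lxml import etree

html = '''
<ul>
  <li><a href="http://example.com/a">A</a></li>
  <li><a href="http://example.com/b">B</a></li>
</ul>
'''
selector = etree.HTML(html)
# @href selects the attribute's value itself, not the element
links = selector.xpath('//li/a/@href')
texts = selector.xpath('//li/a/text()')
for text, href in zip(texts, links):
    print(text, href)
```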

These examples demonstrate how to combine Requests for downloading pages with regex or XPath for extracting the data you need.
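Putting the pieces together, one workable pattern is a small extraction function that works on any HTML string, whether downloaded with Requests or supplied inline. A sketch (the helper name and sample HTML are illustrative, not from the original):

```python
import re
from lxml import etree

def extract_title(html):
    """Return the <title> text, trying regex first, then XPath."""
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    if match:
        return match.group(1).strip()
    titles = etree.HTML(html).xpath('//title/text()')
    return titles[0].strip() if titles else None

# In a real crawl you would feed it a downloaded page, e.g.:
#   import requests
#   print(extract_title(requests.get('http://www.baidu.com').text))
sample = '<html><title>Basic Crawler Knowledge</title><body>...</body></html>'
print(extract_title(sample))  # Basic Crawler Knowledge
```

Keeping the download and extraction steps separate makes the extraction logic easy to test without touching the network.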

- END -

Tags: Data Extraction · regex · web crawling · Requests · XPath
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
