Master Web Crawling with Python: From Requests to XPath Extraction
This guide walks you through the fundamentals of building a web crawler in Python: fetching pages with the Requests library, extracting data with regular expressions and XPath, and practical code examples for each step.
There are countless complex tutorials about web crawling, but the core idea is simple: download a page’s source code and extract the needed information.
The process consists of two parts: crawling and extraction.
How to Crawl?
The most popular Python library for HTTP requests is Requests, described as "HTTP for Humans". It supports connection pooling, cookies, file uploads, automatic encoding detection, and internationalized URLs.
Installing Requests
```shell
pip install requests
```

All Python packages can be installed with pip. If the command is unavailable, add C:\Python27\Scripts to your PATH.
Using Requests
The basic calls are requests.get() and requests.post(). Additional methods such as head() and delete() exist but are used less often.
Responses provide .text (the body decoded to text using the detected encoding) and .content (the raw bytes).
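The relationship between the two can be illustrated without a network call: .text is simply .content decoded with the response's encoding. A sketch, with hard-coded bytes standing in for a real response body:

```python
# Simulate what a Response object holds: .content is raw bytes,
# .text is those bytes decoded with the detected encoding.
content = '<title>百度一下</title>'.encode('utf-8')  # stand-in for response.content
text = content.decode('utf-8')                       # what response.text would give
print(type(content))  # <class 'bytes'>
print(text)           # <title>百度一下</title>
```

For binary data such as images, save .content directly; for HTML, work with .text.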
```python
import requests

url = 'http://www.baidu.com'
html = requests.get(url)  # send a GET request and keep the Response object
print(html.text)          # the decoded response body
```

How to Extract?
Regular Expressions
The most common regex pattern for crawling is (.*?), where ( ) captures the desired content, .* matches any characters, and ? makes the match non-greedy.
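To see why the ? matters, compare greedy and non-greedy matching on a snippet with two tags (a minimal sketch):

```python
import re

s = '<b>first</b><b>second</b>'
print(re.findall('<b>(.*)</b>', s))   # greedy: ['first</b><b>second']
print(re.findall('<b>(.*?)</b>', s))  # non-greedy: ['first', 'second']
```

Greedy (.*) swallows everything up to the last </b>, while non-greedy (.*?) stops at the first one, which is almost always what you want when scraping.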
Typical usage in Python’s re module:
```python
import re

# Extract the href value; the pattern must match the source exactly,
# so there are no spaces around the = sign.
text = '<a href="www.baidu.com">...'
urls = re.findall('<a href="(.*?)"', text, re.S)
for each in urls:
    print(each)

# Extract a page title.
html = '''
<html>
<title>Basic Crawler Knowledge</title>
<body>...</body>
</html>
'''
print(re.search('<title>(.*?)</title>', html, re.S).group(1))

# Generate consecutive page URLs by substituting the pn parameter.
pages = 'http://tieba.baidu.com/p/4342201077?pn=1'
for i in range(10):
    print(re.sub(r'pn=\d+', f'pn={i}', pages))
```

XPath
XPath is a tree‑based query language for XML/HTML that lets you locate nodes precisely, making complex page structures easier to navigate than regex.
Basic syntax includes / for selecting direct children, // for selecting matching nodes anywhere in the document, predicates such as [@id='test'], and functions such as text() or string(.).
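Attribute values can be selected the same way with @ in the path. A short sketch (the HTML snippet and URLs here are made up for illustration):

```python
from lxml import etree

snippet = ('<div><a href="http://example.com/page1">p1</a>'
           '<a href="http://example.com/page2">p2</a></div>')
selector = etree.HTML(snippet)
# //a/@href returns the href attribute of every <a> element
links = selector.xpath('//a/@href')
print(links)  # ['http://example.com/page1', 'http://example.com/page2']
```

This is handy for collecting the links a crawler should visit next.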
```python
from lxml import etree

# Select the text of every div whose id starts with "test".
html = '''
<div id="test1">content1</div>
<div id="test2">content2</div>
<div id="test3">content3</div>
'''
selector = etree.HTML(html)
content = selector.xpath('//div[starts-with(@id,"test")]/text()')
for each in content:
    print(each)

# string(.) collects all text inside an element, including nested tags.
html2 = '''
<div id="class">Hello,<font color=red>my</font> world!</div>
'''
selector = etree.HTML(html2)
info = selector.xpath('//div[@id="class"]')[0]
print(info.xpath('string(.)').replace('\n', ''))
```

These examples demonstrate how to combine Requests for downloading pages with regex or XPath for extracting the data you need.
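To compare the two extraction approaches side by side, here is a sketch that applies both to the same snippet (a literal string stands in for the response.text of a downloaded page):

```python
import re
from lxml import etree

# A literal snippet stands in for response.text from a real request.
page = ('<html><title>Demo Page</title>'
        '<body><div id="main">hello</div></body></html>')

# Regex: quick to write, but tied to the exact source text.
title = re.search('<title>(.*?)</title>', page, re.S).group(1)
print(title)  # Demo Page

# XPath: structural, robust to whitespace and attribute order.
selector = etree.HTML(page)
print(selector.xpath('//div[@id="main"]/text()')[0])  # hello
```

For simple, stable patterns regex is enough; for nested or irregular markup, XPath is usually the safer choice.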
- END -