XPath Basics and Web Scraping with Python lxml: Concepts, Syntax, and Practical Examples
This tutorial explains the fundamental concepts and parsing principles of XPath, shows how to set up the Python lxml environment, demonstrates instantiating etree objects, details XPath expression syntax, and provides multiple real‑world web‑scraping examples with complete code snippets.
XPath is a widely used, concise, and efficient parsing method for HTML and XML documents, offering strong versatility for data extraction.
To parse with XPath in Python, you first instantiate an etree object by loading the page source, then call its xpath method with an appropriate expression to locate tags and capture content.
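As a minimal sketch of that workflow (using an inline HTML string in place of a downloaded page; the tag and class names here are made up for illustration):

```python
from lxml import etree

# An inline HTML snippet standing in for a downloaded page source
page_text = '''
<html>
  <body>
    <div class="song">
      <p>first</p>
      <p>second</p>
    </div>
  </body>
</html>
'''

# Step 1: instantiate the etree object from raw HTML text
tree = etree.HTML(page_text)

# Step 2: locate tags with an XPath expression and capture their text
titles = tree.xpath('//div[@class="song"]/p/text()')
print(titles)  # ['first', 'second']
```

In a real scraper the only difference is where `page_text` comes from: typically `requests.get(url).text`, as the examples below show.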
Environment installation
<code>pip install lxml</code>
How to instantiate an etree object:
<code>from lxml import etree</code>
Load a local HTML file into an etree object (pass an HTML parser explicitly, since the default XML parser is strict about malformed HTML):
<code>etree.parse(filePath, etree.HTMLParser())</code>
Or build one from raw HTML text obtained from the internet (pass the string variable itself, not a quoted literal):
<code>etree.HTML(page_text)</code>
XPath expression basics
/ – start from the root node (single level).
// – select nodes at any depth (multiple levels).
Attribute selection: //div[@class='song'] (general form: tag[@attrName='attrValue']).
Index selection: //div[@class='song']/p[3] (indices start at 1, not 0).
Text extraction: /text() for a node's direct text, //text() for all descendant text.
Attribute extraction: /@attrName (e.g., img/@src).
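These expressions can be tried against a small in-memory document before pointing them at a live site (the snippet and its class names are invented for illustration):

```python
from lxml import etree

html = '''
<html><body>
  <div class="song">
    <p>one</p><p>two</p><p>three</p>
    <img src="/img/a.jpg" alt="cover"/>
  </div>
</body></html>
'''
tree = etree.HTML(html)

# // selects at any depth; [@class='song'] filters by attribute
div = tree.xpath('//div[@class="song"]')[0]

# Index selection starts at 1, not 0
third = tree.xpath('//div[@class="song"]/p[3]/text()')[0]
print(third)  # three

# /@attrName extracts an attribute value
src = tree.xpath('//div[@class="song"]/img/@src')[0]
print(src)  # /img/a.jpg

# .//text() gathers all descendant text relative to the current node
all_text = [t.strip() for t in div.xpath('.//text()') if t.strip()]
print(all_text)  # ['one', 'two', 'three']
```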
Example 1 – Scrape second‑hand house listings from 58.com
<code>from lxml import etree
import requests

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://xa.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each listing sits in a div under the results section
    div_list = tree.xpath('//section[@class="list"]/div')
    # Output file: "58同城二手房.txt" ("58.com second-hand houses")
    fp = open('./58同城二手房.txt', 'w', encoding='utf-8')
    for div in div_list:
        # The leading .// makes the expression relative to the current div
        title = div.xpath('.//div[@class="property-content-title"]/h3/text()')[0]
        print(title)
        fp.write(title + '\n\n')
    fp.close()
</code>
Example 2 – Download images from pic.netbian.com
<code>import requests, os
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://pic.netbian.com/4kmeinv/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each li wraps an a element, so the list items are a elements
    a_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for a in a_list:
        detail_url = 'https://pic.netbian.com' + a.xpath('./img/@src')[0]
        detail_name = a.xpath('./img/@alt')[0] + '.jpg'
        # The site serves GBK pages; re-encode to undo requests' latin-1 guess
        detail_name = detail_name.encode('iso-8859-1').decode('GBK')
        detail_path = './piclibs/' + detail_name
        detail_data = requests.get(url=detail_url, headers=headers).content
        with open(detail_path, 'wb') as fp:
            fp.write(detail_data)
        print(detail_name, 'success!!')
</code>
Example 3 – Retrieve city names from aqistudy.cn
<code>import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # | unions two expressions: the hot-city list and the full city list
    a_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
    fp = open('./citys.txt', 'w', encoding='utf-8')
    i = 0
    for a in a_list:
        city_name = a.xpath('.//a/text()')[0]
        fp.write(city_name + '\t')
        i += 1
        if i == 6:  # start a new line after every six city names
            i = 0
            fp.write('\n')
    fp.close()
    print('爬取成功')  # "scraping succeeded"
</code>
Example 4 – Scrape resume templates from sc.chinaz.com
<code>import requests, os
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="box col3 ws_block"]/a')
    # Output directory: "简历模板" ("resume templates")
    if not os.path.exists('./简历模板'):
        os.mkdir('./简历模板')
    for a in a_list:
        detail_url = 'https:' + a.xpath('./@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_tree = etree.HTML(detail_page_text)
        detail_a_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a')
        # Inner loop variable renamed to avoid shadowing the outer a
        for download_a in detail_a_list:
            download_name = detail_tree.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0]
            download_url = download_a.xpath('./@href')[0]
            download_data = requests.get(url=download_url, headers=headers).content
            download_path = './简历模板/' + download_name + '.rar'
            with open(download_path, 'wb') as fp:
                fp.write(download_data)
            print(download_name, 'success!!')
</code>
Disclaimer
This article is compiled from online sources; the original author retains copyright. If any content is inaccurate or infringes rights, please contact us for removal or authorization.