XPath Basics and Web Scraping with Python lxml: Concepts, Syntax, and Practical Examples
This tutorial explains the fundamental concepts and parsing principles of XPath, shows how to set up the Python lxml environment, demonstrates instantiating etree objects, details XPath expression syntax, and provides multiple real‑world web‑scraping examples with complete code snippets.
XPath is a widely used, concise, and efficient parsing method for HTML and XML documents, offering strong versatility for data extraction.
To parse with XPath in Python, you first instantiate an etree object by loading the page source, then call its xpath method with an appropriate expression to locate tags and capture content.
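As a minimal sketch of that workflow (using an inline HTML string in place of a downloaded page; the tag and class names here are made up for illustration):

```python
from lxml import etree

# An inline HTML snippet standing in for a downloaded page source
page_text = '''
<html>
  <body>
    <div class="song">
      <p>first</p>
      <p>second</p>
    </div>
  </body>
</html>
'''

# Step 1: instantiate the etree object from raw HTML text
tree = etree.HTML(page_text)

# Step 2: locate tags with an XPath expression and capture their text
titles = tree.xpath('//div[@class="song"]/p/text()')
print(titles)  # ['first', 'second']
```

In a real scraper the only difference is where `page_text` comes from: typically `requests.get(url).text`, as the examples below show.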
Environment installation
<code>pip install lxml</code>
How to instantiate an etree object:
<code>from lxml import etree</code>
Load a local HTML file into an etree object (pass an HTML parser explicitly, since the default XML parser is strict about malformed HTML):
<code>etree.parse(filePath, etree.HTMLParser())</code>
Or build one from raw HTML text obtained from the internet (pass the string variable itself, not a quoted literal):
<code>etree.HTML(page_text)</code>
XPath expression basics
/ – start from the root node (single level).
// – select nodes at any depth (multiple levels).
Attribute selection: //div[@class='song'] (general form: tag[@attrName='attrValue']).
Index selection: //div[@class='song']/p[3] (indices start at 1, not 0).
Text extraction: /text() for a node's direct text, //text() for all descendant text.
Attribute extraction: /@attrName (e.g., img/@src).
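These expressions can be tried against a small in-memory document before pointing them at a live site (the snippet and its class names are invented for illustration):

```python
from lxml import etree

html = '''
<html><body>
  <div class="song">
    <p>one</p><p>two</p><p>three</p>
    <img src="/img/a.jpg" alt="cover"/>
  </div>
</body></html>
'''
tree = etree.HTML(html)

# // selects at any depth; [@class='song'] filters by attribute
div = tree.xpath('//div[@class="song"]')[0]

# Index selection starts at 1, not 0
third = tree.xpath('//div[@class="song"]/p[3]/text()')[0]
print(third)  # three

# /@attrName extracts an attribute value
src = tree.xpath('//div[@class="song"]/img/@src')[0]
print(src)  # /img/a.jpg

# .//text() gathers all descendant text relative to the current node
all_text = [t.strip() for t in div.xpath('.//text()') if t.strip()]
print(all_text)  # ['one', 'two', 'three']
```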
Example 1 – Scrape second‑hand house listings from 58.com
<code>from lxml import etree
import requests

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://xa.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each listing sits in a div under the results section
    div_list = tree.xpath('//section[@class="list"]/div')
    # Output file: "58同城二手房.txt" ("58.com second-hand houses")
    fp = open('./58同城二手房.txt', 'w', encoding='utf-8')
    for div in div_list:
        # The leading .// makes the expression relative to the current div
        title = div.xpath('.//div[@class="property-content-title"]/h3/text()')[0]
        print(title)
        fp.write(title + '\n\n')
    fp.close()
</code>
Example 2 – Download images from pic.netbian.com
<code>import requests, os
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://pic.netbian.com/4kmeinv/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each li wraps an a element, so the list items are a elements
    a_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for a in a_list:
        detail_url = 'https://pic.netbian.com' + a.xpath('./img/@src')[0]
        detail_name = a.xpath('./img/@alt')[0] + '.jpg'
        # The site serves GBK pages; re-encode to undo requests' latin-1 guess
        detail_name = detail_name.encode('iso-8859-1').decode('GBK')
        detail_path = './piclibs/' + detail_name
        detail_data = requests.get(url=detail_url, headers=headers).content
        with open(detail_path, 'wb') as fp:
            fp.write(detail_data)
        print(detail_name, 'success!!')
</code>
Example 3 – Retrieve city names from aqistudy.cn
<code>import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # | unions two expressions: the hot-city list and the full city list
    a_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
    fp = open('./citys.txt', 'w', encoding='utf-8')
    i = 0
    for a in a_list:
        city_name = a.xpath('.//a/text()')[0]
        fp.write(city_name + '\t')
        i += 1
        if i == 6:  # start a new line after every six city names
            i = 0
            fp.write('\n')
    fp.close()
    print('爬取成功')  # "scraping succeeded"
</code>
Example 4 – Scrape resume templates from sc.chinaz.com
<code>import requests, os
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="box col3 ws_block"]/a')
    # Output directory: "简历模板" ("resume templates")
    if not os.path.exists('./简历模板'):
        os.mkdir('./简历模板')
    for a in a_list:
        detail_url = 'https:' + a.xpath('./@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_tree = etree.HTML(detail_page_text)
        detail_a_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a')
        # Inner loop variable renamed to avoid shadowing the outer a
        for download_a in detail_a_list:
            download_name = detail_tree.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0]
            download_url = download_a.xpath('./@href')[0]
            download_data = requests.get(url=download_url, headers=headers).content
            download_path = './简历模板/' + download_name + '.rar'
            with open(download_path, 'wb') as fp:
                fp.write(download_data)
            print(download_name, 'success!!')
</code>
Disclaimer
This article is compiled from online sources; the original author retains copyright. If any content is inaccurate or infringes rights, please contact us for removal or authorization.