
XPath Basics and lxml Usage in Python

This article introduces the fundamentals of XPath syntax, common rules, and example expressions, then explains how to use the lxml library in Python for HTML/XML parsing, including practical tips and a complete code example for extracting links and text from a sample document.

Python Programming Learning Circle

Dec 19, 2020

XPath (XML Path Language) is a query language for locating nodes in a document tree. It was designed for XML but works equally well on parsed HTML.

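Common XPath rules cover four areas. Text is retrieved with a/text(), which returns only the text directly inside a, or a//text(), which also returns text from nested elements. Nodes are selected with / (direct child, or the document root when it starts the expression), // (any depth), . (the current node), and .. (the parent node). Attributes are accessed with @, and the wildcards * and @* match any element and any attribute respectively. The original article presents these expressions in tables alongside their descriptions.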
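As a rough illustration of these rules, the sketch below runs each kind of expression against a tiny document using lxml. The markup and variable names are made up for this example and do not come from the original article.

```python
from lxml import etree

# Hypothetical sample markup, invented for illustration.
doc = etree.fromstring(
    '<div id="box">'
    '<a href="x.html">outer <b>inner</b> tail</a>'
    '<p class="note">para</p>'
    '</div>'
)

# a/text() returns only the text nodes that are direct children of <a>.
direct_text = doc.xpath('a/text()')    # ['outer ', ' tail']

# a//text() also descends into nested elements such as <b>.
all_text = doc.xpath('a//text()')      # ['outer ', 'inner', ' tail']

# // searches at any depth; @ selects an attribute value.
hrefs = doc.xpath('//a/@href')

# .. steps up from <b> to its parent element.
parent_tag = doc.xpath('//b/..')[0].tag

# * matches any element; @* matches any attribute.
child_tags = [el.tag for el in doc.xpath('/div/*')]
all_attrs = doc.xpath('//@*')

print(direct_text, all_text, hrefs, parent_tag, child_tags, all_attrs)
```

Note that a relative expression such as a/text() is evaluated from the element on which xpath() is called, here the div root.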

Typical XPath examples demonstrate selecting the first book element in a bookstore (/bookstore/book[1]), the last book (/bookstore/book[last()]), elements with specific attribute values (//title[@lang='eng']), and combining node sets (//book/title | //price).
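These expressions can be tried against a small document. The bookstore XML below is a hypothetical sample written for this sketch, so the titles and prices are placeholders rather than data from the article.

```python
from lxml import etree

# Hypothetical bookstore document, invented for illustration.
xml = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>12.50</price></book>
</bookstore>"""
root = etree.fromstring(xml)

# XPath positions start at 1, so book[1] is the first <book>.
first_title = root.xpath('/bookstore/book[1]/title/text()')

# last() evaluates to the position of the final <book>.
last_title = root.xpath('/bookstore/book[last()]/title/text()')

# A predicate on an attribute keeps only the matching <title> elements.
eng_titles = root.xpath("//title[@lang='eng']/text()")

# The | operator merges two node sets into one result list.
union_tags = [el.tag for el in root.xpath('//book/title | //price')]

print(first_title, last_title, eng_titles, union_tags)
```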

The lxml library in Python can parse and clean HTML, but it may also modify the markup; the article advises inspecting the corrected HTML with etree.tostring before writing XPath queries. It recommends a workflow of grouping elements, iterating over groups, and extracting data to avoid mismatches, and notes that lxml accepts both bytes and str inputs.
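A minimal sketch of that inspection step follows, assuming a deliberately broken fragment (the `broken` string is invented here). It parses the same markup once as str and once as bytes, then serializes the result so the repaired structure is visible before any XPath is written.

```python
from lxml import etree

# Hypothetical fragment with unclosed <li> tags, invented for illustration.
broken = '<ul><li>one<li>two</ul>'

# etree.HTML accepts a str or a bytes object.
tree_from_str = etree.HTML(broken)
tree_from_bytes = etree.HTML(broken.encode('utf-8'))

# Serialize the parsed tree to see what lxml actually built.
repaired = etree.tostring(tree_from_str).decode()
print(repaired)

# The parser returns an <html> root and separates the two list items,
# so XPath should be written against this structure, not the raw input.
print(tree_from_str.tag)
print(len(tree_from_str.xpath('//li')), len(tree_from_bytes.xpath('//li')))
```

One caveat worth knowing: a str that begins with an XML declaration containing an encoding attribute is rejected by lxml with a ValueError, and passing bytes avoids that.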

Below is a complete example that parses a small HTML fragment, extracts the href attributes and text of li elements with class item-1, and builds a list of dictionaries containing the URL and title for each item.

from lxml import etree

text = ''' <div> <ul>
            <li class="item-1"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
            </ul> </div> '''

html = etree.HTML(text)

print(html) # <Element html at 0x...>
print(etree.tostring(html).decode())

# Get href of a under li with class "item-1"
ret1 = html.xpath('//li[@class="item-1"]/a/@href')
print(ret1)

# Get text of a under li with class "item-1"
ret2 = html.xpath("//li[@class='item-1']/a/text()")
print(ret2)

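# Build dicts by pairing the two result lists by position.
# Fragile: if any <a> lacks an href or text, the lists fall out of step.
for url, title in zip(ret1, ret2):
    item = {'url': url, 'title': title}
    print(item)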

# Improved version: select each <li> first, then extract within it
ret3 = html.xpath('//li[@class="item-1"]')
for i in ret3:
    hrefs = i.xpath('./a/@href')
    texts = i.xpath('./a/text()')  # text() needs parentheses; './a/text' matches nothing
    item = {}
    item['url'] = hrefs[0] if hrefs else None
    item['title'] = texts[0] if texts else None
    print(item)
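To see why grouping first matters, the sketch below uses a hypothetical fragment in which one link has no text. The helper `first_or_none` is a name introduced only for this example, not part of lxml.

```python
from lxml import etree

# Hypothetical markup: the second matching <li> has a link with no text.
html = etree.HTML('''
<ul>
  <li class="item-1"><a href="a.html">first</a></li>
  <li class="item-1"><a href="b.html"></a></li>
  <li class="item-1"><a href="c.html">third</a></li>
</ul>''')

# Two independent queries produce lists of different lengths.
urls = html.xpath('//li[@class="item-1"]/a/@href')
titles = html.xpath('//li[@class="item-1"]/a/text()')

# Pairing by position attaches "third" to b.html and drops c.html.
paired = list(zip(urls, titles))


def first_or_none(values):
    """Return the first XPath result, or None when the list is empty."""
    return values[0] if values else None


# Grouping by <li> keeps each URL with its own title (or None).
grouped = [
    {
        'url': first_or_none(li.xpath('./a/@href')),
        'title': first_or_none(li.xpath('./a/text()')),
    }
    for li in html.xpath('//li[@class="item-1"]')
]

print(paired)
print(grouped)
```

With the positional pairing the missing title goes unnoticed, whereas the grouped version records it explicitly as None.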

The article concludes with a disclaimer linking to the original source on Juejin.

