
Master Python XML Parsing: From Basics to Advanced Node Extraction

This guide explains what XML is, compares it with HTML, and provides step‑by‑step Python code using xml.dom.minidom to read XML files, access nodes, attributes, and extract inner data, helping beginners grasp XML parsing techniques.

Python Programming Learning Circle

Many Python XML tutorials simply attach an XML file and a processing script, which is not helpful for beginners. This article summarizes several practical methods for reading XML files with Python.

What is XML?

XML (eXtensible Markup Language) is a markup language for labeling and structuring data. It has no fixed set of tags: users define their own, which makes XML a metalanguage for building custom markup languages.

Example XML file (abc.xml):

<?xml version="1.0" encoding="utf-8"?>
<catalog>
    <maxid>4</maxid>
    <login username="pytest" passwd='123456'>
        <caption>Python</caption>
        <item id="4">
            <caption>测试</caption>
        </item>
    </login>
    <item id="2">
        <caption>Zope</caption>
    </item>
</catalog>

Structurally, XML resembles HTML, but their purposes differ: HTML focuses on data presentation, while XML is designed for data transport and storage, emphasizing content over appearance.

Key characteristics of XML:

Elements are defined by tag pairs, e.g., <aa></aa>.

Tags can have attributes, e.g., <aa id='123'></aa>.

Tag pairs can enclose data, e.g., <aa>abc</aa>.

Tags can be nested to create hierarchical structures.
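The characteristics above can be checked against a tiny inline document. This is a minimal sketch: the aa tag comes from the examples above, while the nested bb tag and the variable names are assumptions made for illustration.

```python
import xml.dom.minidom

# Hypothetical inline document: <aa> carries an attribute and nests a <bb> that holds text.
doc = xml.dom.minidom.parseString("<aa id='123'><bb>abc</bb></aa>")

outer = doc.documentElement                   # the <aa> element (a tag pair)
inner = outer.getElementsByTagName('bb')[0]   # the nested <bb> element

print(outer.tagName)             # name of the outer tag
print(outer.getAttribute('id'))  # value of the id attribute
print(inner.firstChild.data)     # text enclosed by the nested tag pair
```

parseString is used here instead of parse so that the sketch runs without a file on disk.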

Reading XML with Python

Below is a basic script using xml.dom.minidom to open and inspect an XML document.

#coding=utf-8
import xml.dom.minidom

# Open the XML file
dom = xml.dom.minidom.parse('abc.xml')

# Get the document element (root)
root = dom.documentElement
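print(root.nodeName)      # tag name of the root: catalog
print(root.nodeValue)     # None, because element nodes carry no value
print(root.nodeType)      # 1, the numeric code for an element node
print(root.ELEMENT_NODE)  # 1, the constant that nodeType is compared against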

The xml.dom.minidom module is a minimal implementation of the DOM (Document Object Model) interface in Python's standard library. parse() reads the file into a Document object, and its documentElement property returns the root element, here catalog.

Each node exposes properties such as nodeName (the tag name for an element), nodeValue (the text for a text node; None for an element node, which is why the script above prints None), and nodeType (an integer code for the kind of node). The node type constants are:

ATTRIBUTE_NODE

CDATA_SECTION_NODE

COMMENT_NODE

DOCUMENT_FRAGMENT_NODE

DOCUMENT_NODE

DOCUMENT_TYPE_NODE

ELEMENT_NODE

ENTITY_NODE

ENTITY_REFERENCE_NODE

NOTATION_NODE

PROCESSING_INSTRUCTION_NODE

TEXT_NODE

Reference: Node Types – Named Constants
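To make the relationship between nodeName, nodeValue, and nodeType concrete, the sketch below inspects three kinds of node in a small inline document. The inline XML string and the variable names are assumptions chosen for illustration; the constants are read from xml.dom.Node.

```python
import xml.dom.minidom
from xml.dom import Node

# Hypothetical one-line document, so no whitespace text nodes get in the way.
doc = xml.dom.minidom.parseString('<catalog><maxid>4</maxid></catalog>')

root = doc.documentElement   # <catalog> element
maxid = root.firstChild      # <maxid> element
text = maxid.firstChild      # the text node holding "4"

for node in (doc, root, text):
    print(node.nodeName, node.nodeType, repr(node.nodeValue))

# Comparing against the named constants is clearer than comparing against raw integers.
print(doc.nodeType == Node.DOCUMENT_NODE)
print(root.nodeType == Node.ELEMENT_NODE)
print(text.nodeType == Node.TEXT_NODE)
```

Only the text node has a non-None nodeValue, which is why reading an element's text always involves stepping down to one of its children.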

Obtaining Child Elements

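To retrieve tags such as maxid or login, use getElementsByTagName. It searches all descendants of the element it is called on, not only its direct children, and returns the matches as a list: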

#coding=utf-8
import xml.dom.minidom

dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement

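# getElementsByTagName returns a list, so take the first match
maxid_nodes = root.getElementsByTagName('maxid')
maxid = maxid_nodes[0]
print(maxid.nodeName)

login_nodes = root.getElementsByTagName('login')
login = login_nodes[0]
print(login.nodeName)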

When multiple tags share the same name (abc.xml contains three caption elements and two item elements), index the returned list, which is in document order:

#coding=utf-8
import xml.dom.minidom

dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement

captions = root.getElementsByTagName('caption')
third_caption = captions[2]
print(third_caption.nodeName)

items = root.getElementsByTagName('item')
second_item = items[1]
print(second_item.nodeName)
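Because getElementsByTagName also returns nested matches, it can be useful to restrict a lookup to direct children. The sketch below does that by filtering childNodes; direct_children is a hypothetical helper, not part of minidom, and the inline XML is a trimmed variant of abc.xml.

```python
import xml.dom.minidom
from xml.dom import Node

# Trimmed variant of abc.xml: one item nested in login, one directly under catalog.
XML = """<catalog>
    <login username="pytest" passwd="123456">
        <item id="4"><caption>nested</caption></item>
    </login>
    <item id="2"><caption>Zope</caption></item>
</catalog>"""

root = xml.dom.minidom.parseString(XML).documentElement


def direct_children(parent, tag):
    """Return only the element children of parent whose tag name matches (hypothetical helper)."""
    return [node for node in parent.childNodes
            if node.nodeType == Node.ELEMENT_NODE and node.tagName == tag]


all_items = root.getElementsByTagName('item')   # every descendant named item
top_items = direct_children(root, 'item')       # only items directly under catalog

print([item.getAttribute('id') for item in all_items])
print([item.getAttribute('id') for item in top_items])
```

The nodeType check matters because the indentation between tags is stored as whitespace text nodes, which have no tagName.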

Getting Attribute Values

Use getAttribute to read an element’s attribute:

#coding=utf-8
import xml.dom.minidom

dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement

login = root.getElementsByTagName('login')[0]
username = login.getAttribute('username')
print(username)
passwd = login.getAttribute('passwd')
print(passwd)

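# The first item in document order is the one nested inside login (id="4")
item = root.getElementsByTagName('item')[0]
item_id = item.getAttribute('id')  # named item_id to avoid shadowing the built-in id()
print(item_id)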
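getAttribute returns an empty string when the attribute does not exist, so a missing attribute and an empty one look the same. The sketch below shows one way to tell them apart with hasAttribute and to list every attribute at once. The email attribute name is an assumption used only to demonstrate the missing case.

```python
import xml.dom.minidom

# Inline copy of the login element from abc.xml, without its children.
doc = xml.dom.minidom.parseString('<login username="pytest" passwd="123456"/>')
login = doc.documentElement

username = login.getAttribute('username')
missing = login.getAttribute('email')      # hypothetical attribute that is not present
has_email = login.hasAttribute('email')

# attributes is a map-like object; items() yields (name, value) pairs.
all_attrs = dict(login.attributes.items())

print(username)
print(repr(missing))
print(has_email)
print(all_attrs)
```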

Extracting Text Between Tags

Two common approaches retrieve the inner text of an element.

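Method 1: Use firstChild.data. The text between a tag pair is stored as a child text node, so firstChild returns that node and data returns its string. This works when the element's first child is text, as it is for every caption in abc.xml.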

#coding=utf-8
import xml.dom.minidom

dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement

captions = dom.getElementsByTagName('caption')
for c in captions:
    print(c.firstChild.data)
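firstChild.data raises an AttributeError when an element is empty, because firstChild is then None. A defensive variant might look like the sketch below; first_text is a hypothetical helper and the inline XML with an empty caption is an assumption added to trigger that case.

```python
import xml.dom.minidom
from xml.dom import Node

# Hypothetical document in which the second caption is empty.
doc = xml.dom.minidom.parseString(
    '<catalog><caption>Python</caption><caption></caption></catalog>')


def first_text(element, default=''):
    """Return the leading text of element, or default if there is none (hypothetical helper)."""
    child = element.firstChild
    if child is not None and child.nodeType == Node.TEXT_NODE:
        return child.data
    return default


texts = [first_text(caption) for caption in doc.getElementsByTagName('caption')]
print(texts)
```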

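Method 2: Walk the element's childNodes and collect the data of each text node. This takes more code than firstChild.data, but it also handles elements whose text is mixed with nested tags or split across several child nodes. Other standard-library modules offer comparable lookups, for example findall in xml.etree.ElementTree.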
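Method 2 can be sketched as a recursive walk over childNodes. The collect_text helper and the inline XML with a nested b tag are assumptions made for illustration; the last lines show a comparable lookup with xml.etree.ElementTree, a different standard-library module, only for comparison.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET
from xml.dom import Node

# Hypothetical document: the caption mixes plain text with a nested <b> element.
XML = '<catalog><item id="2"><caption>Zope <b>server</b></caption></item></catalog>'


def collect_text(node):
    """Concatenate all text found under node, descending into nested elements (hypothetical helper)."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(collect_text(child))
    return ''.join(parts)


doc = xml.dom.minidom.parseString(XML)
caption = doc.getElementsByTagName('caption')[0]

leading_only = caption.firstChild.data   # Method 1 stops at the first text node
full_text = collect_text(caption)        # Method 2 gathers everything below caption

# Comparable lookup with ElementTree: findall takes a path relative to the root element.
et_root = ET.fromstring(XML)
et_texts = [''.join(c.itertext()) for c in et_root.findall('item/caption')]

print(repr(leading_only))
print(repr(full_text))
print(et_texts)
```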

These examples demonstrate how to navigate an XML document, access node names, attributes, and inner data using Python’s standard DOM API.

Hope this article helps you work with XML in Python.
