Master Python XML Parsing: From Basics to Advanced Node Extraction
This guide explains what XML is, compares it with HTML, and provides step‑by‑step Python code using xml.dom.minidom to read XML files, access nodes, attributes, and extract inner data, helping beginners grasp XML parsing techniques.
Many Python XML tutorials simply attach an XML file and a processing script, which is not helpful for beginners. This article summarizes several practical methods for reading XML files with Python.
What is XML?
XML (eXtensible Markup Language) is a markup language for tagging and structuring data. It lets users define their own tags, which makes it a base language for building custom markup.
Example XML file (abc.xml):
<?xml version="1.0" encoding="utf-8"?>
<catalog>
<maxid>4</maxid>
<login username="pytest" passwd='123456'>
<caption>Python</caption>
<item id="4">
<caption>测试</caption>
</item>
</login>
<item id="2">
<caption>Zope</caption>
</item>
</catalog>
Structurally, XML resembles HTML, but their purposes differ: HTML focuses on data presentation, while XML is designed for data transport and storage, emphasizing content over appearance.
Key characteristics of XML:
Elements are defined by tag pairs, e.g., <aa></aa>.
Tags can have attributes, e.g., <aa id='123'></aa>.
Tag pairs can enclose data, e.g., <aa>abc</aa>.
Tags can be nested to create hierarchical structures.
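The four characteristics above can be seen in a one-line document. This is a minimal sketch: the SNIPPET string and the <bb> tag are made up for illustration, and parseString is used instead of parse so that no file is needed.

```python
import xml.dom.minidom

# Hypothetical snippet: reuses the <aa> tag from the list, <bb> is invented here.
SNIPPET = "<aa id='123'><bb>abc</bb></aa>"

dom = xml.dom.minidom.parseString(SNIPPET)
outer = dom.documentElement      # the <aa> element, defined by a tag pair
inner = outer.firstChild         # the nested <bb> element

print(outer.tagName)             # aa
print(outer.getAttribute('id'))  # 123  (attribute on the tag)
print(inner.firstChild.data)     # abc  (data enclosed by <bb>...</bb>)
print(inner.parentNode is outer) # True (nesting forms a hierarchy)
```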
Reading XML with Python
Below is a basic script using xml.dom.minidom to open and inspect an XML document.
#coding=utf-8
import xml.dom.minidom
# Open the XML file
dom = xml.dom.minidom.parse('abc.xml')
# Get the document element (root)
root = dom.documentElement
print(root.nodeName)
print(root.nodeValue)
print(root.nodeType)
print(root.ELEMENT_NODE)
The xml.dom.minidom module provides the DOM API for XML handling. parse() loads the file into a DOM object, and documentElement returns the root element.
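To try the same inspection without a file on disk, a sketch like the following may help. It assumes a shortened, hypothetical stand-in for abc.xml held in a string and parsed with parseString.

```python
import xml.dom.minidom

# Shortened stand-in for abc.xml (an assumption, not the full sample file).
SAMPLE = "<catalog><maxid>4</maxid></catalog>"

dom = xml.dom.minidom.parseString(SAMPLE)
root = dom.documentElement

print(root.nodeName)                       # catalog
print(root.nodeValue)                      # None
print(root.nodeType == root.ELEMENT_NODE)  # True
print(dom.nodeType == dom.DOCUMENT_NODE)   # True: dom is the document itself
```

The root prints None for nodeValue because it is an element node; the text lives in child text nodes.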
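Each node has attributes such as nodeName (the tag name for elements), nodeValue (the text content for text nodes; it is None for element nodes, which is why the script above prints None for the root), and nodeType (an integer identifying the node's type). Common node type constants include: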
'ATTRIBUTE_NODE'
'CDATA_SECTION_NODE'
'COMMENT_NODE'
'DOCUMENT_FRAGMENT_NODE'
'DOCUMENT_NODE'
'DOCUMENT_TYPE_NODE'
'ELEMENT_NODE'
'ENTITY_NODE'
'ENTITY_REFERENCE_NODE'
'NOTATION_NODE'
'PROCESSING_INSTRUCTION_NODE'
'TEXT_NODE'
Reference: Node Types – Named Constants
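As a rough illustration of how these constants show up in practice, the sketch below lists the node types of a root element's children. The SAMPLE string and the NAMES lookup table are assumptions made for this example. Note that the whitespace between tags is itself stored as text nodes.

```python
import xml.dom.minidom
from xml.dom import Node

# Hypothetical document with indentation and a comment between the tags.
SAMPLE = """<catalog>
  <!-- a comment -->
  <maxid>4</maxid>
</catalog>"""

dom = xml.dom.minidom.parseString(SAMPLE)
root = dom.documentElement

# Map a few numeric nodeType values back to their constant names.
NAMES = {
    Node.ELEMENT_NODE: 'ELEMENT_NODE',
    Node.TEXT_NODE: 'TEXT_NODE',
    Node.COMMENT_NODE: 'COMMENT_NODE',
}

kinds = [NAMES.get(child.nodeType, 'other') for child in root.childNodes]
print(kinds)

# Keep only the element children, skipping whitespace text and comments.
elements = [c for c in root.childNodes if c.nodeType == Node.ELEMENT_NODE]
print([e.nodeName for e in elements])
```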
Obtaining Child Elements
To retrieve child tags like maxid or login, use getElementsByTagName:
#coding=utf-8
import xml.dom.minidom
dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement
bb = root.getElementsByTagName('maxid')
b = bb[0]
print(b.nodeName)
bb = root.getElementsByTagName('login')
b = bb[0]
print(b.nodeName)
When multiple tags share the same name (e.g., several caption elements), you can index the returned list:
#coding=utf-8
import xml.dom.minidom
dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement
captions = root.getElementsByTagName('caption')
third_caption = captions[2]
print(third_caption.nodeName)
items = root.getElementsByTagName('item')
second_item = items[1]
print(second_item.nodeName)
Getting Attribute Values
Use getAttribute to read an element’s attribute:
#coding=utf-8
import xml.dom.minidom
dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement
login = root.getElementsByTagName('login')[0]
username = login.getAttribute('username')
print(username)
passwd = login.getAttribute('passwd')
print(passwd)
item = root.getElementsByTagName('item')[0]
# Named item_id to avoid shadowing Python's built-in id()
item_id = item.getAttribute('id')
print(item_id)
Extracting Text Between Tags
Two common approaches retrieve the inner text of an element.
Method 1: Use the firstChild.data property.
#coding=utf-8
import xml.dom.minidom
dom = xml.dom.minidom.parse('abc.xml')
root = dom.documentElement
captions = dom.getElementsByTagName('caption')
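for c in captions:
    print(c.firstChild.data)
Method 2: Walk the element's childNodes and collect the data of each text node, combined with getElementsByTagName to reach nested elements (other libraries offer similar helpers, such as findall in xml.etree.ElementTree). This method is more flexible for deeper hierarchies, and it also copes with empty elements, where firstChild is None and Method 1 raises an AttributeError.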
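One possible shape for Method 2 is sketched below. The get_text helper and the SAMPLE string are hypothetical names introduced for this example, not part of minidom.

```python
import xml.dom.minidom
from xml.dom import Node

# Hypothetical document: the second caption is deliberately empty.
SAMPLE = "<catalog><caption>Zope</caption><caption></caption></catalog>"

def get_text(element):
    """Join the data of all text nodes directly under element."""
    return ''.join(
        child.data
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )

dom = xml.dom.minidom.parseString(SAMPLE)
texts = [get_text(c) for c in dom.getElementsByTagName('caption')]
print(texts)  # ['Zope', '']
```

Because the helper iterates over childNodes instead of reading firstChild, an empty element simply yields an empty string.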
These examples demonstrate how to navigate an XML document, access node names, attributes, and inner data using Python’s standard DOM API.
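Putting those pieces together, the following sketch reads every item's id and caption from the sample catalog. It assumes the document is held in a string (SAMPLE, with whitespace removed) rather than in abc.xml. It also shows one thing to watch for: getElementsByTagName returns all matching descendants in document order, not only direct children.

```python
import xml.dom.minidom

# The sample catalog from this article, compacted into a string (assumption).
SAMPLE = (
    "<catalog>"
    "<maxid>4</maxid>"
    "<login username='pytest' passwd='123456'>"
    "<caption>Python</caption>"
    "<item id='4'><caption>测试</caption></item>"
    "</login>"
    "<item id='2'><caption>Zope</caption></item>"
    "</catalog>"
)

dom = xml.dom.minidom.parseString(SAMPLE)
root = dom.documentElement

# All <item> descendants, including the one nested inside <login>.
all_items = root.getElementsByTagName('item')
all_ids = [item.getAttribute('id') for item in all_items]

# Only the <item> elements that are direct children of the root.
top_ids = [i.getAttribute('id') for i in all_items if i.parentNode is root]

# Map each item's id attribute to the text of its caption.
captions = {
    item.getAttribute('id'):
        item.getElementsByTagName('caption')[0].firstChild.data
    for item in all_items
}

print(all_ids)   # ['4', '2']
print(top_ids)   # ['2']
print(captions)  # {'4': '测试', '2': 'Zope'}
```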
Hope this article helps you work with XML in Python.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
