
Comprehensive Guide to Python urllib Library: Modules, Functions, and Usage Examples

This article provides a detailed tutorial on Python's urllib library, covering its main modules (request, error, parse, robotparser), their key functions and classes, and code examples for URL fetching, parsing, encoding, and handling robots.txt, making it a practical resource for backend developers and web scrapers.

Python Programming Learning Circle

Python's urllib library provides tools for handling URLs and fetching web content.

The library consists of several modules: urllib.request for opening and reading URLs, urllib.error for handling exceptions, urllib.parse for parsing and constructing URLs, and urllib.robotparser for interpreting robots.txt files.

urllib.request offers the urlopen function and the Request class, which support custom headers, authentication, and timeout settings. Example:

<code>import urllib.request

# Fetch a page and decode the response body as UTF-8
response = urllib.request.urlopen("https://www.baidu.com")
print(response.read().decode('utf-8'))</code>
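The Request class mentioned above can carry custom headers alongside the URL; a minimal sketch, in which the target URL and User-Agent string are placeholders:

```python
import urllib.request
from urllib.error import URLError

# A Request bundles the URL with custom headers; urlopen accepts it
# plus an optional timeout in seconds.
req = urllib.request.Request(
    "https://www.python.org/",
    headers={"User-Agent": "Mozilla/5.0"},
)
print(req.get_header("User-agent"))  # headers are stored on the Request

try:
    with urllib.request.urlopen(req, timeout=10) as response:
        print(response.status)
except URLError as exc:
    # raised on DNS failures, refused connections, timeouts, etc.
    print("request failed:", exc.reason)
```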

Common methods of the response object include read(), readline(), info(), getcode(), and geturl().
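A sketch of these methods in use; a data: URL keeps the snippet free of network access (note that getcode() only returns a numeric status for http(s) responses):

```python
import urllib.request

# urlopen also understands data: URLs, which embed the body inline
with urllib.request.urlopen("data:text/plain;charset=utf-8,hello%20urllib") as response:
    body = response.read()          # whole body as bytes
    print(body.decode("utf-8"))     # hello urllib
    print(response.geturl())        # the URL that was actually fetched
    print(response.info())          # headers as an email.message.Message
    print(response.getcode())       # HTTP status for http(s); None here
```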

urllib.error defines the URLError and HTTPError exceptions: URLError indicates a network-level failure, while HTTPError (a subclass of URLError) represents an HTTP error status returned by the server.

Example handling:

<code>from urllib import request, error

try:
    response = request.urlopen("http://invalid.url")
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e.code)
except error.URLError as e:
    print(e.reason)</code>

urllib.parse provides functions for URL parsing (urlparse, urlsplit) and construction (urlunparse, urlunsplit), as well as encoding utilities (quote, urlencode, unquote). Example parsing:

<code>from urllib.parse import urlparse
o = urlparse("https://docs.python.org/3/library/urllib.parse.html")
print('scheme:', o.scheme)
print('netloc:', o.netloc)</code>
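The construction side can be sketched with urlunparse, which rebuilds a URL from the same 6-tuple of components that urlparse produces:

```python
from urllib.parse import urlparse, urlunparse

# urlunparse takes (scheme, netloc, path, params, query, fragment)
parts = ("https", "docs.python.org", "/3/library/urllib.parse.html",
         "", "highlight=urlparse", "")
url = urlunparse(parts)
print(url)  # https://docs.python.org/3/library/urllib.parse.html?highlight=urlparse

# Round-tripping through urlparse recovers the components
print(urlparse(url).netloc)  # docs.python.org
```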

Encoding a query string:

<code>from urllib import parse

# urlencode percent-encodes the query parameters ('爬虫' means "web crawler")
query = parse.urlencode({'wd': '爬虫'})
url = f"http://www.baidu.com/s?{query}"
print(url)</code>
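The quote and unquote utilities handle percent-encoding of a single URL component, as opposed to urlencode's whole query string; non-ASCII text is encoded as UTF-8 bytes first:

```python
from urllib.parse import quote, unquote

# quote percent-encodes characters that are unsafe in a URL component;
# unquote reverses it
encoded = quote("爬虫")
print(encoded)           # %E7%88%AC%E8%99%AB
print(unquote(encoded))  # 爬虫
```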

urllib.robotparser parses robots.txt files to determine crawling permissions. It provides methods such as set_url, read, and can_fetch for managing crawl policies.
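A minimal sketch: here the rules are fed in as a list of lines via parse() so the example runs offline; in practice, set_url() followed by read() would fetch the site's robots.txt over the network:

```python
import urllib.robotparser

# Build a parser from inline robots.txt rules
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(useragent, url) answers whether crawling is permitted
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))    # True
print(rp.can_fetch("MyBot", "https://example.com/private/secret.html")) # False
```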

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
