Master Python’s urllib: From Basics to Advanced Web Scraping
Learn how to use Python’s built-in urllib library for web requests, handling GET/POST, adding headers, managing proxies, processing cookies, handling errors, parsing URLs, and respecting robots.txt, with clear code examples and a practical case of scraping a novel site.
This article explains the common usage of Python’s built‑in urllib library, covering its definition, main modules, and a practical urllib + lxml crawling example.
What is urllib
urllib is a standard Python library for HTTP requests that requires no installation. It provides functions for web requests, response retrieval, proxy and cookie settings, exception handling, and URL parsing.
urllib Modules
The library consists of the following modules:
urllib.request # request module
urllib.error # error handling module
urllib.parse # parsing module
urllib.robotparser # robot parser module
Below are typical usages of these modules.
Basic Request
To open a URL you can use:
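urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The cafile, capath, and cadefault parameters were removed in Python 3.13; use context instead. The simplest GET request passes only the URL and reads the response content. The most commonly used form is:
urllib.request.urlopen(url, data, timeout)
Parameters: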
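url – request address (a URL string or a Request object)
data – request payload as bytes; when it is supplied, the request is sent as POST by default
timeout – request timeout in seconds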
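A GET request usually carries its parameters in the query string rather than in data. The sketch below is one way to put that together. The helper names build_url and fetch_text are hypothetical, and https://example.com/search is a placeholder address rather than anything from this article. The network call sits inside a function, so nothing is sent until fetch_text is called.

```python
from urllib import parse, request


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return f"{base}?{parse.urlencode(params)}"


def fetch_text(url, timeout=5.0, encoding="utf-8"):
    """Send a GET request and return the decoded response body.

    With data left as None, urlopen issues a GET request.
    """
    with request.urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


# Placeholder URL and parameters, chosen only for illustration.
search_url = build_url("https://example.com/search", {"q": "urllib", "page": 2})
print(search_url)  # https://example.com/search?q=urllib&page=2
```

Calling fetch_text(search_url) would then perform the actual request. Using the response as a context manager closes the connection even if decoding fails.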
For a POST request, provide the data argument (as bytes) and set method='POST':
from urllib import request, parse

url = 'https://book.qidian.com/info/1014243481#Catalog'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Host': 'book.qidian.com'
}
data = {'hw': 'hw'}
# urlencode() returns a str, so convert it to bytes for the request body
data = bytes(parse.urlencode(data), encoding='utf8')
# Request() has no timeout parameter; the timeout is passed to urlopen()
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req, timeout=2)
print(response.read().decode('utf-8'))
Adding Headers
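Headers passed to the Request constructor can be checked before anything is sent. The following is a minimal sketch, assuming a placeholder URL and made-up header values. One detail it illustrates is that urllib stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
from urllib import request

# Placeholder URL and header values, used only for illustration.
req = request.Request(
    "https://example.com/",
    headers={"User-Agent": "demo-client/0.1", "Accept-Language": "en"},
)

# Request normalizes each header name with str.capitalize().
print(req.get_header("User-agent"))       # demo-client/0.1
print(req.has_header("Accept-language"))  # True
print(req.get_method())                   # GET, because no data was supplied
```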
You can also add headers after creating the request object:
from urllib import request, parse
url = 'https://book.qidian.com/info/1014243481#Catalog'
data = {'hw': 'hw'}
data = bytes(parse.urlencode(data), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Using Proxies
Requests can be routed through a proxy with ProxyHandler, for example to avoid IP-based blocking:
import urllib.request

# Map each URL scheme to the proxy that should handle it
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:8000',
    'https': 'https://127.0.0.1:8000'
})
# Requests made through this opener go via the proxy
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('https://book.qidian.com/info/1014243481#Catalog')
print(response.read())
Error Handling
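The exceptions urllib raises can be told apart with isinstance checks. Here is a minimal sketch, using a hypothetical helper named describe_error and exceptions built by hand so that no network access is needed. The URL and messages are placeholders.

```python
import socket
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Return a short label for a urllib exception (hypothetical helper)."""
    # HTTPError is tested first because it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        if isinstance(exc.reason, socket.timeout):
            return "timed out"
        return f"connection problem: {exc.reason}"
    return f"unexpected: {exc!r}"


# Hand-built exceptions stand in for real failures.
not_found = HTTPError("https://example.com/missing", 404, "Not Found", None, None)
timed_out = URLError(socket.timeout("timed out"))
refused = URLError("connection refused")

for exc in (not_found, timed_out, refused):
    print(describe_error(exc))
```

In real code the same checks would typically appear as separate except clauses around urlopen.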
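Typical exceptions are URLError and HTTPError (a subclass of URLError). Because HTTPError is the subclass, its except clause must come first; otherwise the URLError clause catches every HTTP error and the HTTPError clause never runs.
import socket
import urllib.request
import urllib.error

aa = ''
try:
    response = urllib.request.urlopen('https://book.qidian.com/info/1014243481#Catalog', timeout=0.1)
    aa = response.read().decode('utf8')
except urllib.error.HTTPError as e:
    # The server replied with an error status such as 404 or 500
    print(e.reason, e.code)
except urllib.error.URLError as e:
    # No usable response, e.g. DNS failure, refused connection, or timeout
    print(e.reason)
    if isinstance(e.reason, socket.timeout):
        print('time out')
finally:
    print(aa)
Cookie Management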
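The save and load round trip of a cookie jar can be tried without contacting a server. The sketch below is illustrative only: make_cookie is a hypothetical helper, the cookie name, value, and domain are invented, and the file goes into a temporary directory.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, domain):
    """Build a simple cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,  # one hour from now
        discard=False,
        comment=None, comment_url=None,
        rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Write one invented cookie to disk in the Netscape cookies.txt format.
jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# Read it back into a fresh jar.
restored = http.cookiejar.MozillaCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
print([(c.name, c.value, c.domain) for c in restored])
```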
Cookies can be handled via http.cookiejar and HTTPCookieProcessor:
import http.cookiejar, urllib.request

filename = 'cookie.txt'
# MozillaCookieJar stores cookies in the Netscape cookies.txt format
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
# Also write session cookies and already expired cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)
Loading saved cookies:
import http.cookiejar, urllib.request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))
URL Parsing (urllib.parse)
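The functions in this module are easiest to follow on a concrete URL. The following sketch reuses the Qidian address from the earlier examples. The query parameter "page" is hypothetical and only shows how a component can be swapped before the URL is reassembled.

```python
from urllib import parse

original = "https://book.qidian.com/info/1014243481#Catalog"

# Split the URL into its six components.
parts = parse.urlparse(original)
print(parts.scheme)    # https
print(parts.netloc)    # book.qidian.com
print(parts.path)      # /info/1014243481
print(parts.fragment)  # Catalog

# ParseResult is a named tuple, so _replace returns a modified copy.
# The "page" parameter is made up for this example.
with_query = parts._replace(query=parse.urlencode({"page": 2}), fragment="")
print(parse.urlunparse(with_query))  # https://book.qidian.com/info/1014243481?page=2

# parse_qs turns a query string back into a dict of lists.
print(parse.parse_qs(with_query.query))  # {'page': ['2']}

# urlunparse reverses urlparse for this URL.
print(parse.urlunparse(parts) == original)  # True
```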
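The urllib.parse module parses and constructs URLs. To split a URL into its six components (scheme, netloc, path, params, query, fragment):
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
To rebuild a URL from those six components:
urllib.parse.urlunparse(parts)
Robot Parsing (urllib.robotparser)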
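Rule matching can be tried offline by passing robots.txt lines straight to RobotFileParser.parse(). This is a minimal sketch with a made-up robots.txt. The rules, user agents, and example.com URLs are assumptions for illustration and are not copied from any real site.

```python
from urllib import robotparser

# Invented robots.txt content for the demo.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own entry and is only blocked from /private/.
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/private/a"))    # False
# Any other agent falls back to the "*" entry, which disallows everything.
print(rp.can_fetch("BadCrawler", "https://example.com/index.html"))  # False
```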
The robot parser reads a site's robots.txt and reports whether a given user agent may fetch a given URL:
from urllib import robotparser

rb = robotparser.RobotFileParser('https://www.baidu.com/robots.txt')
rb.read()  # downloads and parses robots.txt; returns None, so there is nothing to print
url = 'https://www.baidu.com'
user_agent = 'BadCrawler'
print(rb.can_fetch(user_agent, url)) # False
user_agent = 'Googlebot'
print(rb.can_fetch(user_agent, url)) # True
Application Example: Scraping Qidian Novel Titles
Using the browser’s developer tools (F12) and lxml, you can locate the HTML element that contains the novel titles and extract them. The urllib part handles the HTTP request, while lxml processes the HTML.
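The extraction step can be sketched as follows. Because lxml is a third-party package, this example swaps in the standard library's html.parser so that it runs without extra installs. The HTML snippet and the class name "book-title" are invented, and the real Qidian markup will differ.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <h2 class="book-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "book-title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


def extract_titles(html):
    """Return the stripped title strings found in an HTML document."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Invented markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><h2 class="book-title"><a href="/info/1">First Novel</a></h2></li>
  <li><h2 class="book-title"><a href="/info/2">Second Novel</a></h2></li>
  <li><h2 class="other">Not a title</h2></li>
</ul>
"""
print(extract_titles(SAMPLE_HTML))  # ['First Novel', 'Second Novel']
```

In a real crawl, the string passed to extract_titles would come from urllib.request.urlopen(...).read().decode('utf-8'), and the tag and class to match would be taken from the page as seen in the developer tools.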
Further tutorials will cover lxml and XPath syntax for more advanced data extraction.
Python Crawling & Data Mining
