Master Python’s urllib: From Basics to Advanced Web Scraping

Learn how to use Python’s built-in urllib library for web requests, handling GET/POST, adding headers, managing proxies, processing cookies, handling errors, parsing URLs, and respecting robots.txt, with clear code examples and a practical case of scraping a novel site.

Python Crawling & Data Mining

Dec 12, 2020

This article explains the common usage of Python’s built‑in urllib library, covering its definition, main modules, and a practical urllib + lxml crawling example.

What is urllib

urllib is a standard Python library for HTTP requests that requires no installation. It provides functions for web requests, response retrieval, proxy and cookie settings, exception handling, and URL parsing.

urllib Modules

The library consists of the following modules:

urllib.request        # opens URLs and sends requests
urllib.error          # exceptions raised by urllib.request
urllib.parse          # URL parsing and construction
urllib.robotparser    # robots.txt parsing

Below are typical usages of these modules.

Basic Request

To open a URL you can use:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

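The simplest GET request passes only the URL and reads the response content. The three most commonly used parameters are:

url – the address to request (a string or a Request object)

data – the request payload as bytes; when it is None (the default) a GET request is sent, otherwise a POST

timeout – the timeout in seconds for blocking operations such as the connection attempt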
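As a minimal sketch of such a GET request, the helper below opens a URL with a timeout and decodes the body. The name fetch_text is hypothetical, and the data: URL is an assumption used only so the snippet runs without network access; an http or https address is passed in exactly the same way.

```python
from urllib import request


def fetch_text(url, timeout=5.0):
    """Open url and return the decoded response body (hypothetical helper)."""
    # The with-block closes the response even if reading fails.
    with request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)


# A data: URL stands in for a real page so the example works offline.
demo_url = 'data:text/plain;charset=utf-8,hello%20urllib'
text = fetch_text(demo_url)
print(text)  # hello urllib
```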

For a POST request, pass the payload as bytes in the data argument. Supplying data is enough to make urllib send a POST; method='POST' just makes the intent explicit:

from urllib import request, parse
url = 'https://book.qidian.com/info/1014243481#Catalog'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Host': 'book.qidian.com'
}
data = {'hw': 'hw'}
data = bytes(parse.urlencode(data), encoding='utf8')
# Request() has no timeout parameter; the timeout is passed to urlopen() instead
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req, timeout=2)
print(response.read().decode('utf-8'))

Adding Headers

You can also add headers after creating the request object:

from urllib import request, parse
url = 'https://book.qidian.com/info/1014243481#Catalog'
data = {'hw': 'hw'}
data = bytes(parse.urlencode(data), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
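Before sending anything, it can help to inspect what a Request object will actually transmit. The sketch below builds the same kind of request without opening a connection; the shortened User-Agent value is a placeholder, not the string used in the article.

```python
from urllib import parse, request

url = 'https://book.qidian.com/info/1014243481#Catalog'
payload = parse.urlencode({'hw': 'hw'}).encode('utf-8')

# No method argument: urllib infers it from the presence of data.
req = request.Request(url, data=payload)
req.add_header('User-Agent', 'Mozilla/5.0')  # placeholder value

print(req.get_method())              # POST
print(req.data)                      # b'hw=hw'
print(req.selector)                  # /info/1014243481
print(req.fragment)                  # Catalog
# Header names are stored capitalized, so look them up as 'User-agent'.
print(req.get_header('User-agent'))  # Mozilla/5.0

# The same URL without data defaults to GET.
print(request.Request(url).get_method())  # GET
```

Note that the #Catalog fragment is kept on the Request object but is not part of the selector that goes to the server.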

Using Proxies

Requests can be routed through a proxy with ProxyHandler, for example to avoid having your own IP address blocked:

import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:8000',
    'https': 'https://127.0.0.1:8000'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('https://book.qidian.com/info/1014243481#Catalog')
print(response.read())

Error Handling

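Typical exceptions are URLError and HTTPError. Because HTTPError is a subclass of URLError, it must be caught first; otherwise the URLError clause handles every HTTP error and the HTTPError clause never runs.

import socket
import urllib.request
import urllib.error
aa = ''
try:
    response = urllib.request.urlopen('https://book.qidian.com/info/1014243481#Catalog', timeout=0.1)
    aa = response.read().decode('utf8')
except urllib.error.HTTPError as e:
    # The server answered, but with an error status such as 404 or 500
    print(e.reason, e.code)
except urllib.error.URLError as e:
    # No usable response: DNS failure, refused connection, connection timeout, ...
    print(e.reason)
    if isinstance(e.reason, socket.timeout):
        print('time out')
except socket.timeout:
    # A timeout while reading the body is raised directly, not wrapped in URLError
    print('time out')
finally:
    print(aa)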
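The ordering rule can be checked without touching the network. In the sketch below, describe_error is a hypothetical helper and the exceptions are built by hand to stand in for real failures; it only illustrates that the HTTPError check has to come before the URLError check.

```python
import socket
import urllib.error


def describe_error(exc):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # Check the subclass first: every HTTPError is also a URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP {exc.code}: {exc.reason}'
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return 'timed out'
        return f'connection failed: {exc.reason}'
    return f'unexpected error: {exc!r}'


# Hand-built exceptions stand in for real network failures.
not_found = urllib.error.HTTPError(
    'https://example.com/missing', 404, 'Not Found', None, None)
timed_out = urllib.error.URLError(socket.timeout('timed out'))
refused = urllib.error.URLError('connection refused')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe_error(not_found))  # HTTP 404: Not Found
print(describe_error(timed_out))  # timed out
print(describe_error(refused))    # connection failed: connection refused
```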

Cookie Management

Cookies can be handled via http.cookiejar and HTTPCookieProcessor:

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Loading saved cookies:

import http.cookiejar, urllib.request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))
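The ignore_discard flag matters because session cookies (those without an expiry date) are marked as discardable, and save() skips them unless the flag is set. The sketch below shows this offline; the hand-built cookie, the helper names, and the example.com domain are assumptions for illustration, since normally the server supplies the cookies.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False,
        expires=None,   # no expiry date: this is a session cookie
        discard=True,   # ...so it is flagged to be discarded
        comment=None, comment_url=None, rest={},
    )


def count_after_round_trip(path, ignore_discard):
    """Save one session cookie to path, reload the file, count survivors."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(make_session_cookie('session', 'abc123', 'example.com'))
    jar.save(ignore_discard=ignore_discard, ignore_expires=True)

    reloaded = http.cookiejar.MozillaCookieJar()
    reloaded.load(path, ignore_discard=True, ignore_expires=True)
    return len(reloaded)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')
    dropped = count_after_round_trip(path, ignore_discard=False)
    kept = count_after_round_trip(path, ignore_discard=True)

print(dropped)  # 0
print(kept)     # 1
```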

URL Parsing (urllib.parse)

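The urllib.parse module parses and constructs URLs. To split a URL into its six components (scheme, netloc, path, params, query, fragment):

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

To rebuild a URL from those six components:

urllib.parse.urlunparse(parts)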
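A short round trip makes the two functions concrete. This is a sketch only: the ?page=2 query string is added to the article's URL purely for illustration, and urlencode and parse_qs are brought in to show how a query string is usually built and read back.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

# The ?page=2 query string is added here for illustration only.
url = 'https://book.qidian.com/info/1014243481?page=2#Catalog'

parts = urlparse(url)
print(parts.scheme)    # https
print(parts.netloc)    # book.qidian.com
print(parts.path)      # /info/1014243481
print(parts.query)     # page=2
print(parts.fragment)  # Catalog

# urlunparse() takes the six components and rebuilds the string.
print(urlunparse(parts) == url)  # True

# ParseResult is a named tuple, so _replace() swaps individual components.
query = urlencode({'page': 3, 'q': 'python urllib'})
next_url = urlunparse(parts._replace(query=query, fragment=''))
print(next_url)  # https://book.qidian.com/info/1014243481?page=3&q=python+urllib
print(parse_qs(query))  # {'page': ['3'], 'q': ['python urllib']}
```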

Robot Parsing (urllib.robotparser)

The robot parser reads robots.txt to determine if a site permits crawling:

from urllib import robotparser
rb = robotparser.RobotFileParser('https://www.baidu.com/robots.txt')
rb.read()  # downloads and parses robots.txt; it returns None, so there is nothing to print
url = 'https://www.baidu.com'
user_agent = 'BadCrawler'
print(rb.can_fetch(user_agent, url))  # False
user_agent = 'Googlebot'
print(rb.can_fetch(user_agent, url))  # True
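The matching logic can also be tried without downloading anything, because RobotFileParser.parse() accepts the lines of a robots.txt file directly. The rules, the GoodBot name, and the example.com URLs below are made up for this sketch and do not reflect any real site's policy.

```python
from urllib import robotparser

# Made-up rules: one crawler is banned, everyone else avoids /private/.
rules = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse in-memory lines instead of calling read()

print(parser.can_fetch('BadCrawler', 'https://example.com/'))            # False
print(parser.can_fetch('GoodBot', 'https://example.com/index.html'))     # True
print(parser.can_fetch('GoodBot', 'https://example.com/private/a.html')) # False
print(parser.crawl_delay('GoodBot'))                                     # 2
```

A crawler would typically call can_fetch() before each request and wait for the crawl delay, when one is given, between requests to the same host.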

Application Example: Scraping Qidian Novel Titles

Using the browser’s developer tools (F12) and lxml, you can locate the HTML element that contains the novel titles and extract them. The urllib part handles the HTTP request, while lxml processes the HTML.

Further tutorials will cover lxml and XPath syntax for more advanced data extraction.
