Master Web Crawling in Python: From urllib to requests and Robots.txt

This guide explains the fundamentals of web crawling, covering crawler types, the Robots.txt protocol, Python's urllib and urllib3 modules, the requests library, handling HTTP methods, user‑agents, HTTPS certificates, and practical code examples for extracting data from websites.

MaGe Linux Operations

Dec 25, 2019

Overview

Web crawlers, also known as spiders or bots, are programs that automatically fetch web pages; search engines are major users of crawlers. In the big‑data era, businesses often need to collect specific site data that search engines cannot provide, so they develop custom crawlers.

Crawler Types

1. General crawlers – like search engines, they collect data indiscriminately, store it, extract keywords, and build indexes. Typical workflow:

(1) Initialize a list of URLs and add them to a crawl queue.

(2) Take URLs from the queue, resolve DNS, download HTML, save locally, then move URLs to a completed list.

(3) Parse pages, discover new URLs, and repeat until stopping criteria are met.

Search engines acquire URLs via site submissions, external links, or cooperation with DNS providers.

2. Focused crawlers – target specific domains or topics, gathering only relevant data.
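The general-crawler workflow described above can be sketched as a breadth-first loop. This is a minimal illustration, not a production crawler: the `crawl` function, the `LinkParser` class, and the in-memory `FAKE_SITE` dictionary are all hypothetical names introduced here, and DNS resolution, downloading, and local storage are folded into a pluggable `fetch` callable so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; `fetch` maps a URL to HTML text, or None on failure."""
    queue = deque(seed_urls)   # URLs waiting to be crawled
    seen = set(seed_urls)      # URLs already queued, to avoid duplicates
    completed = []             # URLs that were downloaded and parsed
    while queue and len(completed) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        completed.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return completed


# Hypothetical in-memory "site" standing in for real HTTP downloads.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl(["http://example.test/"], FAKE_SITE.get)
print(pages)
```

A focused crawler could reuse the same loop and simply filter `absolute` against its target domain or topic before queueing it.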

Robots.txt Protocol

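The robots.txt file, served from the site root, tells crawlers which parts of a site they may access. It is a convention rather than an enforcement mechanism: honoring it is up to the crawler. Directives include: User-agent – the crawler that the following rules apply to (* matches every crawler). Allow – paths that may be crawled. Disallow – paths that must not be crawled. A path of / denotes the site root and therefore covers every directory.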

Many crawlers also accept wildcards in paths: * matches any sequence of characters and $ anchors the end of the URL.

Example snippets from Taobao and Mafengwo illustrate typical allow/disallow rules.
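To make the Allow/Disallow semantics concrete, here is a small sketch using the standard library's urllib.robotparser. The rules in `ROBOTS_TXT`, the `example.test` host, and the `MyCrawler` agent name are hypothetical and are not taken from any real site. As far as I know, the standard-library parser matches rules by path prefix and does not expand * or $ inside paths, so the sketch sticks to plain prefixes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="MyCrawler"):
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.test/public/index.html"))   # True
print(allowed("http://example.test/private/data.html"))   # False
print(allowed("http://example.test/search?q=python"))     # False
print(allowed("http://example.test/about.html"))          # True: no rule matches
```

Against a live site you would call `parser.set_url(".../robots.txt")` followed by `parser.read()` instead of `parse()`.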

HTTP Request and Response Handling

Crawling essentially performs HTTP requests programmatically. Python’s standard library provides the urllib package.

urllib Package

Modules:

urllib.request – open and read URLs.

urllib.error – exceptions raised by urllib.request.

urllib.parse – URL parsing and encoding.

urllib.robotparser – parse robots.txt.

In Python 2 there were separate urllib and urllib2; in Python 3 they are merged into urllib.

urllib.request – urlopen

Example:

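from urllib.request import urlopen
response = urlopen("http://www.bing.com")  # sends a GET request
print(response.closed)  # False: the connection is still open
with response:  # the context manager closes the response on exit
    print(1, type(response))  # http.client.HTTPResponse
    print(2, response.status, response.reason)  # e.g. 200 OK
    print(3, response.geturl())  # final URL after any redirects
    print(4, response.info())  # response headers
    print(5, response.read()[:50])  # first 50 bytes of the body
print(response.closed)  # True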

The returned object behaves like a file; you can read content, inspect status, headers, and final URL.
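The urlopen example assumes the request succeeds. In practice a crawler has to cope with HTTP error statuses and network failures, which urllib.error exposes as HTTPError and URLError. The `fetch` helper below is a hypothetical wrapper written for this illustration; its return convention (a status/body pair, or None plus an error message) is an assumption, not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Return (status, body_bytes), or (None, error_text) if no response arrived."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code, exc.read()
    except URLError as exc:
        # No usable response: DNS failure, refused connection, bad scheme, ...
        return None, str(exc.reason)


# An unsupported scheme fails before any network traffic happens,
# which makes it a convenient offline demonstration of URLError.
status, detail = fetch("foo://bar")
print(status, detail)
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the more general handler would swallow it.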

Custom User‑Agent

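By default urllib identifies itself as Python-urllib/3.x, which many sites reject. To reduce the chance of being blocked, send a browser-like User-Agent header: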

from urllib.request import Request, urlopen
import random
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 Chrome/57.0.2987.133 Safari/537.36",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 Safari/5.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
]
ua = random.choice(ua_list)  # pick one User-Agent at random on each run
request = Request("http://www.bing.com")
request.add_header("User-Agent", ua)
response = urlopen(request, timeout=20)  # give up after 20 seconds
with response:
    print(1, response.status, response.getcode(), response.reason)
    print(2, response.geturl())
    print(3, response.info())
    print(4, response.read()[:50])
# add_header stores the name as "User-agent", so look it up with that spelling
print(request.get_header("User-agent"))

urllib.request – Request Class

Request wraps a URL together with its headers, body, and method; pass the object to urlopen to send it:

from urllib.request import Request, urlopen
url = "http://www.bing.com/"
request = Request(url)
request.add_header("User-Agent", "Mozilla/5.0 ...")
response = urlopen(request)
print(type(response))
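One detail that is easy to trip over: Request capitalizes only the first letter of a header name when storing it, so a header added as "User-Agent" is stored as "User-agent". The short sketch below inspects a Request without sending it; the header value is a placeholder.

```python
from urllib.request import Request

request = Request("http://www.bing.com/")
request.add_header("User-Agent", "Mozilla/5.0")

# The name is normalized with str.capitalize() when stored.
print(request.get_header("User-agent"))  # Mozilla/5.0
print(request.get_header("User-Agent"))  # None: lookups use the stored spelling
print(request.has_header("User-agent"))  # True
print(request.header_items())            # [('User-agent', 'Mozilla/5.0')]
```

This only affects how you query the Request object; HTTP header names themselves are case-insensitive on the wire.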

urllib.parse

Encode query strings and decode them:

from urllib.parse import urlencode, unquote
q = urlencode({"url": "http://www.xdd.com/python", "p_url": "http://www.xdd.com/python?id=1&name=张三"})
print(q)
print(unquote(q))
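Beyond encoding query strings, urllib.parse can also take a URL apart and resolve relative links, which a crawler needs when following hrefs. A brief sketch, reusing the article's www.xdd.com example host (the /java path is made up for illustration):

```python
from urllib.parse import parse_qs, quote, urljoin, urlparse

url = "http://www.xdd.com/python?id=1&name=%E5%BC%A0%E4%B8%89"

parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)  # http www.xdd.com /python

# parse_qs decodes percent-escapes and groups values by key.
query = parse_qs(parts.query)
print(query)  # {'id': ['1'], 'name': ['张三']}

# urljoin resolves a link found on the page against the page's own URL.
print(urljoin(url, "/java"))  # http://www.xdd.com/java

# quote percent-encodes a single path or query component as UTF-8.
print(quote("张三"))  # %E5%BC%A0%E4%B8%89
```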

GET and POST Methods

GET passes data in the URL; POST sends data in the request body.
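With urllib, the difference shows up in how the Request is built: the same encoded parameters go into the URL for GET, or into the `data` argument (as bytes) for POST. The sketch below constructs both requests without sending them; the httpbin.org/get endpoint and the parameter names are assumptions chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({"q": "python crawler", "page": 2})
print(params)  # q=python+crawler&page=2

# GET: parameters are appended to the URL as a query string.
get_request = Request(f"http://httpbin.org/get?{params}")

# POST: parameters travel in the request body, which must be bytes.
post_request = Request("http://httpbin.org/post", data=params.encode("utf-8"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Request infers the method from whether `data` is present, which is why the article's POST example never names the method explicitly.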

GET Example

from urllib.request import urlopen, Request
from urllib.parse import urlencode
# urlencode percent-encodes the non-ASCII query value
data = urlencode({"q": "神探狄仁杰"})
url = f"http://cn.bing.com/search?{data}"
request = Request(url, headers={"User-agent": "Mozilla/5.0"})
# Save the raw response bytes; adjust the output path for your system
with urlopen(request) as response, open("d:/abc.html", "wb") as f:
    f.write(response.read())
print("ok")

POST Example

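from urllib.request import Request, urlopen
from urllib.parse import urlencode
import json  # the standard-library json module is sufficient here
request = Request("http://httpbin.org/post")
request.add_header("User-agent", "Mozilla/5.0")
# urlencode percent-encodes the special characters in the form values
data = urlencode({"name": "张三,@=/&", "age": "6"})
# Passing data (as bytes) makes urlopen send a POST instead of a GET
res = urlopen(request, data.encode())
with res:
    j = res.read().decode()
    print(j)              # raw JSON text echoed back by httpbin
    print(json.loads(j))  # the same payload parsed into a dict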

Handling HTTPS Certificates

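When a site uses a self‑signed certificate (as 12306 once did), certificate verification fails and urlopen raises urllib.error.URLError wrapping ssl.SSLCertVerificationError (ssl.CertificateError is an alias for it since Python 3.7). You can skip verification with an unverified context; this also removes protection against man‑in‑the‑middle attacks, so limit it to hosts you trust: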

import ssl
from urllib.request import Request, urlopen
request = Request("https://www.12306.cn/mormhweb/")
request.add_header("User-Agent", "Mozilla/5.0")
# Private helper (note the leading underscore): builds a context that skips certificate checks
context = ssl._create_unverified_context()
res = urlopen(request, context=context)
with res:
    print(res.geturl())
    print(res.read().decode())
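The same effect can be had with the ssl module's public API rather than the underscore-prefixed helper. The sketch below is one way to do it; `insecure_context` is a hypothetical helper name. When you have the site's certificate on disk, passing it via `ssl.create_default_context(cafile=...)` is usually a better choice than disabling verification altogether.

```python
import ssl


def insecure_context():
    """Build an SSLContext that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


default = ssl.create_default_context()
relaxed = insecure_context()

print(default.verify_mode, default.check_hostname)  # verification enabled
print(relaxed.verify_mode, relaxed.check_hostname)  # verification disabled
```

The resulting object is passed to urlopen exactly like the one in the article's example: `urlopen(request, context=relaxed)`.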

urllib3 Library

Third‑party library offering connection pooling and more features than the standard urllib:

import urllib3
url = "https://movie.douban.com"
ua = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 Safari/5.0.1"
with urllib3.PoolManager() as http:
    response = http.request("GET", url, headers={"User-Agent": ua})
    print(type(response))
    print(response.status, response.reason)
    print(response.headers)
    print(response.data[:50])
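Two of the features that make urllib3 attractive for crawling are configurable retries and timeouts, both of which can be set once on the PoolManager. The sketch below only builds the configuration and does not send a request; the specific numbers and status codes are assumptions chosen for illustration, not recommendations from the original article.

```python
import urllib3
from urllib3.util import Retry, Timeout

# Retry up to 3 times, backing off between attempts, on common transient statuses.
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])

# Separate limits for establishing the connection and for reading the response.
timeout = Timeout(connect=5.0, read=10.0)

http = urllib3.PoolManager(
    retries=retry,
    timeout=timeout,
    headers={"User-Agent": "Mozilla/5.0"},  # default headers for every request
)

print(retry.total, timeout.connect_timeout, timeout.read_timeout)
print(http.headers)
```

Every `http.request("GET", url)` made through this manager then inherits the retry policy, timeouts, and default headers unless a call overrides them.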

requests Library

Higher‑level HTTP client built on urllib3:

import requests
ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 Chrome/57.0.2987.133 Safari/537.36"
url = "https://movie.douban.com/"
response = requests.request("GET", url, headers={"User-Agent": ua})
print(type(response))
print(response.url)
print(response.status_code)
print(response.headers)
print(response.text[:200])
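requests can also build the query string for you and raise an exception on error statuses. The sketch below prepares a request without sending it, so the final URL can be inspected offline; the `q` and `start` parameters and the `fetch_text` helper are hypothetical and only illustrate the pattern.

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}

# Preparing a Request shows exactly what would be sent, without any network I/O.
prepared = requests.Request(
    "GET",
    "https://movie.douban.com/",
    params={"q": "python", "start": 20},  # hypothetical query parameters
    headers=HEADERS,
).prepare()
print(prepared.method, prepared.url)


def fetch_text(url, **params):
    """Fetch a page as text, failing loudly on 4xx/5xx responses."""
    response = requests.get(url, params=params, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on error statuses
    return response.text
```

Setting an explicit `timeout` matters because requests waits indefinitely by default.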

Using a Session preserves cookies across multiple requests:

import requests
ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"
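# The same URL is requested twice so the effect of the session is visible on the second pass
urls = ["https://www.baidu.com/s?wd=xdd", "https://www.baidu.com/s?wd=xdd"]
session = requests.Session()
for url in urls:
    resp = session.get(url, headers={"User-Agent": ua})
    print(resp.request.headers)  # second pass: includes a Cookie header if the server set one
    print(resp.cookies)          # cookies set by this particular response
    print(resp.text[:20])
    print("-"*30)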
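A Session can also carry default headers, so the User-Agent does not have to be repeated on every call. The sketch below shows, without any network traffic, how session-level headers and stored cookies end up on an outgoing request; the `sid` cookie and its value are hypothetical stand-ins for whatever a real server would set.

```python
import requests

session = requests.Session()

# Headers set on the session are merged into every request it sends.
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Simulate a cookie that an earlier response would have stored in the session.
session.cookies.set("sid", "abc123", domain="www.baidu.com", path="/")

# prepare_request applies the session's headers and cookies without sending anything.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.baidu.com/s", params={"wd": "xdd"})
)

print(prepared.url)
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```

In a real crawl, the cookie jar is filled automatically from Set-Cookie response headers, and `session.cookies` holds everything accumulated so far.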
