
Master Python Web Scraping: From urllib to Scrapy with Real-World Examples

This comprehensive guide walks you through Python web crawling fundamentals, covering request handling, URL encoding, regular expressions, the requests library, XPath parsing, and lxml, complete with code snippets and practical examples to help you build effective scrapers.

Python Programming Learning Circle

Overview of Web Crawlers

A web crawler (also called a spider or robot) is a program that fetches web data, essentially mimicking a human browsing the site. Crawlers are used to collect large datasets for analysis or testing, or when third‑party data is unavailable or too expensive.

Why Use Python for Crawling

Rich, mature request and parsing modules; powerful Scrapy framework.

Compared with PHP, Java, and C/C++, Python offers more concise code and stronger library support.

Crawler Types

A. General crawlers (search engines) that obey robots.txt.

B. Custom crawlers written by developers.

Typical Crawling Steps

Identify target URLs.

Send requests and receive responses.

Extract required data from the response content.

Save data and repeat for discovered URLs.
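The steps above can be sketched as a small breadth-first loop. The `PAGES` dict and `crawl()` helper below are hypothetical stand-ins for real request and parsing code, just to show the control flow:

```python
from collections import deque

# A tiny in-memory "site" standing in for real pages (hypothetical data).
PAGES = {
    "/index": {"data": "home", "links": ["/a", "/b"]},
    "/a": {"data": "page a", "links": ["/b"]},
    "/b": {"data": "page b", "links": []},
}

def crawl(start):
    """Breadth-first crawl: fetch, extract, save, repeat for new URLs."""
    seen, queue, results = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        page = PAGES[url]             # step 2: send request, get response
        results.append(page["data"])  # steps 3-4: extract and save data
        for link in page["links"]:    # step 4: follow discovered URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results

print(crawl("/index"))  # ['home', 'page a', 'page b']
```

In a real crawler, the dict lookup becomes an HTTP request and the link list comes from parsing the response; the `seen` set is what keeps the crawler from revisiting pages.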

1. urllib.request Module

Import the module:

<code>import urllib.request</code>
<code>from urllib import request</code>

Create a request with custom headers and fetch the page:

<code>req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0 ...'})</code>
<code>res = request.urlopen(req)</code>
<code>html = res.read().decode('utf-8')</code>

Key methods of the response object include read(), geturl(), getcode(), and encoding/decoding helpers.
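A complete, runnable version of the above can be demonstrated against a throwaway local server (used here only so the example works without internet access; it is not part of urllib itself):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Serve one small page locally so the example runs offline.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = "<h1>hello</h1>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
res = urllib.request.urlopen(req)
html = res.read().decode("utf-8")  # read() returns bytes; decode to str

print(res.getcode())  # 200
print(res.geturl())   # final URL after any redirects
print(html)           # <h1>hello</h1>
server.shutdown()
```

Setting a browser-like User-Agent matters because many sites reject the default `Python-urllib/x.y` header.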

2. urllib.parse (URL Encoding)

Encode query parameters:

<code>from urllib import parse</code>
<code>query_string = {'wd': '美女'}</code>
<code>encoded = parse.urlencode(query_string)</code>

Result: https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

Other useful functions: quote() and unquote().
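The difference between the two styles: urlencode() takes a dict and builds a full query string, while quote() encodes a single bare string and unquote() reverses it:

```python
from urllib import parse

# urlencode() builds a key=value query string from a dict.
params = parse.urlencode({"wd": "美女", "pn": 10})
print(params)  # wd=%E7%BE%8E%E5%A5%B3&pn=10

# quote() percent-encodes one string; unquote() decodes it back.
encoded = parse.quote("美女")
print(encoded)                 # %E7%BE%8E%E5%A5%B3
print(parse.unquote(encoded))  # 美女
```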

3. re (Regular Expressions)

Find all matches:

<code>import re</code>
<code>matches = re.findall('pattern', html, re.S)</code>

Or compile first:

<code>pattern = re.compile('pattern', re.S)</code>
<code>matches = pattern.findall(html)</code>

Common meta‑characters, greedy vs. non‑greedy matching, and grouping are demonstrated with examples.
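A minimal illustration of greedy vs. non-greedy matching with grouping, using a made-up HTML fragment:

```python
import re

html = '<div class="item">Book A</div><div class="item">Book B</div>'

# Greedy: .* grabs as much as possible, spanning across both divs.
greedy = re.findall(r'<div class="item">(.*)</div>', html, re.S)
# Non-greedy: .*? stops at the first closing tag, one match per div.
lazy = re.findall(r'<div class="item">(.*?)</div>', html, re.S)

print(greedy)  # ['Book A</div><div class="item">Book B']
print(lazy)    # ['Book A', 'Book B']
```

The re.S flag makes `.` also match newlines, which is almost always what you want when matching across raw HTML.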

4. requests Library

Installation:

<code>pip install requests</code>

GET request example:

<code>import requests
headers = {'User-Agent': 'Mozilla/5.0 ...'}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
html = res.text</code>

POST request example:

<code>response = requests.post(url, data=data, headers=headers)</code>

Key parameters: url, params, headers, timeout, etc.
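A sketch showing params and timeout together. The local echo server below is an assumption made only so the example is self-contained; in practice you would point requests.get() at a real URL:

```python
import threading
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer

# Echo the request path back, so we can see how `params` is encoded.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = self.path.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/s" % server.server_port

headers = {"User-Agent": "Mozilla/5.0"}
# `params` is URL-encoded into the query string; `timeout` (in seconds)
# stops the request from hanging forever on a dead server.
res = requests.get(url, params={"wd": "python"}, headers=headers, timeout=5)
res.encoding = "utf-8"

print(res.status_code)  # 200
print(res.text)         # /s?wd=python
server.shutdown()
```

Note that requests builds the query string for you, so there is no need to call urlencode() by hand as in the urllib example.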

5. XPath Parsing

XPath selects nodes in XML/HTML documents. Example HTML snippet:

<code>&lt;ul class="book_list"&gt;
  &lt;li&gt;
    &lt;title class="book_001"&gt;Harry Potter&lt;/title&gt;
    &lt;author&gt;J. K. Rowling&lt;/author&gt;
    &lt;year&gt;2005&lt;/year&gt;
    &lt;price&gt;69.99&lt;/price&gt;
  &lt;/li&gt;
&lt;/ul&gt;</code>

Common XPath expressions:

Find all li nodes: //li

Find title with specific class: //li/title[@class="book_001"]

Get attribute values: //title/@class

Functions like contains() and text() are used to filter nodes.

6. lxml Parsing Library

Import and parse HTML:

<code>from lxml import etree
parse_html = etree.HTML(html)</code>

Execute XPath queries:

<code>result = parse_html.xpath('xpath_expression')</code>

xpath() always returns a list, even for a single match; iterate over it to extract the data.
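Putting the two sections together on the book snippet from above. Since that snippet happens to be well-formed, it is parsed here with etree.fromstring(); on real, messier pages you would use etree.HTML() as shown earlier, and the xpath() calls work the same way:

```python
from lxml import etree

html = """
<ul class="book_list">
  <li>
    <title class="book_001">Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>69.99</price>
  </li>
</ul>
"""

root = etree.fromstring(html)

# xpath() always returns a list, even when only one node matches.
titles = root.xpath('//li/title[@class="book_001"]/text()')
classes = root.xpath('//title/@class')

print(titles)   # ['Harry Potter']
print(classes)  # ['book_001']
```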

- END -
