Master Python Web Scraping: From urllib to Scrapy with Real-World Examples
This comprehensive guide walks you through Python web crawling fundamentals, covering request handling, URL encoding, regular expressions, the requests library, XPath parsing, and lxml, complete with code snippets and practical examples to help you build effective scrapers.
Overview of Web Crawlers
A web crawler (spider or robot) is a program that fetches web data, essentially mimicking a human browser. It is used to collect large datasets for analysis, testing, or when third‑party data is unavailable or too expensive.
Why Use Python for Crawling
Rich, mature request and parsing modules; powerful Scrapy framework.
Compared with PHP, Java, and C/C++, Python offers more concise code along with strong library support.
Crawler Types
A. General crawlers (search engines) that obey robots.txt.
B. Custom crawlers written by developers.
Typical Crawling Steps
Identify target URLs.
Send requests and receive responses.
Extract required data from the response content.
Save data and repeat for discovered URLs.
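The four steps above can be sketched as a breadth-first loop. In this sketch, `fetch` and `extract_links` are placeholders supplied by the caller (for example using urllib and re from the sections below), which also keeps the loop itself easy to test:

```python
from collections import deque

def crawl(seed_url, fetch, extract_links, max_pages=10):
    """Breadth-first crawl loop following the four steps above.

    fetch(url) -> html and extract_links(html) -> iterable of URLs
    are injected by the caller."""
    queue = deque([seed_url])      # step 1: target URLs
    seen = {seed_url}              # avoid re-fetching discovered URLs
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)                  # step 2: request/response
        results[url] = html                # step 4: save data
        for link in extract_links(html):   # step 3: extract new URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Because the fetching and parsing are injected, the same loop works whether the pages come from urllib, requests, or a test fixture.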
1. urllib.request Module
Import the module:
<code>import urllib.request</code> or <code>from urllib import request</code>

Create a request with custom headers and fetch the page:

<code>req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0 ...'})
res = request.urlopen(req)
html = res.read().decode('utf-8')</code>

Key methods of the response object include read(), geturl(), getcode(), and the bytes/str encoding and decoding helpers.
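A runnable sketch of the response methods above; a throwaway local HTTP server stands in for a real site so the example does not depend on the network:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
from urllib import request

class Handler(BaseHTTPRequestHandler):
    """Minimal stand-in server that returns a fixed HTML page."""
    def do_GET(self):
        body = '<html>hello</html>'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f'http://127.0.0.1:{server.server_port}/'
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0 ...'})
with request.urlopen(req) as res:
    html = res.read().decode('utf-8')  # body bytes decoded to text
    status = res.getcode()             # HTTP status code: 200
    final_url = res.geturl()           # final URL after any redirects

server.shutdown()
print(status, final_url, html)
```

Against a real site, only the `url` changes; the read/decode pattern is the same.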
2. urllib.parse (URL Encoding)
Encode query parameters:
<code>from urllib import parse

query_string = {'wd': '美女'}
encoded = parse.urlencode(query_string)  # 'wd=%E7%BE%8E%E5%A5%B3'</code>

Appended to the base URL, this gives: https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

Other useful functions: quote() and unquote().
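Where urlencode() builds a full key=value query string from a dict, quote() percent-encodes a single component and unquote() reverses it:

```python
from urllib import parse

# quote() percent-encodes one path or query component.
encoded = parse.quote('美女')
print(encoded)            # %E7%BE%8E%E5%A5%B3

# unquote() decodes it back to the original text.
decoded = parse.unquote(encoded)
print(decoded)            # 美女
```

Note that by default quote() leaves '/' unescaped (its safe parameter defaults to '/'), which is why it suits path components rather than whole query strings.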
3. re (Regular Expressions)
Find all matches:
<code>import re

matches = re.findall('pattern', html, re.S)</code>

Or compile first:

<code>pattern = re.compile('pattern', re.S)
matches = pattern.findall(html)</code>

The re.S flag makes . match newlines as well, which matters for multi-line HTML. The main features to master are the common meta-characters, greedy vs. non-greedy matching, and grouping.
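A short illustration of greedy vs. non-greedy matching and grouping on a toy HTML string:

```python
import re

html = '<div>first</div><div>second</div>'

# Greedy: .* grabs as much as possible, so one match spans both divs.
greedy = re.findall('<div>(.*)</div>', html, re.S)
print(greedy)   # ['first</div><div>second']

# Non-greedy: .*? stops at the first closing tag.
lazy = re.findall('<div>(.*?)</div>', html, re.S)
print(lazy)     # ['first', 'second']

# Multiple groups: findall returns one tuple per match;
# \1 is a backreference to the first group (the tag name).
pairs = re.findall(r'<(\w+)>(.*?)</\1>', html)
print(pairs)    # [('div', 'first'), ('div', 'second')]
```

The greedy result is almost never what a scraper wants, which is why .*? appears so often in scraping patterns.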
4. requests Library
Installation:
<code>pip install requests</code>

GET request example:

<code>import requests

headers = {'User-Agent': 'Mozilla/5.0 ...'}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
html = res.text</code>

POST request example:

<code>response = requests.post(url, data=data, headers=headers)</code>

Key parameters: url, params, headers, timeout, etc.
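With requests, the params argument does the URL encoding from section 2 automatically. A prepared request (built but not sent, so no network is needed) shows the final URL; the Baidu URL here just mirrors the earlier example:

```python
import requests

# Build a GET request without sending it, to inspect how `params`
# are encoded into the query string.
req = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': '美女'},
    headers={'User-Agent': 'Mozilla/5.0 ...'},
)
prepared = req.prepare()
print(prepared.url)   # https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
```

For a real call, `requests.get('https://www.baidu.com/s', params={'wd': '美女'}, timeout=10)` sends the same URL; timeout guards against a hung connection.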
5. XPath Parsing
XPath selects nodes in XML/HTML documents. Example HTML snippet:
<code><ul class="book_list">
<li>
<title class="book_001">Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>69.99</price>
</li>
</ul></code>

Common XPath expressions:
Find all li nodes: //li
Find title with specific class: //li/title[@class="book_001"]
Get attribute values: //title/@class
Functions like contains() filter nodes by substring, while text() selects a node's text content.
6. lxml Parsing Library
Import and parse HTML:
<code>from lxml import etree

parse_html = etree.HTML(html)</code>

Execute XPath queries:

<code>result = parse_html.xpath('xpath_expression')</code>

The result is always a list; iterate over it to extract the data.
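Putting sections 5 and 6 together on the book snippet from above. One caveat: etree.HTML() runs a lenient HTML parser that may relocate tags such as &lt;title&gt; into &lt;head&gt;, so since this snippet is well-formed, etree.fromstring() is used here to keep the structure exactly as written; on real pages, etree.HTML() is the right choice:

```python
from lxml import etree

html = '''<ul class="book_list">
  <li>
    <title class="book_001">Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>69.99</price>
  </li>
</ul>'''

# Well-formed snippet, so parse it as XML to preserve the layout.
parse_html = etree.fromstring(html)

# Every xpath() call returns a list, even for a single hit.
titles = parse_html.xpath('//li/title[@class="book_001"]/text()')
print(titles)    # ['Harry Potter']

# @class selects attribute values instead of nodes.
classes = parse_html.xpath('//title/@class')
print(classes)   # ['book_001']

# contains() filters by substring; text() extracts the text nodes.
authors = parse_html.xpath('//author[contains(text(), "Rowling")]/text()')
print(authors)   # ['J. K. Rowling']
```

Iterating over each returned list (or indexing into it after checking it is non-empty) is how the final data is extracted.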