Master Python Web Scraping: From urllib to Scrapy with Real-World Examples
This comprehensive guide walks you through Python web crawling fundamentals, covering request handling, URL encoding, regular expressions, the requests library, XPath parsing, and lxml, complete with code snippets and practical examples to help you build effective scrapers.
Overview of Web Crawlers
A web crawler (spider or robot) is a program that fetches web data, essentially mimicking a human browser. It is used to collect large datasets for analysis, testing, or when third‑party data is unavailable or too expensive.
Why Use Python for Crawling
Rich, mature request and parsing modules; powerful Scrapy framework.
Compared with PHP, Java, and C/C++, Python offers more concise code along with strong library support.
Crawler Types
A. General crawlers (search engines) that obey robots.txt.
B. Custom crawlers written by developers.
Typical Crawling Steps
Identify target URLs.
Send requests and receive responses.
Extract required data from the response content.
Save data and repeat for discovered URLs.
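The four steps above can be sketched as a breadth-first loop. In this sketch, `fetch` and `extract_links` are placeholders supplied by the caller (for example using urllib and re from the sections below), which also keeps the loop itself easy to test:

```python
from collections import deque

def crawl(seed_url, fetch, extract_links, max_pages=10):
    """Breadth-first crawl loop following the four steps above.

    fetch(url) -> html and extract_links(html) -> iterable of URLs
    are injected by the caller."""
    queue = deque([seed_url])      # step 1: target URLs
    seen = {seed_url}              # avoid re-fetching discovered URLs
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)                  # step 2: request/response
        results[url] = html                # step 4: save data
        for link in extract_links(html):   # step 3: extract new URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Because the fetching and parsing are injected, the same loop works whether the pages come from urllib, requests, or a test fixture.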
1. urllib.request Module
Import the module:
<code>import urllib.request</code> or <code>from urllib import request</code>

Create a request with custom headers and fetch the page:

<code>req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0 ...'})
res = request.urlopen(req)
html = res.read().decode('utf-8')</code>

Key methods of the response object include read(), geturl(), getcode(), and the bytes/str encoding and decoding helpers.
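A runnable sketch of the response methods above; a throwaway local HTTP server stands in for a real site so the example does not depend on the network:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
from urllib import request

class Handler(BaseHTTPRequestHandler):
    """Minimal stand-in server that returns a fixed HTML page."""
    def do_GET(self):
        body = '<html>hello</html>'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f'http://127.0.0.1:{server.server_port}/'
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0 ...'})
with request.urlopen(req) as res:
    html = res.read().decode('utf-8')  # body bytes decoded to text
    status = res.getcode()             # HTTP status code: 200
    final_url = res.geturl()           # final URL after any redirects

server.shutdown()
print(status, final_url, html)
```

Against a real site, only the `url` changes; the read/decode pattern is the same.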
2. urllib.parse (URL Encoding)
Encode query parameters:
<code>from urllib import parse

query_string = {'wd': '美女'}
encoded = parse.urlencode(query_string)  # 'wd=%E7%BE%8E%E5%A5%B3'</code>

Appended to the base URL, this gives: https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

Other useful functions: quote() and unquote().
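Where urlencode() builds a full key=value query string from a dict, quote() percent-encodes a single component and unquote() reverses it:

```python
from urllib import parse

# quote() percent-encodes one path or query component.
encoded = parse.quote('美女')
print(encoded)            # %E7%BE%8E%E5%A5%B3

# unquote() decodes it back to the original text.
decoded = parse.unquote(encoded)
print(decoded)            # 美女
```

Note that by default quote() leaves '/' unescaped (its safe parameter defaults to '/'), which is why it suits path components rather than whole query strings.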
3. re (Regular Expressions)
Find all matches:
<code>import re

matches = re.findall('pattern', html, re.S)</code>

Or compile first:

<code>pattern = re.compile('pattern', re.S)
matches = pattern.findall(html)</code>

The re.S flag makes . match newlines as well, which matters for multi-line HTML. The main features to master are the common meta-characters, greedy vs. non-greedy matching, and grouping.
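A short illustration of greedy vs. non-greedy matching and grouping on a toy HTML string:

```python
import re

html = '<div>first</div><div>second</div>'

# Greedy: .* grabs as much as possible, so one match spans both divs.
greedy = re.findall('<div>(.*)</div>', html, re.S)
print(greedy)   # ['first</div><div>second']

# Non-greedy: .*? stops at the first closing tag.
lazy = re.findall('<div>(.*?)</div>', html, re.S)
print(lazy)     # ['first', 'second']

# Multiple groups: findall returns one tuple per match;
# \1 is a backreference to the first group (the tag name).
pairs = re.findall(r'<(\w+)>(.*?)</\1>', html)
print(pairs)    # [('div', 'first'), ('div', 'second')]
```

The greedy result is almost never what a scraper wants, which is why .*? appears so often in scraping patterns.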
4. requests Library
Installation:
<code>pip install requests</code>

GET request example:

<code>import requests

headers = {'User-Agent': 'Mozilla/5.0 ...'}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
html = res.text</code>

POST request example:

<code>response = requests.post(url, data=data, headers=headers)</code>

Key parameters: url, params, headers, timeout, etc.
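With requests, the params argument does the URL encoding from section 2 automatically. A prepared request (built but not sent, so no network is needed) shows the final URL; the Baidu URL here just mirrors the earlier example:

```python
import requests

# Build a GET request without sending it, to inspect how `params`
# are encoded into the query string.
req = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': '美女'},
    headers={'User-Agent': 'Mozilla/5.0 ...'},
)
prepared = req.prepare()
print(prepared.url)   # https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
```

For a real call, `requests.get('https://www.baidu.com/s', params={'wd': '美女'}, timeout=10)` sends the same URL; timeout guards against a hung connection.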
5. XPath Parsing
XPath selects nodes in XML/HTML documents. Example HTML snippet:
<code><ul class="book_list">
<li>
<title class="book_001">Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>69.99</price>
</li>
</ul></code>

Common XPath expressions:
Find all li nodes: //li
Find title with specific class: //li/title[@class="book_001"]
Get attribute values: //title/@class
Functions like contains() filter nodes by substring, while text() selects a node's text content.
6. lxml Parsing Library
Import and parse HTML:
<code>from lxml import etree

parse_html = etree.HTML(html)</code>

Execute XPath queries:

<code>result = parse_html.xpath('xpath_expression')</code>

The result is always a list; iterate over it to extract the data.
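Putting sections 5 and 6 together on the book snippet from above. One caveat: etree.HTML() runs a lenient HTML parser that may relocate tags such as &lt;title&gt; into &lt;head&gt;, so since this snippet is well-formed, etree.fromstring() is used here to keep the structure exactly as written; on real pages, etree.HTML() is the right choice:

```python
from lxml import etree

html = '''<ul class="book_list">
  <li>
    <title class="book_001">Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>69.99</price>
  </li>
</ul>'''

# Well-formed snippet, so parse it as XML to preserve the layout.
parse_html = etree.fromstring(html)

# Every xpath() call returns a list, even for a single hit.
titles = parse_html.xpath('//li/title[@class="book_001"]/text()')
print(titles)    # ['Harry Potter']

# @class selects attribute values instead of nodes.
classes = parse_html.xpath('//title/@class')
print(classes)   # ['book_001']

# contains() filters by substring; text() extracts the text nodes.
authors = parse_html.xpath('//author[contains(text(), "Rowling")]/text()')
print(authors)   # ['J. K. Rowling']
```

Iterating over each returned list (or indexing into it after checking it is non-empty) is how the final data is extracted.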