Master Python Web Scraping: From urllib to Scrapy with Real-World Examples
This comprehensive guide walks you through Python web crawling fundamentals, covering request handling, URL encoding, regular expressions, the requests library, XPath parsing, and lxml, complete with code snippets and practical examples to help you build effective scrapers.
Overview of Web Crawlers
A web crawler (also called a spider or robot) is a program that fetches web data by mimicking the way a person browses pages in a browser. It is used to collect large datasets for analysis or testing, or when third-party data is unavailable or too expensive.
Why Use Python for Crawling
Rich, mature request and parsing modules; powerful Scrapy framework.
Compared to PHP, Java, and C/C++, Python offers more concise code and strong library support.
Crawler Types
A. General crawlers (search engines) that obey robots.txt.
B. Custom crawlers written by developers.
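As a rough illustration of what obeying robots.txt can look like in code, the standard library's urllib.robotparser can check a URL against a set of rules before it is fetched. The rules and the example.com URLs below are made up for this sketch; a real crawler would load the target site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network access is needed.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]


def is_allowed(url, user_agent="*"):
    """Return True if the sample rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/index.html"))
print(is_allowed("https://example.com/private/data.html"))
```

A custom crawler can run the same check before each request if it wants to follow the site's stated rules.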
Typical Crawling Steps
Identify target URLs.
Send requests and receive responses.
Extract required data from the response content.
Save data and repeat for discovered URLs.
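The four steps above can be sketched as a small breadth-first loop. To keep the example runnable offline, the fetch step reads from a hypothetical in-memory dictionary (FAKE_PAGES) rather than the network, and the link and title patterns are assumptions chosen for this toy data only.

```python
import re
from collections import deque

# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_PAGES = {
    "/index": '<a href="/a">A</a> <a href="/b">B</a> <h1>Home</h1>',
    "/a": '<a href="/b">B</a> <h1>Page A</h1>',
    "/b": '<a href="/index">Home</a> <h1>Page B</h1>',
}

LINK_RE = re.compile(r'href="(.*?)"')
TITLE_RE = re.compile(r"<h1>(.*?)</h1>")


def crawl(start_url, fetch, max_pages=10):
    """Fetch pages breadth-first, saving one title per URL."""
    queue = deque([start_url])           # step 1: target URLs
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)                # step 2: request and response
        if html is None:
            continue
        titles = TITLE_RE.findall(html)  # step 3: extract data
        results[url] = titles[0] if titles else None  # step 4: save
        for link in LINK_RE.findall(html):
            if link not in seen:         # queue newly discovered URLs once
                seen.add(link)
                queue.append(link)
    return results


pages = crawl("/index", FAKE_PAGES.get)
print(pages)
```

Passing the fetch function in as an argument means the same loop could later be given a function that performs real HTTP requests.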
1. urllib.request Module
Import the module:
import urllib.request
from urllib import request
Create a request with custom headers and fetch the page:
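req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0 ...'})
res = request.urlopen(req)
html = res.read().decode('utf-8')
Key methods of the response object include read() (returns the body as bytes), geturl() (the URL actually retrieved, after any redirects), and getcode() (the HTTP status code). Call decode() on the bytes to get a string, and encode() on a string to get bytes.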
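A minimal sketch of the same flow wrapped in functions. The names build_request and fetch_html are hypothetical, and the "Mozilla/5.0" User-Agent value and the example.com URL are placeholders. Only the request object is built when the script runs, so nothing is sent over the network unless fetch_html is called.

```python
from urllib import request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request object that carries a custom User-Agent header."""
    # The default user_agent is a placeholder; use a full browser string in practice.
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, encoding="utf-8", timeout=10):
    """Fetch url and return (final_url, status_code, decoded_text)."""
    req = build_request(url)
    with request.urlopen(req, timeout=timeout) as res:
        # read() returns bytes; decode() converts them to str.
        return res.geturl(), res.getcode(), res.read().decode(encoding)


req = build_request("https://example.com/")
print(req.full_url)
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```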
2. urllib.parse (URL Encoding)
Encode query parameters:
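from urllib import parse
query_string = {'wd': '美女'}
encoded = parse.urlencode(query_string)
Result: wd=%E7%BE%8E%E5%A5%B3, which is appended to the base URL to give https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3
Other useful functions: quote(), which percent-encodes a single string, and unquote(), which decodes it.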
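The three functions can be compared side by side. This sketch reuses the article's query value; BASE_URL is a name introduced here for illustration.

```python
from urllib import parse

BASE_URL = "https://www.baidu.com/s?"

query_string = {"wd": "美女"}
encoded = parse.urlencode(query_string)  # dict -> "key=value" pairs, percent-encoded
full_url = BASE_URL + encoded

quoted = parse.quote("美女")        # encodes one string, without a key
restored = parse.unquote(quoted)    # reverses the percent-encoding

print(encoded)
print(full_url)
print(quoted, restored)
```

urlencode() is convenient when there are several parameters, while quote() fits cases where only a single value needs encoding before being placed into a URL template.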
3. re (Regular Expressions)
Find all matches:
import re
matches = re.findall('pattern', html, re.S)
Or compile first:
pattern = re.compile('pattern', re.S)
matches = pattern.findall(html)
The re.S flag lets . match newline characters, which matters because HTML usually spans many lines. The features used most often when scraping are common meta-characters, greedy vs. non-greedy matching, and grouping.
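A small sketch of greedy versus non-greedy matching and of grouping. The HTML fragment is invented for the example.

```python
import re

# Hypothetical HTML fragment spanning several lines.
html = """<div class="item">
<p>first</p>
</div>
<div class="item">
<p>second</p>
</div>"""

# Greedy: .* consumes as much as possible, so one match spans both divs.
greedy = re.findall(r"<div.*</div>", html, re.S)

# Non-greedy: .*? stops at the first closing tag, giving one match per div.
lazy = re.findall(r"<div.*?</div>", html, re.S)

# Grouping: with one capture group, findall returns only the captured text.
pattern = re.compile(r"<p>(.*?)</p>", re.S)
texts = pattern.findall(html)

print(len(greedy), len(lazy), texts)
```

In scraping, the non-greedy form combined with a capture group is usually the one wanted, since it returns one result per repeated element.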
4. requests Library
Installation: pip install requests
GET request example:
import requests
headers = {'User-Agent': 'Mozilla/5.0 ...'}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
html = res.text
POST request example:
response = requests.post(url, data=data, headers=headers)
Key parameters: url, params (query-string values for GET), data (form fields for POST), headers, timeout (seconds to wait before giving up), etc.
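A hedged sketch of how params, headers, and timeout fit together. The helper names preview_get and fetch_text are hypothetical, the User-Agent value is a placeholder, and the wd=python query is made up. preview_get only prepares the request, so the final URL can be inspected without any network traffic.

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder value


def preview_get(url, params):
    """Build a GET request without sending it, to show the encoded URL."""
    return requests.Request("GET", url, params=params, headers=HEADERS).prepare()


def fetch_text(url, params=None, timeout=5):
    """Send the GET request and return the response body as text."""
    res = requests.get(url, params=params, headers=HEADERS, timeout=timeout)
    res.raise_for_status()   # raise an exception on 4xx/5xx responses
    res.encoding = "utf-8"   # assumes the page is UTF-8 encoded
    return res.text


prepared = preview_get("https://www.baidu.com/s", {"wd": "python"})
print(prepared.url)
```

Setting a timeout is worth the extra argument, because without one a request can wait indefinitely on an unresponsive server.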
5. XPath Parsing
XPath selects nodes in XML/HTML documents. Example HTML snippet:
<ul class="book_list">
<li>
<title class="book_001">Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>69.99</price>
</li>
</ul>
Common XPath expressions:
Find all li nodes: //li
Find title with specific class: //li/title[@class="book_001"]
Get attribute values: //title/@class
contains() filters nodes by a partial match on an attribute or text value, and text() selects a node's text content.
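To see what these expressions return, the sketch below runs them with lxml against a copy of the book snippet. Because the snippet is well-formed XML, it is parsed here with etree.fromstring, a choice made for this example so that every tag stays exactly where it was written.

```python
from lxml import etree

SNIPPET = """<ul class="book_list">
  <li>
    <title class="book_001">Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>69.99</price>
  </li>
</ul>"""

root = etree.fromstring(SNIPPET)

li_nodes = root.xpath("//li")                                # element nodes
titles = root.xpath('//li/title[@class="book_001"]/text()')  # text content
classes = root.xpath("//title/@class")                       # attribute values
by_contains = root.xpath('//title[contains(@class, "book_")]/text()')

print(len(li_nodes), titles, classes, by_contains)
```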
6. lxml Parsing Library
Import and parse HTML:
from lxml import etree
parse_html = etree.HTML(html)
Execute XPath queries:
result = parse_html.xpath('xpath_expression')
For expressions that select nodes, the result is a list (empty when nothing matches); iterate over it to extract data.
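A short sketch of the parse-then-iterate pattern. The page content and the records structure are hypothetical; a real scraper would pass in the HTML it downloaded.

```python
from lxml import etree

# Hypothetical page content used only for this example.
html = """
<html><body>
  <ul id="links">
    <li><a href="/page/1">First</a></li>
    <li><a href="/page/2">Second</a></li>
  </ul>
</body></html>
"""

parse_html = etree.HTML(html)

records = []
for a in parse_html.xpath('//ul[@id="links"]/li/a'):
    # Each item is an element: .text is its text, .get() reads an attribute.
    records.append({"text": a.text, "href": a.get("href")})

missing = parse_html.xpath("//table")  # no matching nodes gives an empty list

print(records)
print(missing)
```

Checking for an empty list before indexing into the result avoids an IndexError when a page's layout differs from what the expression expects.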
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
