Master Web Crawling with Scrapy: From Tech Choices to Powerful Regex Extraction
This guide walks through selecting Scrapy over Requests + BeautifulSoup, explains web page types, outlines crawler use‑cases, details regular‑expression syntax and non‑greedy matching, demonstrates practical regex patterns with images, compares depth‑first and breadth‑first crawling, and covers URL deduplication and string‑encoding pitfalls in Python.
Technology Selection: Scrapy vs Requests + BeautifulSoup
Scrapy is chosen because it is a full‑featured framework: it provides requests‑like HTTP functionality out of the box, is built on the asynchronous Twisted networking engine for high performance, ships with built‑in CSS and XPath selectors, and offers many extensions that speed up development.
Web Page Classification
Static pages – pre‑generated HTML with fixed content.
Dynamic pages – content generated on the server per request.
Web services (REST APIs) – dynamic data served by an API, typically fetched by the page through AJAX calls.
What Crawlers Can Do
Search engines (e.g., Baidu, Google).
Recommendation engines (e.g., Toutiao).
Provide data samples for machine‑learning models.
Support data analysis such as financial or sentiment analysis.
Why Regular Expressions Are Needed
Even with CSS or XPath selectors, you often need to further filter extracted strings (e.g., isolate numbers, specific patterns). Regex lets you test whether a string matches a pattern and extract the important parts.
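As a minimal sketch of that filtering step, assume a selector has already returned the strings below. The sample values and the helper name extract_int are made up for illustration; only Python's standard re module is used.

```python
import re


def extract_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Hypothetical strings, as a CSS or XPath selector might return them.
raw_values = [" 12 comments ", "Score: 87", "no number here"]
cleaned = [extract_int(value) for value in raw_values]
print(cleaned)
```

Returning None for a non-matching string keeps the caller in control of how missing values are handled.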
Common Regex Tokens
^ : start of string
$ : end of string
. : any character except a newline (by default)
* : zero or more repetitions
+ : one or more repetitions
? : zero or one occurrence of the preceding token; placed after a quantifier (*?, +?), makes that quantifier non‑greedy
() : capture group
{n} : exactly n repetitions
{n,} : at least n repetitions
{n,m}: between n and m repetitions
| : logical OR
[] : character class (e.g., [a‑z])
\s : whitespace character
\S : non‑whitespace character
\w : word character [A‑Za‑z0‑9_] (in Python 3 str patterns it also matches Unicode letters and digits unless re.ASCII is set)
\W : non‑word character
\d : digit
[\u4E00-\u9FA5] : any common Chinese character (a Unicode range, written inside a character class)
Coding Demonstration
Start anchor: ^J – matches strings beginning with “J”.
End anchor: 4$ – matches strings ending with “4”.
Combined pattern: ^J.*4$ – strings that start with “J”, have any characters in between, and end with “4”.
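The three patterns above can be tried out with Python's re module. This is a small sketch; the sample strings are invented for the demonstration.

```python
import re

samples = ["Java1234", "Python4", "Java123", "J4"]  # made-up test strings

# re.match anchors at the start of the string; re.search scans the whole string.
starts_with_j = [s for s in samples if re.match(r"^J", s)]
ends_with_4 = [s for s in samples if re.search(r"4$", s)]
both = [s for s in samples if re.match(r"^J.*4$", s)]

print(starts_with_j)
print(ends_with_4)
print(both)
```

Note that "J4" satisfies the combined pattern because .* may match zero characters.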
Non‑Greedy Matching
By default, regex quantifiers are greedy: they match the longest possible substring. Adding ? after a quantifier makes it non‑greedy, so it matches the shortest possible substring.
For example, o+? on “oooo” matches a single “o”, while o+ matches all four “o” characters.
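In the demonstration, the original greedy pattern extracted only “bb” from “bobby123”, because a greedy quantifier in front of the capture group consumed as much of the string as it could. Making that quantifier non‑greedy with ? fixed the extraction.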
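The exact pattern from the demonstration did not survive in the text, so the sketch below assumes one of the form .*(b.*b).* to reproduce the “bb” result on “bobby123” and then shows the effect of adding ?.

```python
import re

line = "bobby123"

# Greedy: the leading .* grabs as much as it can, leaving only "bb" for the group.
greedy = re.match(r".*(b.*b).*", line).group(1)

# A non-greedy leading .*? stops at the first "b", so the group starts there.
lazy_prefix = re.match(r".*?(b.*b).*", line).group(1)

# Making the inner quantifier non-greedy too ends the group at the next "b".
lazy_both = re.match(r".*?(b.*?b).*", line).group(1)

print(greedy, lazy_prefix, lazy_both)

# The o+ versus o+? example from the text.
all_os = re.search(r"o+", "oooo").group()
one_o = re.search(r"o+?", "oooo").group()
print(all_os, one_o)
```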
Limiting Occurrences
Exact count: {1} – exactly one occurrence.
At least three: {3,}.
Between two and five: {2,5}.
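A short sketch of the three counting forms, using invented sample strings:

```python
import re

# {1}: exactly one character between two "b"s.
exactly_one = re.findall(r"b.{1}b", "bob bab boob")

# {3,}: runs of at least three digits.
at_least_three = re.findall(r"\d{3,}", "7 42 123 98765")

# {2,5}: whole strings made of two to five lowercase letters.
candidates = ["a", "ab", "abcde", "abcdef"]
two_to_five = [s for s in candidates if re.fullmatch(r"[a-z]{2,5}", s)]

print(exactly_one)
print(at_least_three)
print(two_to_five)
```

re.fullmatch is used for the last case so that the length limit applies to the whole string rather than to a substring of it.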
Character Classes and OR Operator
Class: [abc] – any of a, b, or c.
Negated class: [^0-9] – any character except digits.
OR: a|b – matches “a” or “b”.
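These three constructs might be exercised as follows. This is only a sketch, and the input strings are made up.

```python
import re

# [abc]: every character that is a, b, or c.
abc_chars = re.findall(r"[abc]", "cabbage")

# [^0-9]: strip everything that is not a digit.
digits_only = re.sub(r"[^0-9]", "", "Order #A-1024, qty 3")

# cat|dog: accept either alternative as the whole string.
animals = [s for s in ["cat", "dog", "bird"] if re.fullmatch(r"cat|dog", s)]

print(abc_chars)
print(digits_only)
print(animals)
```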
Unicode & Chinese Characters
Use [\u4E00-\u9FA5], with the range placed inside a character class, to match any common Chinese character. The original examples, shown as images, matched university names and phone numbers.
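Since the example images are not reproduced here, the following sketch illustrates the university-name case with a made-up sentence. It assumes Python 3, where the re module understands \uXXXX escapes in str patterns.

```python
import re

text = "study in 南京大学"  # made-up sample sentence

# One or more Chinese characters followed by 大学 ("university").
match = re.search(r"[\u4E00-\u9FA5]+大学", text)
university = match.group() if match else None
print(university)

# For comparison: .* is not limited to Chinese characters,
# so it also swallows the English words in front.
too_much = re.search(r".*大学", text).group()
print(too_much)
```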
Depth‑First vs Breadth‑First Crawling
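Scrapy crawls in depth‑first order by default, because its scheduler keeps pending requests in LIFO queues. In general, depth‑first traversal is implemented with recursion or a stack, and breadth‑first traversal with a FIFO queue.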
Depth‑first explores a branch completely before backtracking; breadth‑first visits all nodes at the current depth before moving deeper.
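The two visiting orders can be compared on a toy link graph. This is a plain-Python sketch, not Scrapy's scheduler; the links dictionary and page names are hypothetical.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def depth_first(page, seen=None):
    """Recursive depth-first traversal; returns pages in visit order."""
    if seen is None:
        seen = []
    if page in seen:
        return seen
    seen.append(page)
    for child in links.get(page, []):
        depth_first(child, seen)
    return seen


def breadth_first(start):
    """Queue-based breadth-first traversal; returns pages in visit order."""
    order, queue, visited = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in links.get(page, []):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return order


print(depth_first("A"))
print(breadth_first("A"))
```

Both functions skip pages they have already seen, which is the same concern that URL deduplication addresses in a real crawler.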
URL Deduplication Strategies
Store visited URLs in a database and query each time (slow).
Keep URLs in a Python set for O(1) look‑ups (high memory usage).
Hash URLs with md5 and store the hash in a set to reduce size.
Use a bitmap to map hashed URLs to bits (compact but higher collision risk).
Bloom filter – multiple hash functions to lower collisions while keeping memory low.
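Two of these strategies can be sketched with the standard library alone: a set of md5 digests and a small Bloom filter. The class names, bit-array size, and the way the hash functions are derived (md5 with a one-byte seed prefix) are illustrative assumptions, not a tuned implementation.

```python
import hashlib


class Md5SeenSet:
    """Store a fixed 16-byte md5 digest per URL instead of the full string."""

    def __init__(self):
        self._digests = set()

    def add(self, url):
        """Record url; return True if it was new, False if already seen."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


class BloomFilter:
    """Bit array plus several hash functions; may give false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must stay below 256 for the seed byte
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Assumed scheme: derive each hash by prefixing a seed byte to the URL.
        data = url.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.md5(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


seen = Md5SeenSet()
bloom = BloomFilter()
bloom.add("https://example.com/a")
```

A Bloom filter never reports a stored URL as missing, but it can occasionally report an unseen URL as present, so a crawler using it may skip a small number of pages.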
String Encoding Overview
ASCII is a 7‑bit code (values 0‑127, stored in one byte) and cannot represent Chinese characters. GB2312 uses two bytes for Chinese glyphs and remains compatible with ASCII. Unicode unifies all scripts in a single character set; UTF‑8 encodes ASCII in one byte and most Chinese characters in three bytes (rarer ones in four), saving space for English‑heavy text.
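Python 2 stores str as bytes, so text must be explicitly decoded to Unicode before it can be encoded into another encoding. Python 3 str is Unicode internally, so conversion is only needed at I/O boundaries, where str is encoded to bytes and bytes are decoded back to str.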
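In Python 3 the byte counts and the encode/decode round trip can be checked directly. This is a small sketch with a made-up sample string.

```python
text = "中文 abc"  # two Chinese characters, a space, three ASCII letters

utf8_bytes = text.encode("utf-8")    # 3 bytes per Chinese character here
gb_bytes = text.encode("gb2312")     # 2 bytes per Chinese character

print(len(text), len(utf8_bytes), len(gb_bytes))

# Decoding with the matching codec restores the original str.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == text and gb_bytes.decode("gb2312") == text
)

# ASCII cannot represent Chinese characters, so encoding fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(round_trip_ok, ascii_failed)
```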
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
