Master Web Crawling with Scrapy: From Tech Choices to Powerful Regex Extraction

This guide walks through selecting Scrapy over Requests + BeautifulSoup, explains web page types, outlines crawler use‑cases, details regular‑expression syntax and non‑greedy matching, demonstrates practical regex patterns with images, compares depth‑first and breadth‑first crawling, and covers URL deduplication and string‑encoding pitfalls in Python.

JavaEdge

Technology Selection: Scrapy vs Requests + BeautifulSoup

Scrapy is chosen because it is a full-featured framework, whereas Requests and BeautifulSoup are libraries. It covers what Requests provides, is built on the asynchronous Twisted networking engine for high performance, has built-in CSS and XPath selectors, and offers many extensions that speed up development.

Web Page Classification

Static pages – pre‑generated HTML with fixed content.

Dynamic pages – content generated on the server per request.

Web services (REST APIs) – endpoints that return dynamic data, typically requested via AJAX calls.

What Crawlers Can Do

Search engines (e.g., Baidu, Google).

Recommendation engines (e.g., Toutiao).

Provide data samples for machine‑learning models.

Support data analysis such as financial or sentiment analysis.

Why Regular Expressions Are Needed

Even with CSS or XPath selectors, you often need to further filter extracted strings (e.g., isolate numbers, specific patterns). Regex lets you test whether a string matches a pattern and extract the important parts.

Common Regex Tokens

^   : start of string
$   : end of string
.   : any character except a newline (by default)
*   : zero or more repetitions
+   : one or more repetitions
?   : zero or one of the preceding token; placed after a quantifier (*?, +?) it makes that quantifier non-greedy
()  : capture group
{n} : exactly n repetitions
{n,} : at least n repetitions
{n,m}: between n and m repetitions
|   : logical OR
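[]  : character class (e.g., [a-z])
\s  : whitespace character
\S  : non-whitespace character
\w  : word character ([A-Za-z0-9_]; in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)
\W  : non-word character
\d  : digit
[\u4E00-\u9FA5] : a Chinese character (the Unicode range must sit inside a character class)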

Coding Demonstration

Start anchor: ^J – matches strings beginning with “J”.

End anchor: 4$ – matches strings ending with “4”.

Combined pattern: ^J.*4$ – strings that start with “J”, have any characters in between, and end with “4”.
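The three patterns above can be tried with Python's re module. This is a minimal sketch; the sample strings are made up for illustration and do not come from the original article.

```python
import re

# Hypothetical sample strings, chosen only to exercise the anchors.
samples = ["Java1234", "Java123", "python4"]

starts_with_j = [s for s in samples if re.search(r"^J", s)]
ends_with_4 = [s for s in samples if re.search(r"4$", s)]
both = [s for s in samples if re.search(r"^J.*4$", s)]

print(starts_with_j)  # ['Java1234', 'Java123']
print(ends_with_4)    # ['Java1234', 'python4']
print(both)           # ['Java1234']
```

Only "Java1234" satisfies both anchors, so it is the only string the combined pattern accepts.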

Non‑Greedy Matching

By default, quantifiers are greedy: they match the longest possible substring. Adding ? after a quantifier makes it non-greedy, so it matches the shortest possible substring.

For example, o+? on “oooo” matches a single “o”, while o+ matches all four “o” characters.

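In the original example, a greedy pattern extracted only “bb” from “bobby123”, because a greedy quantifier in front of the capture group consumed as much of the string as it could.

Making the quantifiers non-greedy with ? fixed the extraction.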
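The “bobby123” case can be reproduced as follows. The exact pattern from the original article is not preserved in the text, so `.*(b.*b).*` is an assumption that yields the same “bb” result described above.

```python
import re

line = "bobby123"

# Assumed greedy pattern: the leading .* grabs as much as it can,
# leaving only the last possible "b...b" pair for the group.
greedy = re.match(r".*(b.*b).*", line).group(1)

# Non-greedy version: .*? stops as early as possible, so the group
# starts at the first "b" and ends at the next one.
lazy = re.match(r".*?(b.*?b).*", line).group(1)

print(greedy)  # bb
print(lazy)    # bob

# The o+ example from the text.
print(re.match(r"o+", "oooo").group(0))   # oooo
print(re.match(r"o+?", "oooo").group(0))  # o
```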

Limiting Occurrences

Exact count: {1} – exactly one occurrence.

At least three: {3,}.

Between two and five: {2,5}.
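A short sketch of the three counted quantifiers, using made-up input strings:

```python
import re

# fullmatch requires the whole string to fit the pattern.
exactly_one = re.fullmatch(r"b{1}", "b") is not None       # one "b": matches
at_least_three = re.fullmatch(r"b{3,}", "bb") is not None  # only two: no match

# search finds the first run; {2,5} is greedy but capped at five.
two_to_five = re.search(r"b{2,5}", "abbbbbbb").group(0)

print(exactly_one, at_least_three, two_to_five)  # True False bbbbb
```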

Character Classes and OR Operator

Class: [abc] – any of a, b, or c.

Negated class: [^0-9] – any character except digits.

OR: a|b – matches “a” or “b”.
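The class, negated class, and OR forms can be compared on one string. The input text and the cat/dog alternation are illustrative assumptions only.

```python
import re

text = "a1b2c3-d4"

in_class = re.findall(r"[abc]", text)     # only a, b, or c
not_digits = re.findall(r"[^0-9]", text)  # everything that is not a digit

# Alternation inside a group: the group captures whichever branch matched.
either = re.fullmatch(r"(cat|dog)s?", "dogs")

print(in_class)         # ['a', 'b', 'c']
print(not_digits)       # ['a', 'b', 'c', '-', 'd']
print(either.group(1))  # dog
```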

Unicode & Chinese Characters

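Use the character class [\u4E00-\u9FA5] to match a Chinese character; the range covers the commonly used Chinese characters. The original examples apply it to matching university names and also show a pattern for phone numbers.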
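The original example images are not available here, so the following is a sketch under assumptions: the sample sentence is invented, and the phone pattern `1[3-9]\d{9}` is a hypothetical rule for 11-digit mobile numbers, not the article's own pattern.

```python
import re

line = "study in 南京大学"

# One or more Chinese characters followed by the literal suffix 大学.
name = re.search(r"[\u4E00-\u9FA5]+大学", line).group(0)
print(name)  # 南京大学

# Hypothetical phone rule: "1", then a digit 3-9, then nine more digits.
phone = re.compile(r"1[3-9]\d{9}")
print(phone.fullmatch("13812345678") is not None)  # True
print(phone.fullmatch("1381234567") is not None)   # False (one digit short)
```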

Depth‑First vs Breadth‑First Crawling

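By default Scrapy crawls in depth-first order, because its scheduler stores pending requests in a LIFO queue. In general, depth-first traversal is implemented with recursion or a stack, and breadth-first traversal with a FIFO queue.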

Depth‑first explores a branch completely before backtracking; breadth‑first visits all nodes at the current depth before moving deeper.
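The two visiting orders can be compared on a tiny link graph. This is a standalone sketch, not Scrapy code; the `links` dictionary is a hypothetical site structure where each page maps to the pages it links to.

```python
from collections import deque

# Hypothetical site: A links to B and C, B links to D and E, C links to F.
links = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def depth_first(start):
    """Follow each branch to the end before backtracking (recursion)."""
    order, seen = [], set()

    def visit(page):
        if page in seen:
            return
        seen.add(page)
        order.append(page)
        for nxt in links[page]:
            visit(nxt)

    visit(start)
    return order

def breadth_first(start):
    """Visit every page at the current depth before going deeper (FIFO queue)."""
    order, seen = [], {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(depth_first("A"))    # ['A', 'B', 'D', 'E', 'C', 'F']
print(breadth_first("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Both functions keep a `seen` set so that a page reachable through several links is visited only once, which is the same concern the deduplication strategies address.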

URL Deduplication Strategies

Store visited URLs in a database and query each time (slow).

Keep URLs in a Python set for O(1) look‑ups (high memory usage).

Hash URLs with md5 and store the hash in a set to reduce size.

Use a bitmap to map hashed URLs to bits (compact but higher collision risk).

Bloom filter – multiple hash functions to lower collisions while keeping memory low.
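Two of the strategies above can be sketched with the standard library alone. `SeenUrls` and `BloomFilter` are hypothetical names, and the Bloom filter is a toy: the bit-array size, the number of hashes, and the salted SHA-256 scheme are assumptions chosen for readability rather than tuned values.

```python
import hashlib

class SeenUrls:
    """Set of MD5 digests: 16 bytes per URL, whatever the URL length."""

    def __init__(self):
        self._digests = set()

    def add_if_new(self, url):
        """Return True if the URL was not seen before, and record it."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

class BloomFilter:
    """Toy Bloom filter: k bit positions per item in an m-bit array."""

    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May report a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = SeenUrls()
print(seen.add_if_new("https://example.com/a"))  # True
print(seen.add_if_new("https://example.com/a"))  # False

bloom = BloomFilter()
bloom.add("https://example.com/a")
print("https://example.com/a" in bloom)  # True
```

A URL is treated as seen only when all k of its bits are set, which is why using several hash functions lowers the collision rate compared with a single-bit bitmap.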

String Encoding Overview

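ASCII is a 7-bit code (values 0-127, usually stored one per byte) and cannot represent Chinese characters. GB2312 uses two bytes for each Chinese character and stays compatible with ASCII. Unicode assigns a code point to the characters of all scripts; UTF-8 is a variable-length encoding of Unicode that stores ASCII characters in one byte and most Chinese characters in three bytes (rarer ones take four), which saves space for English-heavy text.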

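In Python 2, str holds bytes, so converting between encodings means decoding to unicode first and then encoding to the target encoding. In Python 3, str is Unicode, so text needs no manual conversion internally; encoding and decoding happen only at the boundaries where text is read or written as bytes.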

Written by

JavaEdge

First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
