Master Python Regular Expressions: From Basics to Real-World Scraping
This article introduces Python's re module, explains core functions such as match, search, sub, and compile, walks through practical code examples, and shows how to apply regular expressions to web-scraping tasks.
1. Introduction
Regular expressions are special character sequences that let you check whether a string matches a given pattern. This article continues a series on Python regex and builds on the earlier parts.
2. Overview of the re Module
Since Python 1.5, the re module has provided Perl-style regular expression support, giving Python full regex capabilities.
3. re.match Function
re.match(pattern, string, flags=0) attempts to match a pattern at the start of a string. On success it returns a match object; if it fails, it returns None.
You can retrieve matched text with group(num) or groups(). group(0) returns the entire match; passing several group numbers, as in group(1, 2), returns a tuple of the corresponding substrings; groups() returns a tuple of all captured groups.
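A minimal sketch of re.match and the group methods described above. The sample sentence and variable names are illustrative assumptions, not taken from the original article. It also shows re.search for contrast, since match only tries position 0.

```python
import re

text = "Cats are smarter than dogs"  # sample sentence, chosen for illustration

# re.match only succeeds if the pattern matches at the very start of the string.
m = re.match(r"(\w+) are (\w+) than (\w+)", text)

if m:
    whole = m.group(0)        # the entire match
    first = m.group(1)        # first captured group
    pair = m.group(1, 3)      # several group numbers give a tuple
    everything = m.groups()   # all captured groups as a tuple
    print(whole, first, pair, everything)

# "dogs" is not at position 0, so match returns None ...
no_match = re.match(r"dogs", text)
# ... whereas re.search scans the whole string for the first occurrence.
found = re.search(r"dogs", text)
print(no_match, found.span())
```

Checking the result against None (or using `if m:`) before calling group() avoids an AttributeError when the pattern does not match.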
4. Search and Replace with re.sub
re.sub(pattern, repl, string, count=0, flags=0) replaces occurrences of a pattern in a string and returns the new string.
pattern : regex pattern string
repl : replacement string or function
string : original text
count : maximum replacements (0 means all)
flags : optional regex flags
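The parameters above can be sketched as follows. The sample strings and the helper name `double` are assumptions made for illustration. The sketch covers a plain string replacement, the count limit, and a function used as repl.

```python
import re

phone = "2004-959-559 # a sample phone number"  # illustrative input

# Strip the trailing comment: optional whitespace, '#', then everything to the end.
no_comment = re.sub(r"\s*#.*$", "", phone)

# Remove every non-digit character.
digits = re.sub(r"\D", "", no_comment)

# count limits the number of replacements (here only the first hyphen).
first_only = re.sub(r"-", "/", no_comment, count=1)

# repl may be a function: it receives the match object and returns a string.
def double(match):
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, "A23G4HFD567")

print(no_comment, digits, first_only, doubled)
```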
5. re.compile Function
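re.compile(pattern, flags=0) compiles a regex pattern into a reusable pattern object whose match() and search() methods can be called repeatedly. Flags include:
re.I : ignore case
re.L : locale-aware matching (in Python 3, valid only with bytes patterns)
re.M : multiline mode (^ and $ also match at the start and end of each line)
re.S : dot matches newline
re.U : Unicode character properties (already the default for str patterns in Python 3)
re.X : verbose mode (allows comments and whitespace in the pattern)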
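A short sketch of compiling a pattern once and reusing it, assuming some made-up sample strings. It also shows re.X, which lets a longer pattern carry its own comments.

```python
import re

# Compile once, reuse many times. re.I makes the match case-insensitive.
word = re.compile(r"python", re.I)

lines = ["Python is fun", "I like PYTHON", "no snakes here"]  # sample data

starts = [bool(word.match(s)) for s in lines]     # must match at position 0
contains = [bool(word.search(s)) for s in lines]  # may match anywhere

# re.X (verbose): whitespace and '#' comments inside the pattern are ignored.
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)

parts = date.match("2018-08-01").groups()
print(starts, contains, parts)
```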
6. Regex Objects
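re.compile() returns a compiled pattern object (RegexObject in older documentation, re.Pattern in current Python 3). A successful match returns a match object (MatchObject in older documentation, re.Match in current Python 3), which provides methods such as group(), start(), end(), and span() to access match details.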
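The match object methods listed above can be tried on a small example. The input string is an assumption for illustration, and the isinstance checks assume a recent Python 3 where re.Pattern and re.Match are exposed.

```python
import re

pattern = re.compile(r"(\d+)-(\d+)")
m = pattern.search("order 123-4567 shipped")  # illustrative input

matched = m.group()     # text of the whole match
begin = m.start()       # index where the match begins
finish = m.end()        # index just past the end of the match
whole_span = m.span()   # (start, end) of the whole match
second_span = m.span(2) # (start, end) of the second group

# The concrete types in recent Python 3 versions.
is_pattern = isinstance(pattern, re.Pattern)
is_match = isinstance(m, re.Match)

print(matched, begin, finish, whole_span, second_span)
```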
7. Regex Modifiers (Flags)
Flags control matching behavior and can be combined with bitwise OR (e.g., re.I | re.M).
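As a rough illustration of combining flags, the sketch below runs the same patterns with and without them. The three-line sample text is an assumption.

```python
import re

text = "first line\nSECOND line\nthird line"  # illustrative input

# No flags: ^ anchors only at the start of the whole string, case-sensitive.
plain = re.findall(r"^second", text)

# re.I | re.M: case-insensitive, and ^ matches at the start of each line.
combined = re.findall(r"^second", text, re.I | re.M)

# re.S: '.' also matches '\n', so a match can span several lines.
dotall = re.search(r"first.*third", text, re.S)
no_dotall = re.search(r"first.*third", text)

print(plain, combined, dotall is not None, no_dotall)
```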
8. Regex Pattern Syntax
Pattern strings use special syntax: most literal characters match themselves, a backslash either escapes a special character or introduces a special sequence such as \d, and raw strings (e.g., r'\t') are recommended to avoid double escaping.
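The sketch below shows why raw strings help, using a made-up path and sentence. Without the r prefix, Python processes backslashes before the regex engine ever sees them.

```python
import re

# '\t' is one tab character; r'\t' is two characters (backslash and 't'),
# which the regex engine then interprets as "match a tab".
cooked_len = len("\t")
raw_len = len(r"\t")

# Matching a literal backslash: the regex needs \\, so a normal string
# literal needs four backslashes while a raw string needs two.
path = "C:\\temp"  # the text C:\temp
cooked = re.search("\\\\temp", path)
raw = re.search(r"\\temp", path)

# \b is a backspace character in a normal literal,
# but a word boundary when it reaches the regex engine intact.
boundary = re.findall(r"\bcat\b", "cat concat cat.")
backspace = re.findall("\bcat\b", "cat concat cat.")

print(cooked_len, raw_len, raw.group(), boundary, backspace)
```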
Typical elements include character classes, quantifiers, anchors, and groups. A reference table (image) illustrates these components.
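A few of these elements in action, as a sketch over an invented log line. It also contrasts greedy and non-greedy quantifiers, since the non-greedy form .*? matters for scraping.

```python
import re

log = "2018-08-01 ERROR disk full; 2018-08-02 INFO ok"  # illustrative input

# Character class \d with {n} quantifiers: digits in a date-like layout.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log)

# Anchor: ^ ties the pattern to the start of the string.
starts_with_year = bool(re.search(r"^\d{4}", log))

# Groups: with several groups, findall returns one tuple per match.
entries = re.findall(r"(\d{4}-\d{2}-\d{2}) ([A-Z]+)", log)

# Greedy .* takes as much as possible; non-greedy .*? takes as little.
greedy = re.search(r"<.*>", "<b>bold</b>").group()
lazy = re.search(r"<.*?>", "<b>bold</b>").group()

print(dates, starts_with_year, entries, greedy, lazy)
```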
9. Practical Application: Scraping Movie Data
This section uses regex to extract movie titles, starring actors, and release times from a web page (example: Maoyan movies). Each movie sits in a <div> container, and a compiled pattern pulls the fields out of it.
pattern = re.compile('<div>.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>', re.S)
The pattern captures the movie name, starring actor, and release time, enabling batch extraction of multiple fields. The non-greedy .*? skips over the intervening markup, and re.S lets it cross line breaks.
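To see the pattern at work without fetching a live page, the sketch below applies it to a small hand-written HTML fragment. The fragment, the movie names, and the dictionary keys are all hypothetical, so the real page markup may differ and the pattern may need adjusting.

```python
import re

# Hypothetical HTML fragment, invented for illustration only.
html = """
<div>
  <a href="/films/1" title="Movie A"></a>
  <p class="star">
    Starring: Actor One, Actor Two
  </p>
  <p class="releasetime">Release: 1993-01-01</p>
</div>
<div>
  <a href="/films/2" title="Movie B"></a>
  <p class="star">
    Starring: Actor Three
  </p>
  <p class="releasetime">Release: 1994-09-10</p>
</div>
"""

# Same pattern as in the article, written as a raw string.
pattern = re.compile(
    r'<div>.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>',
    re.S,
)

# findall returns one (title, star, releasetime) tuple per movie block.
movies = [
    {"title": t.strip(), "star": s.strip(), "releasetime": r.strip()}
    for t, s, r in pattern.findall(html)
]

for movie in movies:
    print(movie)
```

The strip() calls remove the line breaks and indentation that the captured groups pick up from the markup.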
10. Summary
Regular expressions work well when you need to pull several fields out of text in a single pass.
This article covered regex fundamentals, core functions, flags, pattern syntax, and a real‑world scraping example.
Python Crawling & Data Mining