
Master Python Regular Expressions: From Basics to Real-World Scraping

This article introduces Python's re module, explains core regex functions such as match, search, sub (search and replace), and compile, covers flags and pattern syntax, and shows how to apply regular expressions to a web‑scraping task.

Python Crawling & Data Mining

1. Introduction

A regular expression is a special character sequence that describes a pattern, so you can check whether a string matches it and pull out the parts that do. This article continues a series on Python regex.

2. Overview of the re Module

Since Python 1.5, the re module has provided Perl‑style regular expression support, giving Python full regex capabilities.

3. re.match Function

re.match(pattern, string, flags=0)

re.match tries to match the pattern at the start of the string. On success it returns a match object; otherwise it returns None.

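You can retrieve matched text with group(num) or groups(). group(0) returns the entire match, group(n) returns the n-th parenthesized group, and passing several group numbers returns a tuple of the corresponding substrings. groups() returns a tuple of all groups.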
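The behavior described above can be sketched as follows. The sample sentence and variable names are illustrative assumptions, not taken from the original article's (missing) example.

```python
import re

line = "Cats are smarter than dogs"  # illustrative sample text

# re.match only succeeds if the pattern matches at position 0.
m = re.match(r"(\w+) are (\w+) than (\w+)", line)

if m:
    whole = m.group(0)       # the entire matched text
    first = m.group(1)       # first parenthesized group
    pair = m.group(1, 3)     # several group numbers -> tuple
    everything = m.groups()  # all groups as a tuple

# "dogs" occurs in the string, but not at the start, so re.match fails.
no_match = re.match(r"dogs", line)
```

Because re.match returns None on failure, checking the result before calling group() avoids an AttributeError.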


4. Search and Replace with re.sub

re.sub(pattern, repl, string, count=0, flags=0)

re.sub returns a new string in which occurrences of the pattern are replaced by repl; the original string is not modified.

pattern : regex pattern string

repl : replacement string or function

string : original text

count : maximum replacements (0 means all)

flags : optional regex flags
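A minimal sketch of these parameters in use. The input strings and the helper name double are assumptions made for illustration.

```python
import re

phone = "2004-959-559 # a sample phone number"  # illustrative input

# Remove the trailing comment: optional spaces, '#', then the rest of the line.
no_comment = re.sub(r"\s*#.*$", "", phone)

# Remove every character that is not a digit.
digits_only = re.sub(r"\D", "", no_comment)

# count=1 limits the substitution to the first occurrence.
first_only = re.sub(r"-", "/", no_comment, count=1)


def double(match):
    """Replacement function: receives a match object, returns a string."""
    return str(int(match.group(0)) * 2)


# When repl is a function, it is called once per match.
doubled = re.sub(r"\d+", double, "A23G4HFD567")
```

Passing a function as repl is useful when the replacement depends on the matched text itself, as with the doubled numbers here.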


5. re.compile Function

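re.compile(pattern, flags=0)

re.compile compiles a regex pattern into a pattern object that can be reused through its match() and search() methods. Flags include:

re.I : ignore case

re.L : locale‑aware matching (only valid with bytes patterns in Python 3)

re.M : multiline mode (^ and $ also match at line boundaries)

re.S : dot matches newline

re.U : Unicode character properties (already the default for str patterns in Python 3)

re.X : verbose mode (allows comments and whitespace in the pattern)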
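As a rough illustration of compiling once and reusing the result, the sketch below applies one compiled pattern to several strings. The sample strings and variable names are assumptions.

```python
import re

# Compile once, reuse many times. re.I makes the match case-insensitive.
word_re = re.compile(r"python", re.I)

lines = ["Python is fun", "I like PYTHON", "no snake here"]

# match() anchors at the start of the string; search() scans the whole string.
starts_with = [bool(word_re.match(s)) for s in lines]
contains = [bool(word_re.search(s)) for s in lines]
```

The flags are fixed at compile time, so they do not need to be repeated on each match() or search() call.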

6. Regex Objects

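re.compile() returns a compiled pattern object (re.Pattern in current Python 3; older documentation calls it RegexObject). Its match() and search() methods return a match object (re.Match, formerly MatchObject), which provides methods such as group(), start(), end(), and span() to access match details.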
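The match-object methods named above can be tried on a small example. The date string and the name date_re are illustrative assumptions.

```python
import re

date_re = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
text = "Released on 2024-03-15 worldwide"  # illustrative sample text

m = date_re.search(text)

matched = m.group(0)   # full matched text
year = m.group(1)      # first group
start = m.start()      # index where the match begins
end = m.end()          # index just past the end of the match
span = m.span()        # (start, end) as a tuple
month_span = m.span(2) # span of the second group only

# Slicing the original string with the span recovers the matched text.
sliced = text[start:end]
```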

7. Regex Modifiers (Flags)

Flags control matching behavior and can be combined with bitwise OR (e.g., re.I | re.M).
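A short sketch of how combining flags changes the result. The sample text is an assumption chosen to make the differences visible.

```python
import re

text = "first line\nSecond line\nFIRST again"  # illustrative sample text

# Without flags, ^ matches only at the very start of the string.
plain = re.findall(r"^first", text)

# re.I ignores case; re.M lets ^ match at the start of every line.
combined = re.findall(r"^first", text, re.I | re.M)

# By default '.' does not match a newline; re.S changes that.
dot_default = re.search(r"line.Second", text)
dot_all = re.search(r"line.Second", text, re.S)
```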

8. Regex Pattern Syntax

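Pattern strings use special syntax: ordinary characters match themselves, while a backslash either gives a character a special meaning (\d matches a digit) or removes one (\. matches a literal dot). Raw strings (e.g., r'\t') are recommended so that backslashes do not have to be escaped twice.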

Typical elements include character classes, quantifiers, anchors, and groups. A reference table (image) illustrates these components.
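To make these elements concrete, the sketch below uses a character class, quantifiers, anchors, a group, and escaped metacharacters on one sample string. The string and variable names are assumptions for illustration.

```python
import re

# In a normal string "\t" is a single tab character; in a raw string r"\t"
# is a backslash plus 't', which the regex engine then interprets as a tab.
normal_len = len("\t")
raw_len = len(r"\t")

sample = "Order #42: 3 apples, 12 pears.\tTotal: $7.50"  # illustrative text

numbers = re.findall(r"\d+", sample)            # \d class with + quantifier
words = re.findall(r"[A-Za-z]+", sample)        # explicit character class
price = re.search(r"\$(\d+\.\d{2})$", sample)   # escaped $ and ., group, $ anchor
starts_with_order = re.match(r"^Order", sample) # ^ anchor
tab_found = re.search(r"\t", sample)            # raw-string escape for a tab
```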

9. Practical Application: Scraping Movie Data

This example uses regex to extract movie titles, starring actors, and release dates from a web page (example: Maoyan movies). The approach is to locate the <div> containers in the HTML and extract the data from each one with a compiled pattern.

pattern = re.compile('<div>.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>', re.S)

The pattern captures the movie name, starring actor, and release time, enabling batch extraction of multiple fields.
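The pattern can be exercised offline as in the sketch below. SAMPLE_HTML is a made-up fragment shaped to fit the pattern, not real Maoyan markup, and the extract_movies helper is a hypothetical name. Fetching a live page is left out on purpose.

```python
import re

# Hypothetical HTML fragment (an assumption; the real page markup may differ).
SAMPLE_HTML = """
<div>
  <a href="/films/1" title="Example Movie A" class="image-link"></a>
  <p class="star">
    Starring: Actor One, Actor Two
  </p>
  <p class="releasetime">Release date: 2001-01-01</p>
</div>
<div>
  <a href="/films/2" title="Example Movie B" class="image-link"></a>
  <p class="star">
    Starring: Actor Three
  </p>
  <p class="releasetime">Release date: 2002-02-02</p>
</div>
"""

# Same pattern as in the article; re.S lets '.' span line breaks in the HTML.
pattern = re.compile(
    '<div>.*?title="(.*?)".*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>',
    re.S,
)


def extract_movies(html):
    """Return one dict per movie, with surrounding whitespace stripped."""
    return [
        {"title": title.strip(), "star": star.strip(), "release": release.strip()}
        for title, star, release in pattern.findall(html)
    ]


movies = extract_movies(SAMPLE_HTML)
```

findall returns one tuple per match with one item per capture group, which is what makes extracting several fields at once convenient. The non-greedy .*? keeps each match inside a single movie block in this sample.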

10. Summary

Regular expressions are well suited to extracting several data points from text in a single pass.

This article covered regex fundamentals, core functions, flags, pattern syntax, and a real‑world scraping example.

For further Python learning, refer to additional resources.

