Understanding Greedy and Non‑Greedy Matching in Regular Expressions
This article explains the difference between greedy and non‑greedy (lazy) matching in regular expressions, describes how quantifiers behave by default, shows how to switch to lazy mode using a trailing question mark, and provides multiple Python code examples illustrating both approaches.
Regular expressions use quantifiers such as *, +, ?, and {n,m} to specify how many times the preceding element may repeat. By default these quantifiers are greedy: they consume as many characters as possible while still allowing the overall pattern to match. Adding a trailing question mark makes them non‑greedy (lazy), so they consume as few characters as possible instead.
Greedy Matching
In greedy mode the engine expands the match to the longest possible string that still satisfies the pattern. The following Python example demonstrates this behavior when searching for HTML‑like tags.
    import re

    # <tag1> and <tag2> are placeholder tags used for illustration
    text = "Here is some text with <tag1> and <tag2>."
    pattern = r'<.*>'
    match = re.search(pattern, text)
    print(match.group())  # output: <tag1> and <tag2>

The pattern <.*> starts at the first '<' and continues until the last '>', capturing everything in between.
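As a rough check of that claim, the sketch below compares the boundaries of a greedy match with the positions of the first '<' and the last '>'. The sample string and its tag names are made up for illustration.

```python
import re

# Hypothetical sample text; the tag names are made up for illustration.
sample = "start <a> middle <b> end"

match = re.search(r"<.*>", sample)

first_open = sample.index("<")    # position of the first '<'
last_close = sample.rindex(">")   # position of the last '>'

# The greedy match begins at the first '<' and ends just after the last '>'.
print(match.span() == (first_open, last_close + 1))
print(match.group())
```

The first line printed should be True, since the match spans from the first opening bracket to the last closing one.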
Non‑Greedy (Lazy) Matching
Appending ? after a quantifier forces the engine to stop as soon as the rest of the pattern can be satisfied. The example below extracts each tag individually.
    import re

    # <tag1> and <tag2> are placeholder tags used for illustration
    text = "Here is some text with <tag1> and <tag2>."
    pattern = r'<.*?>'
    matches = re.findall(pattern, text)
    print(matches)  # output: ['<tag1>', '<tag2>']

Here the engine stops at the first closing '>' it reaches, returning a separate match for each tag.
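To see where each lazy match stops, a small sketch with re.finditer can list the span of every match. The sample string is an assumption chosen for illustration, not data from the article.

```python
import re

# Hypothetical sample text; the tag names are made up for illustration.
sample = "start <a> middle <b> end"

# Collect the matched text and its (start, end) span for every lazy match.
lazy_matches = [(m.group(), m.span()) for m in re.finditer(r"<.*?>", sample)]

for found, span in lazy_matches:
    print(found, span)
```

Each match ends at the nearest '>' after its own '<', so the two tags come back as separate, non-overlapping spans.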
Key Points Summary
• Greedy quantifiers (*, +, ?, {n,m}) are the default and match as many characters as possible.
• Non‑greedy quantifiers (*?, +?, ??, {n,m}?) match the minimal number of characters needed.
• Choosing between them depends on the structure of the data you need to extract.
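The quantifier pairs listed above can be compared side by side. The following is a minimal sketch that assumes a subject string of four a's and uses re.match so that every attempt starts at the same position.

```python
import re

subject = "aaaa"

# Each pair is (greedy pattern, lazy pattern).
pairs = [
    (r"a*", r"a*?"),
    (r"a+", r"a+?"),
    (r"a?", r"a??"),
    (r"a{2,3}", r"a{2,3}?"),
]

results = {}
for greedy, lazy in pairs:
    # re.match anchors at the start of the string, so only the quantifier varies.
    results[greedy] = re.match(greedy, subject).group()
    results[lazy] = re.match(lazy, subject).group()
    print(greedy, repr(results[greedy]), "|", lazy, repr(results[lazy]))
```

Because nothing follows the quantifier in these patterns, the lazy forms settle for the minimum repetition count, which is why a*? and a?? match the empty string.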
Additional Illustrative Examples
1. HTML tag matching (greedy vs lazy)
    import re

    # <div> is an illustrative tag name
    text = "<div>First</div><div>Second</div>"

    pattern_greedy = r'<div>.*</div>'
    print(re.findall(pattern_greedy, text))  # ['<div>First</div><div>Second</div>']

    pattern_lazy = r'<div>.*?</div>'
    print(re.findall(pattern_lazy, text))  # ['<div>First</div>', '<div>Second</div>']

2. Matching repeated words
    import re

    text = "This is a test test sentence."

    pattern_greedy = r"(\b\w+\b)\s+\1"
    match = re.search(pattern_greedy, text)
    print("Greedy:", match.group(0))  # Greedy: test test

    pattern_lazy = r"(\b\w+\b)\s+?\1"
    match = re.search(pattern_lazy, text)
    print("Lazy:", match.group(0))  # Lazy: test test

Both patterns produce the same result here. The backreference \1 must begin with a word character, so \s+ and \s+? both have to consume exactly the whitespace between the two words.
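The same equivalence can be checked with a longer run of whitespace. This sketch assumes a made-up sample containing two spaces, a tab, and another space between the repeated words.

```python
import re

# Hypothetical sample with several whitespace characters between the repeated words.
sample = "a test  \t test here"

greedy = re.search(r"(\b\w+\b)\s+\1", sample)
lazy = re.search(r"(\b\w+\b)\s+?\1", sample)

print(repr(greedy.group(0)))
print(greedy.group(0) == lazy.group(0))
```

The lazy \s+? starts with a single whitespace character, but it has to keep extending until the backreference can match, so it ends up covering the same run as the greedy version.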
3. Extracting file names from paths
    import re

    path = "/home/user/documents/report.docx"

    pattern_greedy = r".*/(.*)"
    match = re.search(pattern_greedy, path)
    print("Greedy file name:", match.group(1))  # report.docx

    pattern_lazy = r".*?/(.*)"
    match = re.search(pattern_lazy, path)
    print("Lazy file name:", match.group(1))  # home/user/documents/report.docx

The greedy .* consumes everything up to the last slash, so the group captures only the file name. The lazy .*? stops at the first slash, which here is the leading one, so the group captures the rest of the path. The choice of quantifier changes the result.
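The greedy pattern can be wrapped in a small helper and compared against the standard library. This is a sketch only: file_name is a hypothetical helper name, the sample paths are made up, and posixpath.basename is used purely as a reference for '/'-separated paths.

```python
import posixpath
import re


def file_name(path: str) -> str:
    """Return the part of `path` after the last '/' (hypothetical helper)."""
    # The greedy .* runs to the last slash; the group takes what follows.
    match = re.search(r".*/(.*)", path)
    # Without any slash the pattern cannot match, so return the path itself.
    return match.group(1) if match else path


# Made-up sample paths: absolute, no directory part, and trailing slash.
paths = ["/home/user/documents/report.docx", "notes.txt", "dir/sub/"]
for p in paths:
    print(repr(file_name(p)), file_name(p) == posixpath.basename(p))
```

In practice a path library is usually the more robust choice for this task. The regex version is here only to show what the greedy quantifier does.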
Conclusion
Greedy matching (*, +, ?, {n,m}) is the default behavior and makes each quantifier consume as much as possible; non‑greedy matching (*?, +?, ??, {n,m}?) makes it consume as little as possible while still letting the overall pattern match. Understanding and selecting the appropriate mode allows precise control over pattern extraction in regular expressions.
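One caveat is worth a short sketch. A lazy quantifier keeps its own repetition short, but the engine still takes the leftmost starting position, so the overall match is not always the shortest candidate in the text. The sample string is made up, and the negated character class is a different technique shown only for comparison.

```python
import re

# Hypothetical sample with a stray '<' before the tag.
sample = "<<a>"

# The lazy pattern starts at the leftmost '<' and extends until a '>' appears.
lazy = re.search(r"<.*?>", sample)
print(lazy.group())

# A negated character class (a different technique) refuses to cross another '<'.
strict = re.search(r"<[^<>]*>", sample)
print(strict.group())
```

The lazy pattern returns the whole string because a match beginning at index 0 is possible, even though a shorter one begins at index 1.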
Test Development Learning Exchange