Avoid Catastrophic Backtracking: How Regex Can Crash Your Server
A detailed exploration of how a complex regular expression used for shop‑name validation caused CPU spikes due to catastrophic backtracking, explaining DFA vs NFA engines, greedy, lazy and possessive quantifiers, and offering practical tips to write performant regexes.
1. Origin of the Issue
During development of a self‑service registration feature for Lazada Seller Center, a complex shop‑name validation rule was created, allowing letters, digits, Vietnamese characters and special symbols such as “&”, “-”, “_”. The author wrote a long regular expression to enforce it.
<code>^([A-Za-z0-9._()&'\- ]|[aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ])+$</code>
In the test environment the regex worked, but after deployment to production in Malaysia the CPU spiked to 100 % and the site became sluggish. Thread dumps showed all threads stuck in the regex validation.
2. Regex Engines
Regex engines are mainly of two types: DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton). DFA processes the input text left‑to‑right, never backtracks, and runs in polynomial time, but supports few features. NFA drives the matching from the pattern, may backtrack, and can have exponential worst‑case complexity.
Example
<code>text = 'after tonight' regex = 'to(nite|nighta|night)' </code>
Explanation of NFA vs DFA matching steps omitted for brevity.
3. Backtracking
The catastrophic case occurs when the engine repeatedly tries alternatives and rolls back, consuming large CPU cycles.
4. Greedy, Lazy and Possessive Quantifiers
Standard quantifiers ( ?, +, *, {min,max}) are greedy by default, trying to match as much as possible. Adding ? makes them lazy (minimal match). Adding + after a quantifier makes it possessive, which disables backtracking.
<code>ab{1,3}c # greedy ab{1,3}?c # lazy ab{1,3}+bc # possessive </code>
Diagrams illustrating greedy vs lazy vs possessive matching are shown below.
5. Summary
The original long regex can be simplified to ^[allowed‑characters]+, where the allowed set contains about 250 characters. In NFA greedy mode, matching an excessively long input can trigger massive backtracking and CPU exhaustion. Switching to a possessive quantifier resolved the 100 % CPU issue. When writing regexes, always consider performance implications in addition to functional correctness.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
