Why a Simple Regex Can Crash Your CPU: Understanding NFA Backtracking
The article examines a Java URL‑validation regex that caused near‑100% CPU usage, explains how Java's NFA engine and backtracking lead to catastrophic performance, and shows how to rewrite the pattern with possessive quantifiers and proper character classes to eliminate the issue.
During a project monitoring session the author noticed CPU usage approaching 100% and, after dumping the Java thread stack, found that the problem originated from a method named
validateUrlthat validates URLs with a regular expression.
A simple unit test reproduced the issue, showing the Java process consuming 91.4% CPU.
The offending regular expression is:
<code>^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\/])+$</code>It consists of three parts: protocol validation, domain validation, and parameter validation. Although the pattern looks harmless, Java's regex engine is based on an NFA (Non‑deterministic Finite Automaton) that performs backtracking. When the engine backtracks excessively, the matching time can explode, leading to the observed CPU spike.
Regex Engine
Two main implementations exist: DFA (Deterministic Finite Automaton) with linear time but limited features, and NFA, which is more powerful but may backtrack, causing unpredictable performance. Most languages, including Java, .NET, Perl, Python, Ruby, and PHP, use NFA.
NFA Backtracking
Consider the simple example:
<code>text="Today is a nice day." regex="day"</code>The NFA reads each character of the pattern and compares it to the target string, advancing on matches and backtracking when a mismatch occurs.
Another example demonstrates catastrophic backtracking:
<code>text="abbc" regex="ab{1,3}c"</code>The engine greedily matches as many
bcharacters as possible, then backtracks when the following
cdoes not align, repeating this process many times.
Root Causes in the URL Regex
The original pattern suffers from two issues:
In the domain part, the greedy
.+consumes characters beyond the intended dot, causing extensive backtracking.
The final part does not allow underscores (
_) or percent signs (
%) that appear in the problematic URL, forcing the engine to backtrack heavily.
Solution
By adding the missing characters to the third part and converting the greedy quantifier in the second part to a possessive one, backtracking is eliminated:
<code>^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\/])+$</code>Running the program with the revised pattern prints
match!!instantly and avoids the CPU explosion.
For further verification, tools like regex101.com provide step‑by‑step debugging and show how many matching steps are required. The original pattern required over 110,000 steps, while the improved version completes in just 58 steps, demonstrating the dramatic performance gain.
The case illustrates how a seemingly innocuous regular expression can become a performance nightmare, emphasizing the importance of understanding NFA backtracking, using possessive quantifiers, and testing regexes with dedicated debugging tools.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.