Fundamentals 12 min read

Why a Single Regex Can Crash Your Java Service: Understanding NFA Backtracking

An unexpected CPU spike in a Java service was traced to a complex URL‑validation regex whose NFA backtracking caused catastrophic performance, and the article explains the regex engine’s behavior, identifies the problematic pattern, and shows how to refactor the expression to eliminate excessive backtracking.

Efficient Ops
Efficient Ops
Efficient Ops
Why a Single Regex Can Crash Your Java Service: Understanding NFA Backtracking

We observed a Java service where CPU usage rose to nearly 100% after a request triggered the

validateUrl

method, which validates URLs with a regular expression.

A simple unit test reproduces the issue:

<code>public static void main(String[] args) {
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$";
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!");
    }
}
</code>

Running the test shows the Java process consuming about 91% CPU.

CPU usage screenshot
CPU usage screenshot

Regex Engine

Java uses an NFA (Non‑deterministic Finite Automaton) engine. NFA matches by reading the pattern character by character and backtracking when a mismatch occurs. Backtracking can become extremely costly for certain patterns.

NFA Backtracking

When the engine encounters the URL‑validation regex, it repeatedly backtracks because the pattern is greedy and the input URL contains long, complex segments (including underscores and percent signs) that the third part of the regex does not accept.

<code>^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\/])+$</code>

The first part (protocol) is fine, but the second part greedily consumes the domain and the following path, causing the engine to read far beyond the intended dot separator. The third part lacks support for '_' and '%', forcing the engine to backtrack many times.

Solution

Adding the missing characters to the third part and making the second part possessive (using

++

) eliminates backtracking:

<code>public static void main(String[] args) {
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~_%\\/])+$";
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7JGxxxxx4ILd-kExxxxxxxqJ4-CHLmqVnenXC692m74H38sdfdsazxcUmfcOH2fAfY1Vw__%5EDadIfJgiEf";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!");
    }
}
</code>

After the change the program prints

match!!

instantly and CPU usage stays normal.

For more robust patterns, consider using lazy quantifiers or atomic groups, and always test with a regex debugger to detect catastrophic backtracking.

Regex debugger steps
Regex debugger steps

Conclusion

A seemingly simple URL‑validation regex can cause severe CPU overload due to NFA backtracking. Understanding the engine’s behavior and applying possessive quantifiers or expanding the character class prevents catastrophic performance loss.

JavaPerformanceNFACPUbacktrackingRegex
Efficient Ops
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.