Improving Large-Scale Regex Matching Performance with Hyperscan and Flink

This article explains how to boost the efficiency of massive regular‑expression matching by using Intel's Hyperscan library, integrating it with Apache Flink for streaming processing, and providing deployment guidelines for both private and internal environments.

360 Smart Cloud

Background: Large‑scale regex matching is widely used in security scenarios such as detecting compromised FTP accounts and deep‑packet inspection, where rule sets can reach tens of thousands and data arrives as streams.

Challenges: Traditional regex engines are a poor fit for this workload because it combines a very large rule count, low‑latency matching requirements, streaming input, and a tight resource budget.

Hyperscan Overview: Hyperscan is an open‑source, high‑performance regex engine from Intel that supports most PCRE syntax, streaming matching, multi‑pattern matching, and CPU‑specific instruction set acceleration. It requires CPUs with at least SSSE3 support.
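The SSSE3 requirement can be checked before deployment. The sketch below is one way to do it, assuming a Linux x86 host where `/proc/cpuinfo` exposes a `flags` line; the function names are illustrative and not part of Hyperscan.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_ssse3(cpuinfo_text):
    """True if the 'ssse3' flag is present (note: distinct from 'sse3')."""
    return "ssse3" in cpu_flags(cpuinfo_text)


# Sample text in the /proc/cpuinfo layout (assumed format, Linux x86).
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 avx2\n"
has_ssse3 = supports_ssse3(sample)
```

On a real host the input would come from `open("/proc/cpuinfo").read()`.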

Engine Architecture: Hyperscan operates in two phases: compilation, which builds a database from the regexes, and matching, which scans data against that database at runtime. Compilation can be performed offline, and the resulting database can be serialized for reuse.
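The two phases can be illustrated with a minimal sketch. It uses Python's standard `re` module as a stand-in, not Hyperscan itself: the stand-in tries the patterns one at a time, whereas Hyperscan scans for all patterns in a single pass. The function names and the sample FTP rules are assumptions made for illustration.

```python
import re


def compile_database(rules):
    """Compile phase: turn {rule_id: pattern} into a reusable list of matchers."""
    return [(rule_id, re.compile(pattern)) for rule_id, pattern in sorted(rules.items())]


def scan(database, data, on_match):
    """Match phase: report every match as on_match(rule_id, start, end)."""
    for rule_id, compiled in database:
        for m in compiled.finditer(data):
            on_match(rule_id, m.start(), m.end())


# Compile once, then reuse the database for every scan.
database = compile_database({10: r"USER \w+", 20: r"PASS \S+"})

hits = []
scan(database, "USER alice\r\nPASS s3cret\r\n",
     lambda rule_id, start, end: hits.append((rule_id, start, end)))
```

Reporting matches through a callback that receives a rule ID and offsets is what lets a caller map each hit back to the rule that produced it.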

Integration with Flink: Hyperscan is a single‑node library with no distributed mode of its own, so the solution embeds it in a custom Flink UDF operator that offloads matching to a Hyperscan subprocess. The operator receives input records, forwards the relevant fields to Hyperscan, and returns a match status (hit, miss, error, or timeout) along with the IDs of the matched regexes.
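The operator-to-subprocess handoff and the four result states can be sketched as follows. This is an illustration under stated assumptions, not the article's operator: the worker is a small Python script using `re` in place of Hyperscan, the JSON request format is invented, and a new process is spawned per call for simplicity where a real operator would keep a long-lived subprocess.

```python
import json
import subprocess
import sys

# Hypothetical worker: reads {"rules": {id: pattern}, "text": ...} from stdin
# and prints a JSON list of the rule IDs that match.
WORKER = r"""
import json, re, sys
req = json.load(sys.stdin)
ids = [int(i) for i, p in req["rules"].items() if re.search(p, req["text"])]
print(json.dumps(sorted(ids)))
"""


def match_field(rules, text, timeout_s=5.0):
    """Return (status, ids); status is 'hit', 'miss', 'error', or 'timeout'."""
    payload = json.dumps({"rules": rules, "text": text})
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER],
            input=payload, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout", []
    if proc.returncode != 0:  # e.g. the worker crashed on an invalid pattern
        return "error", []
    ids = json.loads(proc.stdout)
    return ("hit", ids) if ids else ("miss", [])
```

Keeping the matcher in a separate process means a crash or a runaway scan surfaces as an `error` or `timeout` status instead of taking down the streaming task.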

Deployment Options:

Private deployment: users compile regexes into a serialized database, optionally store it on HDFS, and reference it in Flink jobs to avoid repeated compilation.

Internal platform deployment: a platform (e.g., 奇麟) handles compilation and distribution of the database automatically.
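The private-deployment workflow (compile once, store the artifact, load it in each job) might look like the sketch below. It is a stand-in under several assumptions: a local directory plays the role of HDFS, `pickle` plays the role of Hyperscan's database serialization, and the helper names are hypothetical. Python's `pickle` stores only the pattern source, so this shows the workflow rather than the compile-time savings a real serialized Hyperscan database provides.

```python
import hashlib
import os
import pickle
import re
import tempfile


def rules_fingerprint(rules):
    """Stable hash of the rule set, used to name the cached artifact."""
    canon = "\n".join(f"{rid}\t{pat}" for rid, pat in sorted(rules.items()))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()[:16]


def load_or_compile(rules, cache_dir):
    """Return (database, was_cached); database is a list of (id, compiled pattern)."""
    path = os.path.join(cache_dir, f"rules-{rules_fingerprint(rules)}.db")
    if os.path.exists(path):
        with open(path, "rb") as fh:  # only load artifacts from a trusted location
            return pickle.load(fh), True
    database = [(rid, re.compile(pat)) for rid, pat in sorted(rules.items())]
    with open(path, "wb") as fh:
        pickle.dump(database, fh)
    return database, False


cache_dir = tempfile.mkdtemp()  # stand-in for a shared HDFS directory
rules = {1: r"admin", 2: r"\.php$"}
_, first_cached = load_or_compile(rules, cache_dir)    # compiles and stores
db, second_cached = load_or_compile(rules, cache_dir)  # loads the stored artifact
```

Naming the artifact after a hash of the rule set means a changed rule set produces a new file instead of silently reusing a stale database.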

Usage Example: The article demonstrates matching the HTTP Host and Referer fields with a four‑step pipeline: create the source stream, convert it to a HyperscanStream, invoke the Hyperscan function on the target fields, and process the resulting Tuple2, which holds the original event and a list of HyperScanRecord objects.
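The shape of that pipeline can be mimicked in plain Python. This sketch does not reproduce the article's HyperscanStream API, which is not shown here; `MatchRecord`, `source`, `hyperscan`, the rules, and the sample events are all hypothetical stand-ins, and `re` replaces Hyperscan.

```python
import re
from dataclasses import dataclass


@dataclass
class MatchRecord:
    """Hypothetical stand-in for the article's HyperScanRecord."""
    field: str
    rule_id: int


RULES = {
    1: re.compile(r"\.evil\.example$"),
    2: re.compile(r"^https?://[^/]*phish"),
}


def source():
    """Step 1: the source stream of HTTP events."""
    yield {"host": "cdn.evil.example", "referer": "http://news.example/"}
    yield {"host": "www.example.org", "referer": "https://phish-login.example/a"}
    yield {"host": "www.example.org", "referer": ""}


def hyperscan(events, fields):
    """Steps 2 and 3: match the chosen fields; emit (event, records) pairs."""
    for event in events:
        records = [
            MatchRecord(field, rule_id)
            for field in fields
            for rule_id, pattern in RULES.items()
            if pattern.search(event.get(field, ""))
        ]
        yield event, records


# Step 4: consume the (event, records) pairs.
results = list(hyperscan(source(), ["host", "referer"]))
```

Pairing each original event with its list of records keeps the full event available downstream, so later operators can enrich or route it without a join.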

Performance: In tests with 10,000 regex rules, the solution met the expected latency and throughput targets. The original article presents the results as performance charts.

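Best Practices & Limitations: Hyperscan does not support every PCRE construct, so rules that depend on unsupported syntax must be rewritten or run through the Chimera library instead. The host CPU must also provide the required instruction set (SSSE3 at minimum).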
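A rough pre-screen can flag rules that are likely to be rejected. The sketch below is a heuristic under assumptions: it looks only for numbered backreferences, lookaround assertions, and atomic groups, which Hyperscan's documentation lists among its unsupported constructs, and it does not cover the full list. The authoritative check is whether Hyperscan's compiler accepts the pattern.

```python
def unsupported_constructs(pattern):
    """Return a sorted list of suspicious construct names found in a PCRE pattern."""
    found = set()
    in_class = False  # inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # \1..\9 outside a character class is a numbered backreference.
            if pattern[i + 1] in "123456789" and not in_class:
                found.add("backreference")
            i += 2  # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif pattern.startswith(("(?=", "(?!", "(?<=", "(?<!"), i):
            found.add("lookaround")
        elif pattern.startswith("(?>", i):
            found.add("atomic group")
        i += 1
    return sorted(found)


# Split a rule set into rules to compile directly and rules needing review.
rules = {1: r"^GET /[a-z]+\.php$", 2: r"(a)\1", 3: r"foo(?=bar)"}
needs_review = {rid: unsupported_constructs(p) for rid, p in rules.items()
                if unsupported_constructs(p)}
```

Rules that land in `needs_review` are candidates for rewriting or for the Chimera path mentioned above.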

Future Outlook: Plans include extending the solution to use cases beyond security, such as text moderation, and adding dynamic hot‑loading of rules so that rule changes no longer require a job restart.

Tags: Flink, Streaming, security, regex, hyperscan
