Designing a 100k QPS Sensitive‑Word Filter with Real‑Time Updates
This article analyzes high‑throughput sensitive‑word filtering by comparing brute‑force, KMP, Trie, double‑array Trie and Aho‑Corasick algorithms, presents their time and space complexities, shows Java implementations for Trie and AC automata, evaluates Netty deployment options, and offers practical optimizations such as asynchronous detection, hot‑reloading, tiered responses, logging and fuzzy matching.
Problem Overview
In instant‑messaging and other user‑generated content platforms, filtering sensitive words must handle up to 100 000 queries per second (QPS) with latency below 50 ms while supporting real‑time updates to the keyword list.
Algorithmic Options
Brute‑Force (BF) : Checks each keyword against the text sequentially. Time complexity O(n·m) (n = text length, m = keyword length). Simple to implement but scales linearly with the number of keywords.
KMP (Knuth‑Morris‑Pratt) : Uses a next array to skip redundant comparisons. Achieves O(n+m) time for a single pattern; however, it must be run separately for each keyword, limiting multi‑pattern efficiency.
Trie (Prefix Tree) : Stores all keywords in a shared prefix structure. Lookup time becomes O(L) where L is the length of the examined substring, and it supports parallel multi‑pattern matching. Space consumption can be high (GB‑level for large vocabularies).
Double‑Array Trie (DAT) : A compact representation of a Trie that reduces memory usage to about 1 % of a classic Trie. Suitable for static large dictionaries but less efficient for frequent multi‑pattern scans.
Aho‑Corasick (AC) Automaton : Extends a Trie with failure pointers (fail links) inspired by KMP. Allows a single pass over the text ( O(n)) to detect all keywords simultaneously, offering the best trade‑off for high‑QPS scenarios.
Implementation Details
Trie‑based Filter (Java)
public class SensitiveWordFilter {
private final TrieNode root = new TrieNode();
// load keywords, build Trie, and filter text as shown in the source.
}AC Automaton (Java)
public class AcAutomaton {
private static class TrieNode {
Map<Character, TrieNode> children = new HashMap<>();
String word; // non‑null when node marks a keyword end
TrieNode fail; // failure link
}
private final TrieNode root = new TrieNode();
public void build(List<String> keywords) { /* construct Trie and BFS to set fail links */ }
public boolean match(String text) { /* single‑pass scan using fail links */ }
}Performance Comparison (Key Findings)
BF: O(n·m), low memory, unsuitable for large vocabularies.
Trie: O(L) per lookup, high memory, good for prefix queries.
DAT: Minimal memory, fast for single‑pattern scans, but multi‑pattern performance degrades.
AC Automaton: O(n) for all patterns, moderate memory (Trie + fail links), best for real‑time high‑QPS filtering.
Netty Integration and Technology Selection
When deploying in Netty, the following criteria are considered: performance, memory footprint, and implementation complexity.
First Choice – AC Automaton : Single‑pass detection, supports dynamic keyword reload, and scales to millions of words.
Second Choice – Double‑Array Trie : Extremely low memory, suitable for static dictionaries where multi‑pattern speed is less critical.
Not Recommended – Plain Trie or BF : High memory usage or unacceptable latency under heavy load.
Practical Optimizations
Asynchronous Detection : Offload matching to a dedicated thread pool to keep Netty I/O threads non‑blocking.
Dynamic Reload : Use WatchService to monitor keyword files and rebuild the automaton on change.
Tiered Response : Attach severity levels to keywords and return different messages (warning vs. block).
Logging : Record matched keyword, original message, and user info for audit.
Fuzzy Matching : Combine regular‑expression patterns with the AC automaton to catch variations (e.g., pinyin, homophones).
Open‑Source References
Apache Commons Collections provides a PatriciaTrie implementation.
Java AC automaton library org.ahocorasick for lightweight multi‑pattern matching.
Hankcs AhoCorasickDoubleArrayTrie (GitHub) demonstrates a high‑performance DAT‑based AC automaton.
By following the described analysis, choosing the AC automaton for Netty services, and applying the listed optimizations, a system can reliably filter sensitive words at 100 k QPS with sub‑50 ms latency while allowing real‑time keyword updates.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Tech Freedom Circle
Crazy Maker Circle (Tech Freedom Architecture Circle): a community of tech enthusiasts, experts, and high‑performance fans. Many top‑level masters, architects, and hobbyists have achieved tech freedom; another wave of go‑getters are hustling hard toward tech freedom.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
