Why Brute‑Force Won’t Cut It for Sensitive‑Word Filtering (And What Actually Works)
The article walks through the evolution of sensitive‑word filtering—from naïve brute‑force scanning to Trie, Aho‑Corasick automaton, Double‑Array Trie, and DFA implementations—detailing their algorithms, time/space complexities, concrete Java code examples, performance trade‑offs, high‑concurrency optimizations, and practical production advice for building a robust content‑moderation system.
Core Conclusions
Trie – suitable for small word lists (< 10 k). Simple implementation and easy to understand.
AC Automaton – best for high‑throughput scenarios. Performs a single pass over the text and matches all keywords.
Double‑Array Trie (DAT) – designed for large word lists (> 10 k). Low memory usage (≈20‑30 % of a standard array‑based Trie) at the cost of higher construction complexity.
Algorithm Evolution
Brute‑Force (BF) Matching
Checks each keyword against the full text independently. With n keywords of average length m and text length L, the time complexity is O(n·L·m).
public List<String> bruteForceMatch(String text, List<String> words) {
List<String> result = new ArrayList<>();
for (String word : words) { // O(n)
if (text.contains(word)) { // O(L·m)
result.add(word);
}
}
return result;
}

Problems
Repeated scanning of the same characters for every keyword.
No state reuse; each keyword is independent.
Latency grows linearly; with a 10 k word list the delay can reach seconds.
Trie Tree
A Trie (prefix tree) shares common prefixes among keywords, reducing redundant comparisons. Building a Trie costs O(n·m) and matching costs O(L·m). Example word list: {"高清视频", "高清 CV", "东京冷", "东京热"}.
public class SimpleTrie {
private static class Node {
Map<Character, Node> children = new HashMap<>();
boolean isEnd;
}
private final Node root = new Node();
public void addWord(String word) {
Node node = root;
for (char c : word.toCharArray()) {
node = node.children.computeIfAbsent(c, k -> new Node());
}
node.isEnd = true;
}
public boolean contains(String text) {
for (int i = 0; i < text.length(); i++) {
Node node = root;
for (int j = i; j < text.length(); j++) {
node = node.children.get(text.charAt(j));
if (node == null) break;
if (node.isEnd) return true;
}
}
return false;
}
public List<String> matchAll(String text) {
List<String> result = new ArrayList<>();
for (int i = 0; i < text.length(); i++) {
Node node = root;
for (int j = i; j < text.length(); j++) {
node = node.children.get(text.charAt(j));
if (node == null) break;
if (node.isEnd) {
result.add(text.substring(i, j + 1));
}
}
}
return result;
}
}

Limitation – Backtracking: when a match fails mid‑word, the scan restarts from the next starting position, so the worst case is still O(L·m) (e.g., text "aaaaaaaa" against pattern "aaaaab"). This motivates the Aho‑Corasick automaton.
Aho‑Corasick (AC) Automaton
Builds on a Trie and adds failure links to avoid backtracking, achieving linear time O(L + z) where z is the total number of matches. Three core functions:
goto : state transition on a character.
failure : jump to the longest proper suffix state when a mismatch occurs.
output : collect all matched keywords at the current state.
Construction uses BFS. The Java implementation:
public class AhoCorasickAutomaton {
private static class Node {
Map<Character, Node> children = new HashMap<>();
Node fail;
List<String> outputs = new ArrayList<>();
}
private final Node root = new Node();
public void addWord(String word) {
Node node = root;
for (char c : word.toCharArray()) {
node = node.children.computeIfAbsent(c, k -> new Node());
}
node.outputs.add(word);
}
public void buildFailPointer() {
Queue<Node> queue = new LinkedList<>();
root.fail = root;
for (Node child : root.children.values()) {
child.fail = root;
queue.offer(child);
}
while (!queue.isEmpty()) {
Node cur = queue.poll();
for (Map.Entry<Character, Node> e : cur.children.entrySet()) {
char c = e.getKey();
Node child = e.getValue();
Node f = cur.fail;
while (f != root && !f.children.containsKey(c)) {
f = f.fail;
}
child.fail = f.children.getOrDefault(c, root);
child.outputs.addAll(child.fail.outputs);
queue.offer(child);
}
}
}
public List<String> match(String text) {
List<String> result = new ArrayList<>();
Node state = root;
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
while (state != root && !state.children.containsKey(c)) {
state = state.fail;
}
state = state.children.getOrDefault(c, root);
result.addAll(state.outputs);
}
return result;
}
}

Example: with words {"she", "he", "her", "hers"} on text "ushers", match returns [she, he, her, hers].
Double‑Array Trie (DAT)
DAT compresses the Trie into two integer arrays (base[] and check[]), reducing memory to about 20‑30 % of a standard array‑based Trie. The structure was introduced by Jun‑ichi Aoe (1989). The trade‑off is higher construction cost, because assigning base values requires resolving slot conflicts.
Space complexity : Standard Trie – O(n·m·σ) (σ = character set size). DAT – O(n·m). Memory savings depend on the amount of shared prefixes; with no common prefixes the benefit diminishes.
Reference implementation: https://github.com/komiya-atsushi/darts-java
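The lookup side of a DAT is compact: a transition from state s on character c lands at t = base[s] + code(c), and is valid only if check[t] == s. A minimal sketch of that invariant, using hand‑built arrays for the toy word set {"ab", "ac"} (the arrays and the code(c) mapping are illustrative assumptions, not the output of a real builder):

```java
public class DatLookupSketch {
    // Hand-built arrays for the words {"ab", "ac"}; state 1 is the root.
    static final int[] BASE = {0, 1, 1, 0, 0};
    static final int[] CHECK = {0, 0, 1, 2, 2};
    static final boolean[] IS_END = {false, false, false, true, true};

    // Toy character coding: 'a' -> 1, 'b' -> 2, ...
    static int code(char c) {
        return c - 'a' + 1;
    }

    // Follow transitions t = base[s] + code(c); valid only if check[t] == s.
    static boolean containsWord(String word) {
        int s = 1;
        for (char c : word.toCharArray()) {
            int t = BASE[s] + code(c);
            if (t >= CHECK.length || CHECK[t] != s) {
                return false; // no such transition
            }
            s = t;
        }
        return IS_END[s];
    }

    public static void main(String[] args) {
        System.out.println(containsWord("ab")); // true
        System.out.println(containsWord("ac")); // true
        System.out.println(containsWord("a"));  // false: prefix only, not a word
        System.out.println(containsWord("ad")); // false: transition out of range
    }
}
```

Lookup touches one base/check pair per character, which is why matching stays fast even after the Trie's pointer maps are gone; all of the complexity moves into the builder that must find non‑conflicting base values.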
DFA Implementation (Hutool)
Hutool 5.8.x provides a DFA‑based filter built on a Trie. API example:
WordTree wordTree = new WordTree();
wordTree.addWord("大");
wordTree.addWord("大憨憨");
wordTree.addWord("憨憨");
String text = "那人真是个大憨憨!";
String match = wordTree.match(text); // → "大"
List<String> all = wordTree.matchAll(text, -1, false, false); // non‑greedy

Parameters of matchAll: limit – maximum number of matches (‑1 for unlimited); isDensityMatch – whether to continue matching inside already matched words; isGreedy – true for longest‑match (greedy), false for shortest‑match.
Example results: matchAll(text, -1, false, false) → [大, 憨憨] (non‑greedy, non‑density). matchAll(text, -1, false, true) → [大, 大憨憨] (greedy).
Handling Transformed Words
Common evasion techniques include homophones, inserted symbols, mixed simplified/traditional characters, and full‑width characters. A typical mitigation is pre‑processing that normalizes the text before matching.
public String preprocess(String text) {
StringBuilder sb = new StringBuilder();
for (char c : text.toCharArray()) {
c = toHalfWidth(c); // full‑width → half‑width
c = Character.toLowerCase(c);
if (isChineseOrAlphanumeric(c)) {
sb.append(c);
}
}
return toSimplifiedChinese(sb.toString()); // 繁体 → 简体
}
private char toHalfWidth(char c) {
    if (c == '\u3000') return ' ';            // full‑width space → space
    if (c >= '\uFF01' && c <= '\uFF5E') {     // full‑width ASCII block
        return (char) (c - 0xFEE0);           // e.g. 'Ａ' (U+FF21) → 'A' (U+0041)
    }
    return c;
}
private boolean isChineseOrAlphanumeric(char c) {
return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
(c >= '0' && c <= '9') || (c >= '\u4e00' && c <= '\u9fa5');
}

Note: the preprocessing step removes characters, so the mapping between original and cleaned positions is lost. If exact position reporting is required, maintain a mapping table.
Unicode limitation: Java char is a UTF‑16 code unit; characters outside the BMP (e.g., many emojis) are represented by surrogate pairs. Use codePoints() for full Unicode support.
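A short sketch of iterating by code point rather than by char, so surrogate pairs (e.g. emoji) are treated as single characters; the letter‑only filtering predicate here is just an illustration:

```java
public class CodePointScan {
    public static void main(String[] args) {
        String text = "abc😀高";
        // length() counts UTF-16 code units: the emoji occupies two.
        System.out.println(text.length());             // 6
        // codePoints() yields one value per Unicode character.
        System.out.println(text.codePoints().count()); // 5
        // Rebuild the string keeping only letters (drops the emoji).
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(sb::appendCodePoint);
        System.out.println(sb); // abc高
    }
}
```

The int‑based overloads (Character.isLetter(int), StringBuilder.appendCodePoint(int)) accept code points beyond the BMP, which the char‑based overloads cannot represent.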
Mature libraries such as ToolGood.Words already embed these normalizations.
High‑Concurrency Optimizations
Atomic Hot‑Swap
Use AtomicReference<SimpleTrie> to replace the Trie instance atomically after rebuilding, ensuring ongoing requests see a consistent view.
public class SensitiveWordFilter {
private final AtomicReference<SimpleTrie> trieRef;
public SensitiveWordFilter(List<String> init) {
this.trieRef = new AtomicReference<>(buildTrie(init));
}
public List<String> match(String text) {
SimpleTrie trie = trieRef.get();
return trie != null ? trie.matchAll(text) : Collections.emptyList();
}
public void refreshWords(List<String> newWords) {
SimpleTrie newTrie = buildTrie(newWords);
trieRef.set(newTrie); // atomic publish
}
private SimpleTrie buildTrie(List<String> words) {
SimpleTrie trie = new SimpleTrie();
for (String w : words) trie.addWord(w);
return trie;
}
}

Key points:
AtomicReference guarantees a lock‑free swap.
Old Trie may still be used by in‑flight threads; it will be reclaimed by GC.
New Trie can be built asynchronously.
Parallel Chunk Processing
For ultra‑long texts, split the input into overlapping chunks (overlap = maxWordLength‑1) and process each chunk in a dedicated thread pool. Overlap guarantees no match is lost across chunk boundaries.
private final ExecutorService filterExecutor =
new ThreadPoolExecutor(4, 8, 60L, TimeUnit.SECONDS,
new LinkedBlockingQueue<>(1000),
new ThreadPoolExecutor.CallerRunsPolicy()); // back‑pressure
public List<String> parallelMatch(String text, int chunkSize, int maxWordLength) {
int overlap = maxWordLength - 1;
List<CompletableFuture<List<String>>> futures = new ArrayList<>();
for (int i = 0; i < text.length(); i += chunkSize) {
int start = i;
int end = Math.min(i + chunkSize + overlap, text.length());
String chunk = text.substring(start, end);
futures.add(CompletableFuture.supplyAsync(() ->
trieRef.get().matchAll(chunk), filterExecutor));
}
return futures.stream()
.flatMap(f -> f.join().stream())
.distinct()
.collect(Collectors.toList());
}

Why overlap is needed: if a keyword spans the boundary between two chunks, without overlap neither chunk contains the full keyword, causing a miss.
Bloom Filter Pre‑Check
A Bloom filter can quickly discard texts that definitely contain no keywords, reducing unnecessary Trie scans. The quick‑check scans all possible substrings up to the maximum keyword length.
public List<String> matchWithBloomFilter(String text, int maxWordLength) {
if (!quickCheck(text, maxWordLength)) {
return Collections.emptyList(); // guaranteed no match
}
return trieRef.get().matchAll(text);
}
private boolean quickCheck(String text, int maxWordLen) {
BloomFilter<String> filter = getBloomFilter(); // contains all keywords
for (int i = 0; i < text.length(); i++) {
for (int len = 1; len <= maxWordLen && i + len <= text.length(); len++) {
if (filter.mightContain(text.substring(i, i + len))) {
return true; // possible match, need exact scan
}
}
}
return false; // definitely no match
}

The Bloom filter is worthwhile only when the false‑positive rate is very low and most texts are clean; otherwise the extra O(L·maxWordLen) substring work outweighs the benefit.
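The getBloomFilter() helper above presupposes a filter pre‑populated with every keyword (Guava's BloomFilter.create is a common choice in practice). To make the contract concrete without a dependency, here is a minimal hand‑rolled sketch with a BitSet; the bit‑array size and the two hash positions are illustrative, not tuned:

```java
import java.util.BitSet;

public class TinyBloomFilter {
    private final BitSet bits;
    private final int size;

    TinyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap positions derived from hashCode(); a real filter uses k independent hashes.
    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, size); }

    void put(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    // false = definitely absent; true = possibly present, needs the exact Trie scan.
    boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1 << 16);
        filter.put("高清视频");
        filter.put("东京热");
        System.out.println(filter.mightContain("东京热")); // true: no false negatives
    }
}
```

The key property the pre‑check relies on is the one the comment states: a negative answer is definitive, a positive answer is only probable.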
Open‑Source Projects
ToolGood.Words – C#/Java/Python/Go/JS implementations (Java 8+); built‑in simplified/traditional Chinese conversion and full‑width/half‑width handling; the C# version is reported to exceed 3 billion characters per second.
Hutool DFA – Java 8+, lightweight API, Trie‑based DFA, suitable for medium‑size word lists.
AhoCorasickDoubleArrayTrie – Java 7+, combines AC automaton with Double‑Array Trie, excellent performance for large word lists and high throughput.
Production Recommendations
Word‑list Management : schedule regular updates, use atomic hot‑swap, separate high/medium/low sensitivity lists.
Whitelist Mechanism : maintain exception lists or require full‑word matches to avoid false positives.
Logging & Monitoring : record matched keywords, monitor latency (p99 < 10 ms), error rates, and word‑hit ratios.
Error Handling : keep the old Trie if rebuilding fails; handle empty word‑lists explicitly; set time‑outs for extremely long inputs.
Metrics : track match latency, false‑positive/negative rates, and word‑hit distribution for continuous tuning.
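The whitelist mechanism above can start as a simple post‑filter over the raw hits before any moderation action is taken; the word lists in this sketch are purely illustrative:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class WhitelistFilter {
    // Illustrative whitelist: terms the matcher flags but that are legitimate.
    private static final Set<String> WHITELIST =
            new HashSet<>(Arrays.asList("助手"));

    // Drop whitelisted hits before acting on the match results.
    static List<String> applyWhitelist(List<String> rawHits) {
        return rawHits.stream()
                .filter(hit -> !WHITELIST.contains(hit))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("助手", "东京热");
        System.out.println(applyWhitelist(hits)); // [东京热]
    }
}
```

A per‑hit exception set like this is the cheapest variant; context‑aware rules (e.g. requiring full‑word matches) need the hit positions as well, which the matchers above can provide.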
References
Apache Commons Collections – https://mvnrepository.com/artifact/org.apache.commons/commons-collections4
AhoCorasickDoubleArrayTrie – https://github.com/hankcs/AhoCorasickDoubleArrayTrie
"An Efficient Implementation of Trie Structures" (1989) – https://www.co-ding.com/assets/pdf/dat.pdf
Hutool 5.8.x – https://hutool.cn/docs/#/dfa/%E6%A6%82%E8%BF%B0
ToolGood.Words – https://github.com/toolgood/ToolGood.Words
JavaGuide