Build a Millisecond‑Scale Sensitive Word Filter with DFA and Trie in Java
This article explains why brute‑force string matching and regular expressions struggle with large keyword sets, then introduces a DFA‑based solution that uses a Trie for fast, single‑pass detection. It provides full Java implementations, shows real‑world integration scenarios, and explores advanced optimizations such as double‑array tries, Aho‑Corasick automata, and sharding with Bloom‑filter pre‑screening.
Why DFA‑based filtering?
Brute‑force matching checks every keyword at every position of the text, giving O(n × m × k) time for text length n, m keywords, and average keyword length k; it becomes impractically slow once the keyword list reaches the thousands. Regular‑expression engines fare little better: a giant alternation pattern is slow to compile, consumes a lot of memory, and cannot accept new keywords at runtime without recompilation.
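For reference, the brute‑force baseline that the O(n × m × k) bound describes can be sketched like this (a toy illustration, not production code):

```java
import java.util.List;

public class NaiveFilter {
    // Brute-force multi-keyword check: for every start position in the text,
    // try every keyword character by character -> O(n * m * k).
    public static boolean containsAny(String text, List<String> keywords) {
        for (int i = 0; i < text.length(); i++) {          // n positions
            for (String kw : keywords) {                   // m keywords
                if (matchesAt(text, i, kw)) return true;   // up to k chars each
            }
        }
        return false;
    }

    private static boolean matchesAt(String text, int start, String kw) {
        if (start + kw.length() > text.length()) return false;
        for (int j = 0; j < kw.length(); j++) {
            if (text.charAt(start + j) != kw.charAt(j)) return false;
        }
        return true;
    }
}
```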
DFA advantages
Near‑linear matching: the text is scanned only once; each start position follows at most L trie transitions (L = length of the longest keyword), giving O(n × L) regardless of how many keywords are loaded.
Space sharing: common prefixes are stored once, greatly reducing memory usage.
Deterministic matching: no backtracking, unlike regex engines.
Dynamic extensibility: keywords can be added or removed at runtime without rebuilding the whole structure.
DFA principle
A deterministic finite automaton (DFA) is defined by the 5‑tuple M = (Q, Σ, δ, q₀, F), where Q is the set of states, Σ the input alphabet, δ the transition function, q₀ the start state and F the set of accepting states. In a sensitive‑word filter each state represents a position in a potential keyword.
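To make the 5‑tuple concrete, here is a hand‑written DFA for the single keyword "ab" (a toy illustration; the Trie described next generalizes this to whole keyword lists):

```java
public class TinyDfa {
    // Hand-built DFA for the single keyword "ab":
    // Q = {0, 1, 2}, q0 = 0, F = {2}
    // delta: 0 --a--> 1, 1 --b--> 2; any other input resets toward the start.
    public static boolean matches(String text) {
        int state = 0;
        for (char c : text.toCharArray()) {
            if (state == 0 && c == 'a') state = 1;
            else if (state == 1 && c == 'b') state = 2;
            else state = (c == 'a') ? 1 : 0;  // restart, reusing 'a' if seen
            if (state == 2) return true;      // accepting state in F reached
        }
        return false;
    }
}
```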
Trie as a concrete DFA
A Trie (prefix tree) is a natural concrete representation of this DFA. For the keyword list ["apple", "app", "application", "apply", "orange"] the structure is:
root
├── a
│   └── p
│       └── p [end] ← "app"
│           └── l
│               ├── e [end] ← "apple"
│               ├── i
│               │   └── c
│               │       └── a
│               │           └── t
│               │               └── i
│               │                   └── o
│               │                       └── n [end] ← "application"
│               └── y [end] ← "apply"
└── o
    └── r
        └── a
            └── n
                └── g
                    └── e [end] ← "orange"
Key characteristics:
Prefix sharing – the path a → p → p is stored once and reused by "app", "apple", "application", and "apply".
Each edge (├── or └──) represents a single‑character state transition.
Nodes marked [end] are accepting states: reaching one means a complete sensitive word has been matched.
Memory efficiency – the five words need only 19 character nodes instead of the 30 characters required to store them separately.
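The node count is easy to verify with a throwaway check (assuming the keyword list above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrieNodeCount {
    static class Node { Map<Character, Node> children = new HashMap<>(); }

    // Builds a bare trie and counts character nodes, confirming that the five
    // example keywords share prefixes: 19 nodes versus 30 characters stored separately.
    public static int countNodes(List<String> words) {
        Node root = new Node();
        int count = 0;
        for (String w : words) {
            Node node = root;
            for (char c : w.toCharArray()) {
                Node child = node.children.get(c);
                if (child == null) {
                    child = new Node();
                    node.children.put(c, child);
                    count++;               // a new character node was created
                }
                node = child;
            }
        }
        return count;
    }
}
```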
Core Java implementation
Trie node definition
import java.util.HashMap;
import java.util.Map;

public class TrieNode {
    private Map<Character, TrieNode> children = new HashMap<>();
    // package-private so SensitiveWordFilter (same package) can access them directly
    boolean isEnd = false;
    String keyword;

    public TrieNode getChild(char c) { return children.get(c); }
    public TrieNode addChild(char c) { return children.computeIfAbsent(c, k -> new TrieNode()); }
    public boolean hasChild(char c) { return children.containsKey(c); }
}
Sensitive‑word filter (core DFA logic)
public class SensitiveWordFilter {
    private TrieNode root;
    private int minWordLength = 1;

    public SensitiveWordFilter(List<String> sensitiveWords) {
        this.root = buildTrie(sensitiveWords);
        this.minWordLength = sensitiveWords.stream()
                .mapToInt(String::length).min().orElse(1);
    }

    private TrieNode buildTrie(List<String> words) {
        TrieNode root = new TrieNode();
        for (String word : words) {
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                node = node.addChild(c);
            }
            node.isEnd = true;
            node.keyword = word;
        }
        return root;
    }

    public boolean containsSensitiveWord(String text) {
        if (text == null || text.length() < minWordLength) return false;
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (dfaMatch(chars, i)) return true;
        }
        return false;
    }

    private boolean dfaMatch(char[] chars, int start) {
        TrieNode node = root;
        for (int i = start; i < chars.length; i++) {
            char c = chars[i];
            if (!node.hasChild(c)) break;   // transition fails
            node = node.getChild(c);
            if (node.isEnd) return true;    // accepting state reached
        }
        return false;
    }

    // findAllWords, filter, and result wrapper omitted for brevity
}
Integration examples
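The examples in this section call filter(...) and findAllWords(...), which the core class omits. Here is one possible standalone sketch of the filter logic (assumptions: the longest match at each position wins, and the whole matched span is replaced):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of the omitted filter logic; Node is a pared-down TrieNode.
public class FilterSketch {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    public FilterSketch(List<String> words) {
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) n = n.children.computeIfAbsent(c, k -> new Node());
            n.isEnd = true;
        }
    }

    // Replaces every longest keyword match with the replacement string.
    public String filter(String text, String replacement) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) { sb.append(replacement); i += len; }
            else sb.append(text.charAt(i++));
        }
        return sb.toString();
    }

    // Length of the longest keyword starting at `start`, or 0 if none matches.
    private int matchLength(String text, int start) {
        Node n = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            n = n.children.get(text.charAt(i));
            if (n == null) break;
            if (n.isEnd) longest = i - start + 1;
        }
        return longest;
    }
}
```

findAllWords follows the same scan but records each match's keyword and offsets instead of replacing it.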
Asynchronous chat message filtering
public class ChatMessageFilter {
    private SensitiveWordFilter wordFilter;
    private ExecutorService filterExecutor = Executors.newFixedThreadPool(10);

    public CompletableFuture<Message> filterMessageAsync(Message message) {
        return CompletableFuture.supplyAsync(() -> {
            String content = message.getContent();
            if (wordFilter.containsSensitiveWord(content)) {
                String filtered = wordFilter.filter(content, "***");
                message.setContent(filtered);
                recordSensitiveWords(content);
            }
            return message;
        }, filterExecutor);
    }

    private void recordSensitiveWords(String content) {
        List<SensitiveWordResult> words = wordFilter.findAllWords(content);
        updateWordFrequency(words);
    }
}
Multi‑level content audit
public class ContentAuditor {
    private SensitiveWordFilter highRiskFilter;
    private SensitiveWordFilter mediumRiskFilter;
    private SensitiveWordFilter lowRiskFilter;

    public AuditResult auditContent(String content) {
        AuditResult result = new AuditResult();
        if (!highRiskFilter.findAllWords(content).isEmpty()) {
            result.setStatus(AuditStatus.REJECT);
            result.setReason("Contains high-risk keywords");
            return result;
        }
        if (!mediumRiskFilter.findAllWords(content).isEmpty()) {
            result.setStatus(AuditStatus.MANUAL_REVIEW);
            result.setReason("Contains medium-risk keywords, needs manual review");
            return result;
        }
        if (!lowRiskFilter.findAllWords(content).isEmpty()) {
            String filtered = lowRiskFilter.filter(content, "***");
            result.setFilteredContent(filtered);
            result.setStatus(AuditStatus.PASS_WITH_FILTER);
        } else {
            result.setStatus(AuditStatus.PASS);
        }
        return result;
    }
}
Dynamic word‑list management (periodic reload)
@Service
public class SensitiveWordManager {
    private volatile SensitiveWordFilter filter;
    private ScheduledExecutorService updateExecutor = Executors.newSingleThreadScheduledExecutor();

    @PostConstruct
    public void init() {
        loadWords();  // synchronous first load so the filter is usable immediately
        updateExecutor.scheduleAtFixedRate(this::loadWords, 1, 1, TimeUnit.HOURS);
    }

    public void loadWords() {
        try {
            List<String> words = fetchLatestWords();
            this.filter = new SensitiveWordFilter(words);  // atomic swap via volatile
            log.info("Sensitive-word list refreshed, count: {}", words.size());
        } catch (Exception e) {
            log.error("Word-list update failed", e);
        }
    }

    public boolean containsSensitiveWord(String text) {
        return filter != null && filter.containsSensitiveWord(text);
    }

    public String filterText(String text) {
        return filter != null ? filter.filter(text, "***") : text;
    }
}
Advanced optimization schemes
Double‑Array Trie
Compresses the Trie into two integer arrays (base and check) to reduce memory consumption by 50–80 % at the cost of more complex construction and limited dynamic updates. Suitable for very large offline dictionaries.
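As a hand‑constructed toy illustration of the base/check transition rule (single keyword "ab"; the character codes a = 1, b = 2 and all array values below are chosen by hand, not produced by a real builder):

```java
public class DoubleArrayDemo {
    // Hand-constructed double-array trie for the single keyword "ab".
    // State 0 is the root, state 2 = "a" read, state 4 = "ab" read (accepting).
    static final int[] BASE  = {1, 0, 2, 0, 0};
    static final int[] CHECK = {-1, -1, 0, -1, 2};  // -1 marks an unused slot

    static int code(char c) { return c - 'a' + 1; }

    // Returns true if `word` is exactly the stored keyword "ab".
    public static boolean lookup(String word) {
        int state = 0;
        for (char c : word.toCharArray()) {
            int next = BASE[state] + code(c);               // transition rule
            if (next >= CHECK.length || CHECK[next] != state) return false;
            state = next;                                    // transition succeeds
        }
        return state == 4;                                   // accepting state
    }
}
```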
int[] base = new int[size]; // transition base values
int[] check = new int[size]; // parent state verification
// transition: next = base[current] + charCode
// if (check[next] == current) transition succeeds
Aho‑Corasick automaton
Extends the Trie with failure links, enabling a single pass to match all patterns simultaneously. Ideal for multi‑pattern log analysis or security scanning, though it consumes more memory and is harder to implement.
public class AhoCorasickAutomaton {
    private Node root = new Node();

    static class Node {
        Map<Character, Node> children = new HashMap<>();
        Node fail;                                  // failure link
        List<String> output = new ArrayList<>();    // patterns ending here
    }

    public void build(List<String> patterns) { /* build Trie, then BFS to set fail links */ }
    public List<MatchResult> search(String text) { /* single-pass matching */ }

    static class MatchResult { String pattern; int start; int end; }
}
Sharding + Bloom‑filter pre‑screening
Splits the keyword set across multiple filter instances (which can be scanned independently or in parallel) and uses a Bloom filter over short keyword prefixes to quickly discard texts that definitely contain no keywords, skipping the DFA scans entirely. Provides horizontal scalability for high‑throughput services at the expense of a small false‑positive rate in the pre‑screen.
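A Bloom filter's mechanics fit in a few lines. The sketch below uses two ad‑hoc hash functions over a BitSet; the class name, sizing, and hash choices are illustrative, and production code would use a tuned library implementation such as Guava's BloomFilter:

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    // Minimal Bloom filter: k = 2 hash functions over one bit array.
    // False positives are possible; false negatives are not.
    private final BitSet bits;
    private final int size;

    public SimpleBloomFilter(int size) {
        this.bits = new BitSet(size);
        this.size = size;
    }

    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + s.length(), size); }

    public void add(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    public boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));  // both bits set => maybe present
    }
}
```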
public class ShardedFilter {
    private List<SensitiveWordFilter> shards;
    private BloomFilter<String> preFilter;  // populated with 2-char keyword prefixes

    public boolean containsSensitive(String text) {
        // Pre-screen: if no position in the text can start a keyword
        // (judged by its 2-char prefix), skip the DFA scans entirely.
        boolean candidate = false;
        for (int i = 0; i + 2 <= text.length(); i++) {
            if (preFilter.mightContain(text.substring(i, i + 2))) {
                candidate = true;
                break;
            }
        }
        if (!candidate) return false;  // guaranteed miss: no false negatives
        // Keywords are partitioned across shards, so each shard must be checked;
        // the scans are independent and can run in parallel.
        for (SensitiveWordFilter shard : shards) {
            if (shard.containsSensitiveWord(text)) return true;
        }
        return false;
    }
}
Repository
All example code is available at:
https://github.com/yuboon/java-examples/tree/master/springboot-dfa
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
