Build a Millisecond‑Scale Sensitive Word Filter with DFA and Trie in Java

This article explains why traditional string matching and regex struggle with large keyword sets, introduces a DFA‑based solution using a Trie tree for linear‑time detection, provides full Java implementations, shows real‑world integration scenarios, and explores advanced optimizations such as double‑array tries, Aho‑Corasick automata, and sharding with Bloom filters.

Code Ape Tech Column
Code Ape Tech Column
Code Ape Tech Column
Build a Millisecond‑Scale Sensitive Word Filter with DFA and Trie in Java

Why DFA‑based filtering?

Brute‑force string matching has time complexity O(n × m × k) and becomes impractically slow when the keyword list reaches thousands. Regular‑expression engines also suffer from long compilation time, high memory consumption and inability to add keywords at runtime.

DFA advantages

Linear time complexity : O(n) – the text is scanned only once, independent of the number of keywords.

Space sharing : common prefixes are stored once, greatly reducing memory usage.

Deterministic matching : no backtracking, unlike regex engines.

Dynamic extensibility : keywords can be added or removed at runtime without rebuilding the whole structure.

DFA principle

A deterministic finite automaton (DFA) is defined by the 5‑tuple M = (Q, Σ, δ, q₀, F), where Q is the set of states, Σ the input alphabet, δ the transition function, q₀ the start state and F the set of accepting states. In a sensitive‑word filter each state represents a position in a potential keyword.

Trie as a concrete DFA

Trie (prefix tree) is the visual implementation of the DFA. For the keyword list ["apple", "app", "application", "apply", "orange"] the structure is:

root
├── a
│   └── p
│       └── p [end]   ← "app"
│           └── l
│               ├── e [end]   ← "apple"
│               ├── i
│               │   └── c
│               │       └── a
│               │           └── t
│               │               └── i
│               │                   └── o
│               │                       └── n [end]   ← "application"
│               └── y [end]   ← "apply"
└── o
    └── r
        └── a
            └── n
                └── g
                    └── e [end]   ← "orange"

Key characteristics:

Prefix sharing – the node app stores the common prefix for three words.

Each edge ( ├── or └──) represents a character transition.

Nodes marked [end] are accepting states, indicating the end of a sensitive word.

Memory efficiency – five words require only ten character nodes instead of 29 if stored separately.

Core Java implementation

Trie node definition

public class TrieNode {
    private Map<Character, TrieNode> children = new HashMap<>();
    private boolean isEnd = false;
    private String keyword;

    public TrieNode getChild(char c) { return children.get(c); }
    public TrieNode addChild(char c) { return children.computeIfAbsent(c, k -> new TrieNode()); }
    public boolean hasChild(char c) { return children.containsKey(c); }
    // getters and setters omitted for brevity
}

Sensitive‑word filter (core DFA logic)

public class SensitiveWordFilter {
    private TrieNode root;
    private int minWordLength = 1;

    public SensitiveWordFilter(List<String> sensitiveWords) {
        this.root = buildTrie(sensitiveWords);
        this.minWordLength = sensitiveWords.stream()
                .mapToInt(String::length).min().orElse(1);
    }

    private TrieNode buildTrie(List<String> words) {
        TrieNode root = new TrieNode();
        for (String word : words) {
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                node = node.addChild(c);
            }
            node.isEnd = true;
            node.keyword = word;
        }
        return root;
    }

    public boolean containsSensitiveWord(String text) {
        if (text == null || text.length() < minWordLength) return false;
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (dfaMatch(chars, i)) return true;
        }
        return false;
    }

    private boolean dfaMatch(char[] chars, int start) {
        TrieNode node = root;
        for (int i = start; i < chars.length; i++) {
            char c = chars[i];
            if (!node.hasChild(c)) break; // transition fails
            node = node.getChild(c);
            if (node.isEnd) return true; // accepting state reached
        }
        return false;
    }

    // findAllWords, filter, and result wrapper omitted for brevity
}

Integration examples

Asynchronous chat message filtering

public class ChatMessageFilter {
    private SensitiveWordFilter wordFilter;
    private ExecutorService filterExecutor = Executors.newFixedThreadPool(10);

    public CompletableFuture<Message> filterMessageAsync(Message message) {
        return CompletableFuture.supplyAsync(() -> {
            String content = message.getContent();
            if (wordFilter.containsSensitiveWord(content)) {
                String filtered = wordFilter.filter(content, "***");
                message.setContent(filtered);
                recordSensitiveWords(content);
            }
            return message;
        }, filterExecutor);
    }

    private void recordSensitiveWords(String content) {
        List<SensitiveWordResult> words = wordFilter.findAllWords(content);
        updateWordFrequency(words);
    }
}

Multi‑level content audit

public class ContentAuditor {
    private SensitiveWordFilter highRiskFilter;
    private SensitiveWordFilter mediumRiskFilter;
    private SensitiveWordFilter lowRiskFilter;

    public AuditResult auditContent(String content) {
        AuditResult result = new AuditResult();
        if (!highRiskFilter.findAllWords(content).isEmpty()) {
            result.setStatus(AuditStatus.REJECT);
            result.setReason("Contains high‑risk keywords");
            return result;
        }
        if (!mediumRiskFilter.findAllWords(content).isEmpty()) {
            result.setStatus(AuditStatus.MANUAL_REVIEW);
            result.setReason("Contains medium‑risk keywords, needs manual review");
            return result;
        }
        if (!lowRiskFilter.findAllWords(content).isEmpty()) {
            String filtered = lowRiskFilter.filter(content, "***");
            result.setFilteredContent(filtered);
            result.setStatus(AuditStatus.PASS_WITH_FILTER);
        } else {
            result.setStatus(AuditStatus.PASS);
        }
        return result;
    }
}

Dynamic word‑list management (periodic reload)

@Service
public class SensitiveWordManager {
    private volatile SensitiveWordFilter filter;
    private ScheduledExecutorService updateExecutor = Executors.newSingleThreadScheduledExecutor();

    @PostConstruct
    public void init() {
        loadWords();
        updateExecutor.scheduleAtFixedRate(this::loadWords, 0, 1, TimeUnit.HOURS);
    }

    public void loadWords() {
        try {
            List<String> words = fetchLatestWords();
            this.filter = new SensitiveWordFilter(words);
            log.info("Sensitive‑word list refreshed, count: {}", words.size());
        } catch (Exception e) {
            log.error("Word‑list update failed", e);
        }
    }

    public boolean containsSensitiveWord(String text) {
        return filter != null && filter.containsSensitiveWord(text);
    }

    public String filterText(String text) {
        return filter != null ? filter.filter(text, "***") : text;
    }
}

Advanced optimization schemes

Double‑Array Trie

Compresses the Trie into two integer arrays ( base and check) to reduce memory consumption by 50‑80 % at the cost of more complex construction and limited dynamic updates. Suitable for very large offline dictionaries.

int[] base = new int[size];   // transition base values
int[] check = new int[size];  // parent state verification
// transition: next = base[current] + charCode
// if (check[next] == current) transition succeeds

Aho‑Corasick automaton

Extends the Trie with failure links, enabling a single pass to match all patterns simultaneously. Ideal for multi‑pattern log analysis or security scanning, though it consumes more memory and is harder to implement.

public class AhoCorasickAutomaton {
    private Node root = new Node();
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        Node fail;                     // failure link
        List<String> output = new ArrayList<>(); // matched patterns
    }
    public void build(List<String> patterns) { /* build Trie then BFS to set fail links */ }
    public List<MatchResult> search(String text) { /* single‑pass matching */ }
    static class MatchResult { String pattern; int start; int end; }
}

Sharding + Bloom‑filter pre‑screening

Splits the keyword set across multiple filter instances and uses a Bloom filter to quickly discard texts that definitely contain no keywords, reducing unnecessary shard lookups. Provides horizontal scalability for high‑throughput services, at the expense of a small false‑positive rate.

public class ShardedFilter {
    private List<SensitiveWordFilter> shards;
    private BloomFilter<String> preFilter;

    public boolean containsSensitive(String text) {
        if (!preFilter.mightContain(text)) return false; // guaranteed miss
        int shardIdx = Math.abs(text.hashCode()) % shards.size();
        return shards.get(shardIdx).containsSensitiveWord(text);
    }
}

Repository

All example code is available at:

https://github.com/yuboon/java-examples/tree/master/springboot-dfa
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaalgorithmDFATrieSensitive Word Filtering
Code Ape Tech Column
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.