Building a High‑Performance Sensitive‑Word Filter with SpringBoot and DFA

This article explains why traditional string‑search and regex methods struggle with large keyword sets, introduces the deterministic finite automaton (DFA) approach using a Trie structure for linear‑time matching, provides full Java implementations, and discusses real‑world applications and advanced optimizations such as double‑array Tries, Aho‑Corasick, and sharding with Bloom filters.

Java Companion

Introduction

Sensitive‑word filtering is essential for social platforms, e‑commerce, and gaming to maintain a healthy content ecosystem. Traditional brute‑force string search and regular‑expression matching degrade sharply as the keyword list grows.

Why DFA Is Needed

Brute‑force matching has a time complexity of O(n × m × k), where n is the text length, m the number of keywords, and k the average keyword length; this causes seconds‑long delays for a 1,000‑character article once the dictionary reaches thousands of words. Regular expressions fare little better: a giant alternation pattern is slow to compile, memory‑hungry, and must be recompiled whenever the keyword list changes.
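To make the O(n × m × k) cost concrete, a minimal brute‑force sketch looks like the following (illustrative only; the class and method names are not from the article):

```java
import java.util.List;

public class BruteForceFilter {
    // n start positions × m keywords × up to k character comparisons each
    public static boolean contains(String text, List<String> keywords) {
        for (int i = 0; i < text.length(); i++) {   // n positions
            for (String word : keywords) {           // m keywords
                if (text.startsWith(word, i)) {      // up to k comparisons
                    return true;
                }
            }
        }
        return false;
    }
}
```

Every position restarts the full keyword loop, which is exactly why runtime grows linearly with the dictionary size.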

DFA Advantages

Linear time complexity: O(n) – the text is scanned once, independent of keyword count.

Space sharing: common prefixes are stored once, dramatically reducing memory.

Deterministic matching: no backtracking, unlike regex.

Dynamic extensibility: keywords can be added or removed at runtime without rebuilding the whole structure.

In a test where the keyword set grew from 1,000 to 10,000 entries, brute‑force time increased about tenfold, while DFA time remained almost unchanged.

DFA Principle

A DFA (Deterministic Finite Automaton) is defined as a five‑tuple M = (Q, Σ, δ, q₀, F) where Q is a finite set of states, Σ the input alphabet, δ: Q × Σ → Q the transition function, q₀ the start state, and F the set of accepting states. In a filter, each state represents a position in a potential keyword.
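As a toy illustration of this five‑tuple (not part of the article's filter), the following hard‑coded DFA over Σ = {a, b} accepts exactly the strings containing the substring "ab"; the states and transition table are chosen purely for the example:

```java
public class TinyDfa {
    // Q = {q0, q1, q2}: q0 = start, q1 = just saw 'a', q2 = saw "ab" (accepting)
    // rows = current state, columns = input ('a' -> 0, 'b' -> 1)
    private static final int[][] DELTA = {
        {1, 0}, // q0: 'a' -> q1, 'b' -> q0
        {1, 2}, // q1: 'a' -> q1, 'b' -> q2
        {2, 2}, // q2: accepting sink, stays accepting
    };

    public static boolean accepts(String input) {
        int state = 0; // q0
        for (char c : input.toCharArray()) {
            state = DELTA[state][c == 'a' ? 0 : 1]; // δ(state, c)
        }
        return state == 2; // F = {q2}
    }
}
```

In the filter, each Trie node plays the role of a state, and following a child edge is one application of δ.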

Trie Visualization

Consider the keyword list ["apple", "app", "application", "apply", "orange"]. The corresponding Trie looks like:

root
├── a
│   └── p
│       └── p [end] ← "app"
│           └── l
│               ├── e [end] ← "apple"
│               ├── i
│               │   └── c
│               │       └── a
│               │           └── t
│               │               └── i
│               │                   └── o
│               │                       └── n [end] ← "application"
│               └── y [end] ← "apply"
└── o
    └── r
        └── a
            └── n
                └── g
                    └── e [end] ← "orange"

Key features:

Prefix sharing: "app" is stored once for all words that share it.

State paths: each branch represents a character transition; [end] marks an accepting state.

Space efficiency: thanks to prefix sharing, the five words need only 19 character nodes in the Trie instead of the 30 characters stored separately.

Lookup efficiency: finding "application" needs 11 character comparisons, "app" needs only 3.

Core Implementation

TrieNode defines the node structure:

public class TrieNode {
    // child map: character → TrieNode
    private Map<Character, TrieNode> children = new HashMap<>();
    // marks end of a keyword
    private boolean isEnd = false;
    // stores the full keyword for output
    private String keyword;
    public TrieNode getChild(char c) { return children.get(c); }
    public TrieNode addChild(char c) { return children.computeIfAbsent(c, k -> new TrieNode()); }
    public boolean hasChild(char c) { return children.containsKey(c); }
    // accessors used by the filter below
    public boolean isEnd() { return isEnd; }
    public void setEnd(boolean end) { this.isEnd = end; }
    public String getKeyword() { return keyword; }
    public void setKeyword(String keyword) { this.keyword = keyword; }
}

SensitiveWordFilter contains the DFA logic:

public class SensitiveWordFilter {
    private TrieNode root;
    private int minWordLength = 1;
    public SensitiveWordFilter(List<String> sensitiveWords) {
        this.root = buildTrie(sensitiveWords);
        this.minWordLength = sensitiveWords.stream().mapToInt(String::length).min().orElse(1);
    }
    private TrieNode buildTrie(List<String> words) {
        TrieNode root = new TrieNode();
        for (String word : words) {
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                node = node.addChild(c);
            }
            node.setEnd(true);
            node.setKeyword(word);
        }
        return root;
    }
    public boolean containsSensitiveWord(String text) {
        if (text == null || text.length() < minWordLength) return false;
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (dfaMatch(chars, i)) return true;
        }
        return false;
    }
    private boolean dfaMatch(char[] chars, int start) {
        TrieNode node = root;
        for (int i = start; i < chars.length; i++) {
            char c = chars[i];
            if (!node.hasChild(c)) break; // transition fails
            node = node.getChild(c);
            if (node.isEnd()) return true; // accepting state
        }
        return false;
    }
    public List<SensitiveWordResult> findAllWords(String text) { /* omitted for brevity */ }
    public String filter(String text, String replacement) { /* omitted for brevity */ }
}
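The two omitted methods can be filled in roughly as follows. To keep the sketch self‑contained and testable it folds a minimal Trie into one class and reports matches as {start, endExclusive} index pairs instead of SensitiveWordResult objects; the article's actual implementation may differ, for example in how overlaps are handled (this version takes the leftmost, longest, non‑overlapping match):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleSensitiveFilter {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    public SimpleSensitiveFilter(List<String> words) {
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.isEnd = true;
        }
    }

    // length of the longest keyword starting at 'start'; 0 if none
    private int matchLength(String text, int start) {
        Node n = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            n = n.children.get(text.charAt(i));
            if (n == null) break;          // transition fails
            if (n.isEnd) best = i - start + 1; // remember longest accept so far
        }
        return best;
    }

    // every non-overlapping leftmost-longest match as {start, endExclusive}
    public List<int[]> findAllWords(String text) {
        List<int[]> results = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                results.add(new int[]{i, i + len});
                i += len; // skip past the match
            } else {
                i++;
            }
        }
        return results;
    }

    // replace each match with the replacement string
    public String filter(String text, String replacement) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] m : findAllWords(text)) {
            sb.append(text, last, m[0]).append(replacement);
            last = m[1];
        }
        return sb.append(text.substring(last)).toString();
    }
}
```

Note that matchLength keeps scanning after the first accepting state so that "badge" wins over its prefix "bad"; the boolean containsSensitiveWord check above can stop at the first accept, but replacement generally should not.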

SensitiveWordResult stores a match:

public class SensitiveWordResult {
    private String word; // matched keyword
    private int start;    // start index
    private int end;      // end index
    public SensitiveWordResult(String word, int start, int end) {
        this.word = word; this.start = start; this.end = end;
    }
    // getters and toString omitted
}

Practical Scenarios

ChatMessageFilter demonstrates asynchronous filtering in a high‑concurrency chat system:

public class ChatMessageFilter {
    private SensitiveWordFilter wordFilter;
    private ExecutorService filterExecutor = Executors.newFixedThreadPool(10);
    public CompletableFuture<Message> filterMessageAsync(Message message) {
        return CompletableFuture.supplyAsync(() -> {
            String content = message.getContent();
            if (wordFilter.containsSensitiveWord(content)) {
                String filtered = wordFilter.filter(content, "***");
                message.setContent(filtered);
                recordSensitiveWords(content);
            }
            return message;
        }, filterExecutor);
    }
    private void recordSensitiveWords(String content) { /* statistics */ }
}

ContentAuditor applies multi‑level risk filters (high, medium, low) and returns an audit result with status and reason.

SensitiveWordManager shows dynamic word‑library management with a scheduled task that reloads the dictionary from a database or configuration center.
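One way such a reload could be sketched without Spring's @Scheduled is a plain ScheduledExecutorService plus an atomic reference swap. Everything here is illustrative: the generic holder stands in for a SensitiveWordManager, and the builder supplier is a hypothetical stand‑in for the database or config‑center load:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Holds the "current" filter and rebuilds it on a schedule.
// F would be SensitiveWordFilter in the article's setup.
public class ReloadingHolder<F> {
    private final AtomicReference<F> current = new AtomicReference<>();
    private final Supplier<F> builder; // hypothetical DB / config-center load
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public ReloadingHolder(Supplier<F> builder) {
        this.builder = builder;
        reload(); // build eagerly so readers never see null
        // rebuild every 5 minutes; the reference swap is atomic, so readers
        // always observe either the old filter or a fully built new one
        scheduler.scheduleAtFixedRate(this::reload, 5, 5, TimeUnit.MINUTES);
    }

    public final void reload() {
        current.set(builder.get()); // build aside, then swap: no reader locking
    }

    public F get() {
        return current.get();
    }
}
```

The key design point is that the new Trie is built off to the side and swapped in one step, so in‑flight requests never see a half‑built structure.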

Advanced Optimizations

Double‑Array Trie compresses the Trie into two integer arrays (base and check) to cut memory usage by 50–80% at the cost of higher construction complexity and difficult runtime updates. Suitable for large offline word banks.

// simplified double‑array layout
int[] base = new int[size];   // state transition base value
int[] check = new int[size];  // state transition check value
// transition: next = base[current] + charCode
// if check[next] == current → transition succeeds
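To make the layout concrete, here is a hand‑filled toy automaton for the single keyword "ab" (with 'a' coded as 1, 'b' as 2, and the root at index 1). Real double‑array builders compute base values that avoid index collisions automatically; the values below were chosen by hand for the example:

```java
public class DoubleArrayDemo {
    // root = state 1; base[1] = 1, so 'a' (code 1) -> index 1 + 1 = 2
    // base[2] = 2, so 'b' (code 2) -> index 2 + 2 = 4, which is accepting
    static final int[] BASE      = {0, 1, 2, 0, 0};
    static final int[] CHECK     = {0, 0, 1, 0, 2};
    static final boolean[] END   = {false, false, false, false, true};

    static int code(char c) { return c - 'a' + 1; } // 'a' -> 1, 'b' -> 2

    public static boolean matchesAt(String text, int start) {
        int state = 1; // root
        for (int i = start; i < text.length(); i++) {
            int next = BASE[state] + code(text.charAt(i));
            // transition succeeds only if check[next] points back at us
            if (next >= CHECK.length || CHECK[next] != state) return false;
            state = next;
            if (END[state]) return true; // accepting state reached
        }
        return false;
    }
}
```

Two array reads per character replace a HashMap lookup, which is where both the memory savings and the speedup come from.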

Aho‑Corasick Automaton adds failure pointers to the Trie, enabling a single pass to match multiple patterns simultaneously.

public class AhoCorasickAutomaton {
    private Node root = new Node();
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        Node fail;                     // failure link
        List<String> output = new ArrayList<>(); // matched patterns
    }
    public void build(List<String> patterns) { /* build Trie then BFS to set fail links */ }
    public List<MatchResult> search(String text) { /* single‑pass matching */ }
    static class MatchResult { String pattern; int start; int end; /* ctor */ }
}
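A minimal working version of the omitted build/search steps might look like this (a sketch, not the article's implementation): the Trie is built first, failure links are set by BFS, and a single left‑to‑right pass reports every occurrence. To stay compact, matches are returned as "pattern@endIndexExclusive" strings rather than MatchResult objects:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class MiniAhoCorasick {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        Node fail;
        List<String> output = new ArrayList<>();
    }

    private final Node root = new Node();

    public void build(List<String> patterns) {
        // 1) plain Trie insertion
        for (String p : patterns) {
            Node n = root;
            for (char c : p.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.output.add(p);
        }
        // 2) BFS to set failure links, level by level
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.children.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.children.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.children.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.children.get(c);
                child.output.addAll(child.fail.output); // inherit shorter suffix matches
                queue.add(child);
            }
        }
    }

    // single pass; each element is "pattern@endIndexExclusive"
    public List<String> search(String text) {
        List<String> results = new ArrayList<>();
        Node n = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (n != root && !n.children.containsKey(c)) n = n.fail; // follow fail links
            n = n.children.getOrDefault(c, root);
            for (String p : n.output) {
                results.add(p + "@" + (i + 1));
            }
        }
        return results;
    }
}
```

With the classic patterns {he, she, hers}, scanning "ushers" yields she, he, and hers in one pass; the failure link from "she" to "he" is what lets overlapping matches surface without restarting the scan.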

Sharding + Bloom Filter partitions the keyword set across multiple filter instances (bounding per‑instance memory) and uses a Bloom filter as a cheap pre‑check to discard texts that definitely contain no keywords. Two details matter: because the keywords, not the texts, are partitioned, every shard must be consulted, and the Bloom filter has to be queried with something keyword‑shaped (here, each character as a possible keyword start) rather than with the whole text:

public class ShardedFilter {
    private List<SensitiveWordFilter> shards;
    // Bloom filter seeded with the first character of every keyword
    private BloomFilter<String> startCharFilter;
    public boolean containsSensitive(String text) {
        boolean possibleStart = text.chars()
                .anyMatch(c -> startCharFilter.mightContain(String.valueOf((char) c)));
        if (!possibleStart) return false; // guaranteed miss
        // keywords are spread across the shards, so check each one
        for (SensitiveWordFilter shard : shards) {
            if (shard.containsSensitiveWord(text)) return true;
        }
        return false;
    }
}

Conclusion

DFA, implemented via a Trie, delivers millisecond‑level response for real‑time sensitive‑word filtering while keeping memory usage low. Depending on the scale and latency requirements, developers can adopt the basic DFA, a double‑array Trie, an Aho‑Corasick automaton, or a sharded Bloom‑filter architecture.

Full example code: https://github.com/yuboon/java-examples/tree/master/springboot-dfa