How We Built a High‑Performance, Low‑Cost Content Moderation System with Trie + Aho‑Corasick
Faced with minutes-long posting delays and exploding review costs in a fast-growing social app, the team introduced 24-hour shift staffing, a MySQL-backed local blacklist, an in-memory Trie + Aho-Corasick matcher, Redis-driven hot updates, and a machine-audit fallback with a feedback loop, dramatically cutting latency, cost and false positives.
Background and Initial Problem
When the author joined a newly launched social project, the friend-circle feature suffered minutes-long posting delays: only two customer-service staff handled all content reviews, and reviewing stopped entirely after working hours, precisely when users are most active.
First Improvement – Shift Work
A two‑shift schedule was introduced, each shift with two staff covering morning‑afternoon and afternoon‑midnight, enabling 24‑hour review and dramatically reducing latency.
New Challenges After User Growth
Rapid user acquisition drove up post, comment and private-message volume, overwhelming the manual reviewers, while routing everything to a third-party machine-audit vendor (e.g., Shumei) proved too costly and error-prone, and its mistakes sometimes increased rather than reduced the manual workload.
Design Goals
Reduce machine‑audit call volume by handling as many cases locally as possible.
Maintain millisecond‑level user experience for posts, comments and messages.
Minimize false‑positive rates and provide a clear remediation path.
Support real‑time updates of the word list.
Keep overall audit cost controllable.
Local Blacklist Architecture
A MySQL table api_sensitive_words stores blacklist, whitelist and normal words with fields for type, category, source, status, hit count, etc.
CREATE TABLE `api_sensitive_words` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'Auto-increment ID',
  `keyword` VARCHAR(255) NOT NULL COMMENT 'Sensitive word',
  `type` ENUM('BLACK','WHITE','NORMAL') DEFAULT 'NORMAL' COMMENT 'Type: blacklist / whitelist / normal',
  `category` ENUM('PORN','POLITICS','TERROR','AD','INSULT','OTHER') DEFAULT 'OTHER' COMMENT 'Category',
  `source` ENUM('HUMAN','VENDOR','AUDIT') DEFAULT 'HUMAN' COMMENT 'Source',
  `status` TINYINT(1) DEFAULT 1 COMMENT 'Status: 1 enabled, 0 disabled',
  `hit_count` BIGINT DEFAULT 0 COMMENT 'Hit count',
  `updated_by` VARCHAR(64) DEFAULT NULL COMMENT 'Last operator',
  `updated_at` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Last updated',
  PRIMARY KEY (`id`),
  KEY `idx_keyword` (`keyword`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Sensitive-word table';

High-Performance Matching with Trie + Aho-Corasick
MySQL fuzzy search and Elasticsearch were rejected: a LIKE '%word%' scan cannot use indexes and scales poorly, Elasticsearch adds operational cost, and both struggle with fuzzy/emoji variants. A Trie (prefix tree) organizes the dictionary so lookup cost depends on the input text rather than the dictionary size, and the Aho-Corasick automaton adds failure pointers that eliminate backtracking, giving a single O(N) pass over the text and millisecond-level detection even with tens of thousands of words.
package trie

import (
	"container/list"
	"sync"
)

// TrieNode is a single node in the Aho-Corasick automaton.
type TrieNode struct {
	children map[rune]*TrieNode
	fail     *TrieNode // failure pointer: longest proper suffix that is also a dictionary prefix
	isEnd    bool      // true if a dictionary word ends at this node
	word     string    // the word that ends here
}

// ACTrie is a thread-safe Trie + Aho-Corasick matcher.
type ACTrie struct {
	root *TrieNode
	mu   sync.RWMutex
}

func NewACTrie() *ACTrie {
	return &ACTrie{root: &TrieNode{children: make(map[rune]*TrieNode)}}
}

// Build rebuilds the automaton from scratch with the given word list.
func (ac *ACTrie) Build(words []string) {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	ac.root = &TrieNode{children: make(map[rune]*TrieNode)}
	for _, w := range words {
		node := ac.root
		for _, ch := range w {
			if node.children[ch] == nil {
				node.children[ch] = &TrieNode{children: make(map[rune]*TrieNode)}
			}
			node = node.children[ch]
		}
		node.isEnd = true
		node.word = w
	}
	ac.buildFailPointers()
}

// buildFailPointers links each node to its longest matching suffix via BFS.
func (ac *ACTrie) buildFailPointers() {
	q := list.New()
	for _, n := range ac.root.children {
		n.fail = ac.root
		q.PushBack(n)
	}
	for q.Len() > 0 {
		e := q.Front()
		q.Remove(e)
		cur := e.Value.(*TrieNode)
		for ch, child := range cur.children {
			f := cur.fail
			for f != nil && f.children[ch] == nil {
				f = f.fail
			}
			if f == nil {
				child.fail = ac.root
			} else {
				child.fail = f.children[ch]
			}
			q.PushBack(child)
		}
	}
}

// Match scans text once and returns every dictionary word it contains.
func (ac *ACTrie) Match(text string) []string {
	ac.mu.RLock()
	defer ac.mu.RUnlock()
	var res []string
	node := ac.root
	for _, ch := range text {
		for node != ac.root && node.children[ch] == nil {
			node = node.fail
		}
		if node.children[ch] != nil {
			node = node.children[ch]
		}
		// Follow fail links to also collect shorter words ending at this position.
		for tmp := node; tmp != ac.root; tmp = tmp.fail {
			if tmp.isEnd {
				res = append(res, tmp.word)
			}
		}
	}
	return res
}

Runtime Architecture
The authoritative word list lives in MySQL. Backend admin tools modify it; changes are written to Redis. Redis Pub/Sub publishes a sensitive:update event. Go services subscribe, rebuild the Trie + Aho‑Corasick engine in memory (typically within 100 ms – 1 s), and then process user requests.
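The subscribe-and-rebuild loop can be sketched as follows. To keep the example self-contained and runnable, a Go channel stands in for the Redis sensitive:update subscription and a naive substring matcher stands in for the ACTrie; a real service would subscribe with a Redis client and call ACTrie.Build on each message.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// matcher stands in for the ACTrie from the previous section; a simple
// substring scan keeps this sketch self-contained.
type matcher struct {
	mu    sync.RWMutex
	words []string
}

// Build swaps in a new word list under the write lock, just as ACTrie.Build does.
func (m *matcher) Build(words []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.words = words
}

// Match reads the current list under the read lock, so in-flight requests
// never see a half-built dictionary.
func (m *matcher) Match(text string) []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	var hits []string
	for _, w := range m.words {
		if strings.Contains(text, w) {
			hits = append(hits, w)
		}
	}
	return hits
}

func main() {
	m := &matcher{}
	m.Build([]string{"spam"})

	// A channel stands in for a Redis Pub/Sub subscription to sensitive:update;
	// each message carries the fresh word list and triggers a full rebuild.
	updates := make(chan []string)
	done := make(chan struct{})
	go func() {
		for words := range updates {
			m.Build(words) // hot-reload: atomically swap in the new list
		}
		close(done)
	}()

	fmt.Println(m.Match("buy spam here")) // [spam]
	updates <- []string{"spam", "scam"}   // admin publishes an update
	close(updates)
	<-done
	fmt.Println(m.Match("a scam offer")) // [scam]
}
```

The RWMutex-guarded swap is what makes the "typically within 100 ms – 1 s" rebuild invisible to user traffic: readers keep matching against the old dictionary until the new one is fully built.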
┌────────────────────────┐
│  api_sensitive_words   │  ← MySQL persistent storage
└──────────┬─────────────┘
           │
   Backend management UI
           │
┌──────────▼─────────────┐
│      Redis cache       │  ← latest word list
└──────────┬─────────────┘
           │
 Redis Pub/Sub (sensitive:update)
           │
┌──────────▼─────────────┐
│ Go service – in-memory │
│  Trie + Aho-Corasick   │
└──────────┬─────────────┘
           │
User request → local match → block / report
           │
           ▼
If no match → third-party machine audit

Machine-Audit Fallback and Feedback Loop
Third‑party audit returns three statuses:
PASS – content is safe; it is allowed but the candidate words are recorded for later review.
REVIEW – suspicious; the content is allowed temporarily and the candidate words are stored for manual verification.
REJECT – definitely violating; the content is blocked and the candidate words are stored for verification.
Candidate words are stored in api_sensitive_candidates. After manual review, confirmed violations are added to the blacklist, confirmed false positives to the whitelist, and a Redis update triggers an immediate hot‑reload of the in‑memory engine.
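In code, the dispatch on the three statuses might look like the sketch below; the Verdict type and the exact return shape are illustrative, not from the original system.

```go
package main

import "fmt"

// Verdict is the third-party machine-audit result; the three values mirror
// the statuses described above.
type Verdict string

const (
	Pass   Verdict = "PASS"
	Review Verdict = "REVIEW"
	Reject Verdict = "REJECT"
)

// handleVerdict returns whether the content may be shown and whether its
// candidate words should be queued for manual verification.
func handleVerdict(v Verdict) (allow bool, queueCandidates bool) {
	switch v {
	case Pass:
		return true, true // safe, but still record words for later review
	case Review:
		return true, true // allow temporarily, verify manually
	case Reject:
		return false, true // block, and verify the words that triggered it
	default:
		return false, false // unknown status: fail closed, nothing to queue
	}
}

func main() {
	allow, queue := handleVerdict(Reject)
	fmt.Println(allow, queue) // false true
}
```

Note that all three statuses queue candidate words: even a PASS feeds the loop, which is what lets the local dictionary grow from real traffic.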
CREATE TABLE `api_sensitive_candidates` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `keyword` VARCHAR(255) NOT NULL COMMENT 'Candidate sensitive word',
  `vendor` VARCHAR(64) DEFAULT NULL COMMENT 'Machine-audit vendor',
  `risk_level` TINYINT DEFAULT 1 COMMENT '1 low, 2 medium, 3 high',
  `status` ENUM('PENDING','CONFIRMED','REJECTED') DEFAULT 'PENDING' COMMENT 'Status',
  `created_at` TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Machine-audit candidate sensitive words';

Risk Scoring for Selective Machine Audit
Content is scored on several dimensions:
Obfuscation (emoji, spacing, pinyin, mixed characters) – +1 to +2.
External links or contact info – +2.
Similarity to blacklist terms (edit distance, phonetics) – +1 to +2.
Context weight (comment vs. nickname vs. private message, time of day) – +1 to +3.
Account/device signals (new account, frequent blocks, high‑frequency posting, same IP across accounts) – +1 to +2.
Template or bulk‑post patterns – +1.
Scores below T1 are passed directly, between T1 and T2 are sent for machine audit, and above T2 are blocked or routed to manual review.
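The rules above can be sketched as a simple additive scorer; the field names, weights and the concrete T1/T2 values (3 and 6 here) are illustrative assumptions, since the article leaves the thresholds unspecified.

```go
package main

import "fmt"

// Signals holds the per-dimension risk scores described above; the field
// names and ranges are taken from the list, the threshold values are assumed.
type Signals struct {
	Obfuscation int // emoji, spacing, pinyin, mixed characters: +1..+2
	ExternalRef int // external links or contact info: +2
	Similarity  int // edit distance / phonetics vs. blacklist: +1..+2
	Context     int // comment vs. nickname vs. DM, time of day: +1..+3
	Account     int // new account, prior blocks, posting rate, shared IP: +1..+2
	Template    int // template or bulk-post patterns: +1
}

// Illustrative thresholds; tune against real traffic.
const (
	T1 = 3 // below T1: pass directly
	T2 = 6 // above T2: block or route to manual review
)

type Decision string

const (
	PassLocal    Decision = "PASS"
	MachineAudit Decision = "MACHINE_AUDIT"
	BlockOrHuman Decision = "BLOCK_OR_MANUAL"
)

// Decide sums the signals and maps the total onto the T1/T2 bands.
func Decide(s Signals) Decision {
	score := s.Obfuscation + s.ExternalRef + s.Similarity + s.Context + s.Account + s.Template
	switch {
	case score < T1:
		return PassLocal
	case score <= T2:
		return MachineAudit
	default:
		return BlockOrHuman
	}
}

func main() {
	fmt.Println(Decide(Signals{Obfuscation: 1, Context: 1}))                                // PASS
	fmt.Println(Decide(Signals{Obfuscation: 2, ExternalRef: 2, Context: 1}))                // MACHINE_AUDIT
	fmt.Println(Decide(Signals{Obfuscation: 2, ExternalRef: 2, Similarity: 2, Account: 2})) // BLOCK_OR_MANUAL
}
```

The middle band is the cost lever: only content that is neither clearly safe nor clearly violating spends money on a vendor call.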
Results
Machine‑audit calls reduced by >70%.
False‑positive rate dropped below 1%.
Customer‑service workload decreased by ~50%.
The system continuously self‑evolves as candidate words are fed back into the blacklist/whitelist.
Takeaways
The solution keeps the vast majority of traffic in‑memory for millisecond latency, saves cost by avoiding unnecessary third‑party calls, and improves accuracy through a closed feedback loop that turns machine‑audit results into local rule updates.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.