Building a High‑Performance Content Moderation System with Trie, Aho‑Corasick, Redis, and Go
This article details how to design and implement a scalable, low‑cost content moderation pipeline that combines a local Trie + Aho‑Corasick engine, Redis‑based hot‑updates, MySQL persistence, and third‑party machine‑review fallback to achieve millisecond‑level response, high accuracy, and controllable costs.
The Early-Stage Review Mechanism
When I joined a new social project, the "Moments" feature was extremely slow: a post could take several minutes to appear, which is fatal for user engagement. The root cause was the review mechanism: only two customer service agents handled all reviews (avatars, nicknames, moments), and they were off duty at night, precisely when user activity peaked.
We introduced a shift‑based schedule with two agents per shift, covering morning‑to‑afternoon and afternoon‑to‑midnight, allowing remote reviews. This reduced latency dramatically.
New Problems After User Surge
Massive user growth caused review pressure to explode. Adding more agents still couldn't keep up, and relying solely on third‑party machine review was costly and error‑prone.
“The cost is too high, find a way to reduce it!”
We needed a technical solution to lower machine‑review calls while preserving user experience and detection accuracy.
Pain Points and Core Goals
Local review capacity insufficient
Machine‑review cost too high
Mis‑detections increase complaints and require double handling
Management demands cost reduction without hurting experience
We defined the following objectives:
Reduce machine‑review calls by intercepting obvious cases locally.
Guarantee sub‑second user experience for moments/comments.
Minimize complaint volume from false positives.
Enable real‑time rule updates via Redis Pub/Sub.
Keep third‑party review as a fallback, not the primary path.
Step 1: Build a Local Blacklist System
We created a MySQL table api_sensitive_words to store high‑confidence sensitive words (blacklist, whitelist, normal) with fields for type, category, source, status, hit count, etc. Indexes on keyword support fast Trie construction.
```sql
CREATE TABLE `api_sensitive_words` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'Auto-increment ID',
  `keyword` VARCHAR(255) NOT NULL COMMENT 'Sensitive word',
  `type` ENUM('BLACK','WHITE','NORMAL') DEFAULT 'NORMAL' COMMENT 'Type: blacklist/whitelist/normal',
  `category` ENUM('PORN','POLITICS','TERROR','AD','INSULT','OTHER') DEFAULT 'OTHER' COMMENT 'Category',
  `source` ENUM('HUMAN','VENDOR','AUDIT') DEFAULT 'HUMAN' COMMENT 'Source',
  `status` TINYINT(1) DEFAULT 1 COMMENT 'Status: 1 enabled, 0 disabled',
  `hit_count` BIGINT DEFAULT 0 COMMENT 'Hit count',
  `updated_by` VARCHAR(64) DEFAULT NULL COMMENT 'Last operator',
  `updated_at` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Last updated',
  PRIMARY KEY (`id`),
  KEY `idx_keyword` (`keyword`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='Sensitive words table';
```

Key benefits:
Admins can flexibly maintain the word list.
Local Trie matches avoid any external request.
Future extensions for whitelist, hot‑updates, and mis‑detection handling.
Step 2: High‑Performance Matching with Trie + Aho‑Corasick
We implemented a Go library that builds a Trie from the word list and augments it with Aho‑Corasick failure links, so matching runs in time linear in the text length, independent of dictionary size.
```go
type TrieNode struct {
	children map[rune]*TrieNode
	fail     *TrieNode
	isEnd    bool
	word     string
}

type ACTrie struct {
	root *TrieNode
	mu   sync.RWMutex
}

// Build constructs the Trie and wires the Aho-Corasick failure links.
func (ac *ACTrie) Build(words []string) { /* build Trie and failure links */ }

// Match returns every dictionary word found in text.
func (ac *ACTrie) Match(text string) []string { /* return all matched keywords */ }
```

Combined with a normalization pass (stripping separators, mapping homophones and emoji variants to canonical characters), this approach also handles common obfuscations efficiently.
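For concreteness, here is a minimal runnable sketch of the Build/Match pair above. Lowercasing stands in for the full normalization pass; the production version would also map homophones and emoji variants before matching:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// TrieNode: children per rune, a failure link, and the matched word.
type TrieNode struct {
	children map[rune]*TrieNode
	fail     *TrieNode
	isEnd    bool
	word     string
}

type ACTrie struct {
	root *TrieNode
	mu   sync.RWMutex
}

func newNode() *TrieNode { return &TrieNode{children: map[rune]*TrieNode{}} }

// Build inserts every word, then wires failure links with a BFS.
func (ac *ACTrie) Build(words []string) {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	root := newNode()
	for _, w := range words {
		node := root
		for _, r := range w {
			next, ok := node.children[r]
			if !ok {
				next = newNode()
				node.children[r] = next
			}
			node = next
		}
		node.isEnd = true
		node.word = w
	}
	// BFS: depth-1 nodes fail back to the root, deeper nodes follow
	// their parent's failure chain.
	queue := []*TrieNode{}
	for _, child := range root.children {
		child.fail = root
		queue = append(queue, child)
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for r, child := range cur.children {
			f := cur.fail
			for f != nil && f.children[r] == nil {
				f = f.fail
			}
			if f == nil {
				child.fail = root
			} else {
				child.fail = f.children[r]
			}
			queue = append(queue, child)
		}
	}
	ac.root = root
}

// Match scans the text once and returns every dictionary word it contains.
func (ac *ACTrie) Match(text string) []string {
	ac.mu.RLock()
	defer ac.mu.RUnlock()
	var hits []string
	node := ac.root
	for _, r := range strings.ToLower(text) {
		for node != ac.root && node.children[r] == nil {
			node = node.fail
		}
		if next := node.children[r]; next != nil {
			node = next
		}
		// Walk the failure chain to collect shorter suffix matches too.
		for t := node; t != ac.root; t = t.fail {
			if t.isEnd {
				hits = append(hits, t.word)
			}
		}
	}
	return hits
}

func main() {
	ac := &ACTrie{}
	ac.Build([]string{"spam", "scam", "free money"})
	fmt.Println(ac.Match("Get FREE money now, no scam!")) // [free money scam]
}
```

The `sync.RWMutex` lets many request goroutines read the Trie concurrently while a hot update swaps in a freshly built root under the write lock.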
Step 3: Architecture Design
The overall flow:
```
┌───────────────────────┐
│  api_sensitive_words  │  MySQL persistent store: all words
└───────────┬───────────┘
            │  backend admin adds/updates words
┌───────────▼───────────┐
│      Redis cache      │  latest word list
└───────────┬───────────┘
            │  Redis Pub/Sub (sensitive:update)
┌───────────▼───────────┐
│    Go Trie engine     │  in-memory Trie + AC, real-time matching
└───────────┬───────────┘
            │
   user request → local match
            ├─ hit  → block/flag
            └─ miss → optional third-party audit
```

Advantages:
99% of requests are answered by in‑memory matching (milliseconds).
Only suspicious or unmatched content triggers third‑party audit, drastically cutting cost.
Hot‑updates via Redis ensure new rules take effect within seconds.
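The decision flow above can be sketched end to end. `vendorAudit` is a hypothetical stand-in for the real third-party client, and a plain substring scan stands in for the Trie/AC engine:

```go
package main

import (
	"fmt"
	"strings"
)

// Decision mirrors the flow: a local hit blocks immediately,
// a miss optionally escalates to the third-party audit.
type Decision string

const (
	Block    Decision = "BLOCK"
	Pass     Decision = "PASS"
	Escalate Decision = "ESCALATE"
)

// vendorAudit stands in for the real third-party client (hypothetical).
type vendorAudit func(text string) Decision

// Moderate runs the local blacklist first and only falls back to the
// vendor when nothing matches locally.
func Moderate(blacklist []string, audit vendorAudit, text string) Decision {
	lower := strings.ToLower(text)
	for _, w := range blacklist {
		if strings.Contains(lower, w) { // stand-in for the Trie/AC match
			return Block
		}
	}
	if audit != nil {
		return audit(text)
	}
	return Pass
}

func main() {
	bl := []string{"scam"}
	audit := func(string) Decision { return Escalate }
	fmt.Println(Moderate(bl, nil, "obvious scam link"))  // BLOCK (local hit)
	fmt.Println(Moderate(bl, audit, "looks fine to me")) // ESCALATE (vendor)
}
```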
Step 4: Machine‑Review Fallback and Feedback Loop
When local matching misses, we call a vendor audit service (e.g., Shumei, Tianwang). The vendor returns three statuses:
| Status | Meaning | Our Action |
| --- | --- | --- |
| PASS | Content safe | Allow, but record candidate words for later review. |
| REVIEW | Potential risk | Store in api_sensitive_candidates for manual verification. |
| REJECT | Definite violation | Block immediately and log the candidate. |
Candidate words are stored in api_sensitive_candidates with fields for vendor, risk level, and status (PENDING, CONFIRMED, REJECTED). After manual review:
Confirmed violations are added to the blacklist (type = BLACK) and the Trie is hot‑updated.
Confirmed false positives are added to the whitelist (type = WHITE) to prevent future blocks.
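The status mapping above can be sketched as a small dispatcher. The action names are illustrative; the real service would enqueue DB writes and moderation events rather than return strings:

```go
package main

import "fmt"

// handleVendorResult maps the three vendor statuses to the actions in the
// table. Unknown statuses fail safe to manual review.
func handleVendorResult(status string) string {
	switch status {
	case "PASS":
		return "allow+record_candidates" // safe, but keep words for later review
	case "REVIEW":
		return "queue_manual" // write to api_sensitive_candidates (PENDING)
	case "REJECT":
		return "block+record" // block immediately and log the candidate
	default:
		return "queue_manual" // fail safe on anything unexpected
	}
}

func main() {
	fmt.Println(handleVendorResult("REJECT")) // block+record
}
```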
Step 5: Intelligent Word‑Library Evolution
We automate the evolution cycle:
Machine‑review → candidate table.
Human verification → blacklist/whitelist.
Redis publish → Go service rebuilds Trie instantly.
This creates a self‑learning system where each audit improves future detection.
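The hot-update step can be sketched as follows. A plain Go channel stands in for the stream a Redis client (e.g. go-redis's `Subscribe(ctx, "sensitive:update").Channel()`) would deliver, and the payload format (a JSON array of keywords) is an assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Engine holds the active word list behind a lock, as the Trie would be.
type Engine struct {
	mu    sync.RWMutex
	words []string
}

// Rebuild atomically swaps in the fresh list; the real service would
// call ac.Build(words) here instead.
func (e *Engine) Rebuild(words []string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.words = words
}

func (e *Engine) Words() []string {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.words
}

// listen consumes update payloads and rebuilds on each one; with Redis
// the messages would come from the Pub/Sub subscription instead.
func listen(e *Engine, updates <-chan string, done chan<- struct{}) {
	for payload := range updates {
		var ws []string
		if err := json.Unmarshal([]byte(payload), &ws); err == nil {
			e.Rebuild(ws)
		}
	}
	close(done)
}

func main() {
	e := &Engine{}
	updates := make(chan string, 1)
	done := make(chan struct{})
	go listen(e, updates, done)
	updates <- `["scam","free money"]` // what the admin backend would PUBLISH
	close(updates)
	<-done
	fmt.Println(e.Words()) // [scam free money]
}
```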
Risk Scoring for Selective Machine Review
Not every unmatched request goes to the vendor. We compute a risk score based on:
Obfuscation patterns (spaces, emojis, phonetic variants): +1 to +2.
External links or contact info: +2.
Similarity to blacklist words: +1 to +2.
Context weight (nickname, private message, time of day): +1 to +3.
Account/device signals (new account, rapid posting, IP sharing): +1 to +2.
Template or bulk-post signatures: +1.
Thresholds T1/T2 decide:
Score < T1 → direct pass.
T1 ≤ Score < T2 → send to vendor for fallback.
Score ≥ T2 → block or route to manual review.
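The scoring and routing can be sketched as below. The signal weights and the example thresholds (T1 = 3, T2 = 6) are illustrative placeholders, not tuned production values:

```go
package main

import "fmt"

// Signals are simplified stand-ins for the features listed above.
type Signals struct {
	Obfuscated   bool // spaces / emoji / phonetic tricks
	HasContact   bool // external links or contact info
	NearBlack    bool // close to a blacklist word
	RiskyContext int  // 0..3: nickname, private message, late night
	NewAccount   bool // account/device signals
	BulkPattern  bool // template or bulk-post signature
}

// score sums the weighted signals into a single risk value.
func score(s Signals) int {
	n := 0
	if s.Obfuscated {
		n += 2
	}
	if s.HasContact {
		n += 2
	}
	if s.NearBlack {
		n += 2
	}
	n += s.RiskyContext
	if s.NewAccount {
		n++
	}
	if s.BulkPattern {
		n++
	}
	return n
}

// route applies the T1/T2 thresholds from the article.
func route(n, t1, t2 int) string {
	switch {
	case n < t1:
		return "pass"
	case n < t2:
		return "vendor"
	default:
		return "manual"
	}
}

func main() {
	s := Signals{HasContact: true, NewAccount: true, RiskyContext: 1}
	n := score(s)
	fmt.Println(n, route(n, 3, 6)) // 4 vendor
}
```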
Additional safeguards include random human sampling and user‑report channels, feeding results back into the model.
Results and Benefits
Machine‑review calls reduced by >70%.
False‑positive rate dropped below 1%.
Customer‑service workload cut by ~50%.
System scales to high traffic with sub‑second latency.
Conclusion
The final architecture combines a fast in‑memory Trie + Aho‑Corasick engine, Redis‑driven hot updates, and a controlled machine‑review fallback with a feedback loop, delivering a cost‑effective, high‑performance content safety solution suitable for any large‑scale social platform.