How Baidu Scales Sensitive Word Detection to Tens of Millions with a Trie‑Based Service
This article explains the design and evolution of Baidu's word‑list service for content moderation, covering its background, multi‑layer architecture, management platform, strategy loading, matching workflow, performance optimizations for large texts, and future enhancements such as special‑character support and per‑business‑line deployment.
Background
The content‑moderation platform must detect sensitive words in articles. Business lines differ in the word types they need, the matching modes (contain, strong‑filter, multi‑mode), the action taken on a hit (review or reject), and list sizes, which range from a few thousand to tens of millions of entries. The system must return results with sub‑second latency.
Overall Architecture
Word‑list Management: Business lines maintain their own word lists via a management platform. The platform stores lists in Elasticsearch for tokenized search and periodically generates a BOS file for each business line.
Service Layer: Exposes a unified matching API, handling authentication, rate limiting, result post‑processing, and routing requests to the appropriate strategy operators.
Strategy Operator Layer: Implements the matching logic (contain, strong‑filter, multi‑mode) and loads word lists into memory either through full refresh or real‑time incremental sync.
Infrastructure Services: Built on the GDP framework, deployed with Pandora; uses MySQL for metadata storage, Elasticsearch for search, BDRP for rate limiting and caching, and BOS for file transfer.
Word‑list Management Platform
The platform lets each business line create multiple word‑list groups, each configurable with attributes such as review type, sensitivity category, matching mode, effective position, exemption words, extension strategies, and expiration time. Supported operations include adding, editing, searching, batch import/export via Excel, and bulk updates.
Create a new word list and assign it to one or more business lines.
Copy existing lists across business lines.
Search word lists by ID, name, business line, or creation time using Elasticsearch tokenized search.
Batch add up to 3,000 entries at once or batch create up to 30,000 entries via Excel.
Unified Service Entry
The API gateway validates requests, enforces rate limits, forwards traffic to the appropriate cluster based on business line, and returns matched sensitive words together with their attributes.
Strategy Loading of Word Lists
Version 1: A single BOS file shared by all business lines, refreshed every 30 minutes – slow due to large file size.
Version 2: Separate BOS files per business line, loaded in parallel – refresh time reduced to ~5 minutes.
Version 3: Real‑time incremental sync every 10 seconds, used when the increment is at most about 10,000 entries; larger increments fall back to the 5‑minute full‑load schedule.
The production system combines versions 2 and 3 to achieve both fast full loads and low‑latency incremental updates.
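The choice between the two loading paths can be sketched as a simple threshold check. This is a minimal illustration, not the service's actual code: the function and class names are hypothetical, and only the three numbers (10 seconds, 5 minutes, about 10,000 entries) come from the article.

```python
from dataclasses import dataclass

# Values quoted in the article; the constant names are hypothetical.
INCREMENTAL_INTERVAL_S = 10
FULL_REFRESH_INTERVAL_S = 5 * 60
MAX_INCREMENTAL_ENTRIES = 10_000


@dataclass
class SyncPlan:
    mode: str        # "incremental" or "full"
    interval_s: int  # how often this loading path runs


def choose_sync_plan(pending_changes: int) -> SyncPlan:
    """Pick the loading path for a batch of pending word-list changes."""
    if pending_changes <= MAX_INCREMENTAL_ENTRIES:
        # Small batch: apply in place on the 10-second incremental sync.
        return SyncPlan("incremental", INCREMENTAL_INTERVAL_S)
    # Large batch: wait for the next per-business-line full load.
    return SyncPlan("full", FULL_REFRESH_INTERVAL_S)
```

For example, `choose_sync_plan(500)` selects the incremental path, while a bulk import of 50,000 entries would be picked up by the next full load.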
Word‑list File Format
BOS files are tab‑separated. Multi‑word entries are joined with '&'. Columns include:
Word ID
Word text
List ID
Multi‑word spacing
Expiration time
Review type (review / reject)
Match type (contain / filter / multi‑mode)
Business line
Effective position (title, body, etc.)
Sensitivity category
Extension strategy (case‑insensitive, order‑swap, etc.)
Exemption words
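A parser for one line of such a file might look like the sketch below. It assumes the twelve columns appear in exactly the order listed above and keeps every value as a raw string, because the article does not specify the encodings of individual columns. The class name, field names, and the sample values in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

EXPECTED_COLUMNS = 12  # assumption: one field per column listed in the article


@dataclass
class WordEntry:
    word_id: str
    terms: List[str]      # word text, split on '&' for multi-word entries
    list_id: str
    spacing: str          # multi-word spacing, format not specified
    expires_at: str
    review_type: str      # review / reject
    match_type: str       # contain / filter / multi-mode
    business_line: str
    position: str         # effective position: title, body, etc.
    category: str
    extension: str
    exemptions: str


def parse_line(line: str) -> WordEntry:
    """Parse one tab-separated word-list line into a WordEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}")
    word_id, text = fields[0], fields[1]
    # Multi-word entries are joined with '&' in the word-text column.
    return WordEntry(word_id, text.split("&"), *fields[2:])
```

Given a line whose word-text column is `foo&bar`, `parse_line` returns an entry with `terms == ["foo", "bar"]`; a single-word entry yields a one-element list.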
Matching Process
6.1 Matching Workflow
1. The request carries identifiers (request_id, token, service_line) and the text to be matched.
2. For strong‑filter matching, the system extracts all possible combinations of Chinese characters, letters, numbers, and symbols.
3. The appropriate Trie (selected by business line and effective position) matches individual words, returning the word, its position, and length.
4. Using the match data, the system looks up word IDs and attribute rules from caches. It determines whether the hit is a direct contain/filter match or requires multi‑word validation, then returns the final result.
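Steps 3 and 4 can be illustrated with a small dictionary-based Trie. This is a plain-Python sketch for readability only; it is not the dictmatch library and does not reflect its API or memory layout. The multi-word check shown here only verifies that every term of an entry was hit, which is an assumption; the real service also applies attribute rules such as spacing and exemption words.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, int]  # (word, position, length)
_END = "$end"  # terminal marker; longer than one char, so it never collides with a text character


class Trie:
    def __init__(self) -> None:
        self.root: Dict = {}

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = word

    def find_all(self, text: str) -> List[Hit]:
        """Return every dictionary word found in text, with position and length."""
        hits: List[Hit] = []
        for start in range(len(text)):
            node = self.root
            for i in range(start, len(text)):
                node = node.get(text[i])
                if node is None:
                    break
                if _END in node:
                    hits.append((node[_END], start, i - start + 1))
        return hits


def multi_word_hit(terms: Iterable[str], hits: Iterable[Hit]) -> bool:
    """Simplified multi-word validation: every term of the entry must be present."""
    found = {word for word, _, _ in hits}
    return all(term in found for term in terms)
```

With the words `ab` and `bc` inserted, `find_all("zabcx")` reports `ab` at position 1 and `bc` at position 2, so a multi-word entry made of those two terms would validate, while an entry that also required a third, absent term would not.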
6.2 Large‑Text Timeout Solution
When processing articles with hundreds of thousands of characters, matching the whole body can take seconds. The optimization splits the body into chunks of ≤5,000 characters and matches them in parallel, reducing latency from ~20 seconds to ~50 milliseconds.
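The chunk-and-match approach can be sketched as follows. The `matcher` argument is a hypothetical stand-in for the Trie lookup and returns (word, position, length) tuples. The `overlap` parameter is an addition not described in the article: with plain splitting, a word that straddles a chunk boundary would be missed, and overlapping adjacent chunks by the longest word length minus one avoids that. Threads are used here only to show the structure; a pure-Python matcher would not actually run in parallel under CPython's global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Hit = Tuple[str, int, int]  # (word, position, length)
CHUNK_SIZE = 5_000          # chunk limit quoted in the article


def split_chunks(text: str, size: int = CHUNK_SIZE, overlap: int = 0) -> List[Tuple[int, str]]:
    """Return (offset, chunk) pairs; every chunk has at most `size` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [(off, text[off:off + size]) for off in range(0, len(text), step)]


def match_large_text(
    text: str,
    matcher: Callable[[str], List[Hit]],
    size: int = CHUNK_SIZE,
    overlap: int = 0,
    workers: int = 8,
) -> List[Hit]:
    """Match chunks concurrently and merge hits back into whole-text positions."""
    def run(item: Tuple[int, str]) -> List[Hit]:
        offset, chunk = item
        # Shift chunk-local positions back to positions in the full text.
        return [(word, offset + pos, length) for word, pos, length in matcher(chunk)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(run, split_chunks(text, size, overlap)))
    # Overlapping chunks can report the same hit twice, so deduplicate.
    merged = {hit for hits in per_chunk for hit in hits}
    return sorted(merged, key=lambda h: (h[1], h[2]))
```

Setting `overlap` to zero reproduces the plain splitting described above; a nonzero overlap trades a little duplicated work for correct handling of boundary-spanning words.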
6.3 Trie Implementation
The Trie is built with Baidu’s open‑source C++ library dictmatch. The library stores the tree in two tables to reduce memory consumption. Construction is slower, but lookup is extremely fast, enabling sub‑10 ms response times for tens of millions of words.
Development & Future Work
Current limitations include a lack of support for emojis and other special characters; future releases will extend the matching engine to handle these symbols. The service currently runs as a single shared instance for more than 60 business lines, which consumes a large amount of memory and creates a single point of failure. Plans are under way to automate per‑business‑line cluster deployment to improve isolation and scalability.
