How to Build an Efficient Text Content Moderation System

This article details the design and implementation of a high‑performance text content moderation system, covering the end‑to‑end workflow, the core Aho‑Corasick multi‑pattern matching algorithm, its double‑array Trie optimization, memory and speed benchmarks, and practical deployment considerations for large‑scale news client platforms.

Sohu Tech Products

As the volume of online text grows exponentially, a reliable moderation system is essential for keeping a platform healthy and safe. The article outlines a five-stage pipeline: (1) pre-filtering to discard prohibited posts, (2) whitelist/blacklist verification, (3) regex-based noise removal, (4) core machine review using an Aho-Corasick (AC) automaton, and (5) manual review for suspicious cases. Processing stops as soon as any stage reaches a pass or block decision.
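The staged flow with early termination can be pictured with a small sketch. It is only an illustration under assumptions: the stage contents (banned authors, author-based lists, the noise regex, the keyword levels) and every name in it are hypothetical, and a plain substring scan stands in for the AC automaton.

```python
import re
from enum import Enum


class Decision(Enum):
    PASS = "pass"
    BLOCK = "block"
    REVIEW = "manual_review"


# Hypothetical data for each stage; the article does not list the real rules.
BANNED_AUTHORS = {"spam_bot_01"}
WHITELIST = {"trusted_editor"}
BLACKLIST = {"known_abuser"}
NOISE = re.compile(r"[\s*_.\-]+")          # separators used to disguise keywords
SENSITIVE = {"badword": "high", "dubious": "low"}


def moderate(author: str, text: str) -> Decision:
    # Stage 1: pre-filter, discard posts that are prohibited outright.
    if author in BANNED_AUTHORS:
        return Decision.BLOCK
    # Stage 2: whitelist / blacklist verification.
    if author in WHITELIST:
        return Decision.PASS
    if author in BLACKLIST:
        return Decision.BLOCK
    # Stage 3: regex-based noise removal.
    cleaned = NOISE.sub("", text.lower())
    # Stage 4: machine review (substring scan as a stand-in for the AC automaton).
    levels = {level for word, level in SENSITIVE.items() if word in cleaned}
    if "high" in levels:
        return Decision.BLOCK
    # Stage 5: anything merely suspicious goes to manual review.
    if levels:
        return Decision.REVIEW
    return Decision.PASS
```

Each `return` is an early exit, so a whitelisted author never reaches the keyword scan, and only ambiguous hits are handed to human reviewers.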

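The core algorithmic component is multi-pattern matching. Single-pattern algorithms such as brute force (BF) or Knuth-Morris-Pratt (KMP) search for one keyword at a time, so their total cost grows with the size of the keyword library; the system therefore adopts multi-pattern matching to handle millions of sensitive keywords. The AC automaton, introduced by Aho and Corasick (1975), builds a Trie over all patterns and adds failure links, so the target string is scanned in a single pass. Matching takes O(n + z) time, where n is the length of the text and z is the number of matches reported, independent of the number of patterns.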

Although the classic AC automaton is fast, it consumes a lot of memory and has low node utilization once the pattern set reaches hundreds of thousands of entries, since most of the child slots reserved per Trie node typically stay empty. To address this, the authors implement the AC automaton on top of a double-array Trie. Converting the ordinary Trie into two parallel arrays (base and check) dramatically reduces memory usage while preserving linear-time matching.

The construction process consists of three steps: (1) building the Trie, (2) creating failure pointers via breadth‑first search, and (3) compressing the Trie into a double‑array structure. The article walks through an example with patterns he, she, hers, and his, showing node insertion, shared prefixes, and failure‑link assignment in detail.
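The first two construction steps can be sketched for the same four patterns. This is a minimal dictionary-based version written for illustration, not the authors' code; the function names and the list-of-dicts layout are assumptions, and the double-array compression step is left out.

```python
from collections import deque


def build_automaton(patterns):
    # Step 1: build the Trie. goto[s] maps a character to the next state,
    # out[s] lists the patterns recognised when state s is reached.
    goto, out = [{}], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Step 2: failure links via breadth-first search.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target (e.g. "he" inside "she").
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def search(text, automaton):
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]            # fall back instead of rescanning the text
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, i, pattern))
    return hits


automaton = build_automaton(["he", "she", "hers", "his"])
print(search("ushers", automaton))
# [(1, 3, 'she'), (2, 3, 'he'), (2, 5, 'hers')]
```

The shared prefix "he" of "he" and "hers" is stored once, and the failure link from the end of "she" to the end of "he" is what lets both patterns be reported at the same text position.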

During matching, the system traverses the target string once. For each character it computes the candidate next node index as next = base[current] + char + 1, where char is the character's numeric code, and accepts the transition only if the check array confirms that the slot was allocated for a transition out of the current node. When a node represents the end of a pattern, the storeEmits method records the hit, including start index, end index, and associated metadata (e.g., sensitivity level).
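A toy version of the base/check layout makes the transition formula concrete. It is a sketch under stated assumptions: the first-fit placement, the helper names, and the convention that check[next] stores the index of the parent node are all choices made here for illustration (real implementations may store a different value in check), and failure links are omitted to keep it short.

```python
from collections import deque


def build_double_array(patterns):
    # Plain Trie first: children[node] maps a character code to a child node.
    children, terminal = [{}], [False]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            code = ord(ch)
            if code not in children[node]:
                children.append({})
                terminal.append(False)
                children[node][code] = len(children) - 1
            node = children[node][code]
        terminal[node] = True

    # Slot 0 is the root; check == -1 marks a free slot.
    base, check, is_end = [0], [0], [False]

    def ensure(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)
            is_end.append(False)

    queue = deque([(0, 0)])                # (trie node, slot in the arrays)
    while queue:
        node, slot = queue.popleft()
        is_end[slot] = terminal[node]
        codes = sorted(children[node])
        if not codes:
            continue
        b = 0
        while True:                        # first fit: smallest base with all targets free
            ensure(b + codes[-1] + 2)
            if all(check[b + c + 1] == -1 for c in codes):
                break
            b += 1
        base[slot] = b
        for c in codes:
            check[b + c + 1] = slot        # the slot now belongs to this parent
            queue.append((children[node][c], b + c + 1))
    return base, check, is_end


def transition(base, check, current, ch):
    nxt = base[current] + ord(ch) + 1      # next = base[current] + char + 1
    if nxt < len(check) and check[nxt] == current:
        return nxt
    return -1                              # no such edge


def contains(double_array, word):
    base, check, is_end = double_array
    state = 0
    for ch in word:
        state = transition(base, check, state, ch)
        if state < 0:
            return False
    return is_end[state]
```

Siblings of different parents interleave inside the same two arrays, which is where the memory saving comes from; the check comparison is what keeps one node from following an edge that belongs to another.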

After raw matches are collected, a two-step filter separates single-keyword hits from composite (union) keywords. Single-keyword hits are added directly to the result set, while hits that belong to union keywords are first stored in a temporary map and later validated against predefined composition rules, so that one component appearing on its own does not produce a false positive.

Finally, the system outputs a standardized result containing the matched keyword, its level, and its start and end positions.
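The filtering step and the result shape can be sketched together. The rule format is an assumption made for illustration: here a union rule fires only when all of its component words occur in the text, and the names Hit, SINGLE_LEVELS, UNION_RULES, and filter_hits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    keyword: str
    level: str
    start: int
    end: int


# Hypothetical keyword library: standalone keywords and union (composite) rules.
SINGLE_LEVELS = {"scam": "high"}
UNION_RULES = {
    "free+prize": {"parts": {"free", "prize"}, "level": "medium"},
}


def filter_hits(raw_hits):
    """raw_hits is a list of (start, end, keyword) tuples from the matcher."""
    results = []
    pending = {}                           # temporary map: component -> positions
    union_parts = set().union(*(rule["parts"] for rule in UNION_RULES.values()))

    # Step 1: single keywords go straight to the results; components are parked.
    for start, end, word in raw_hits:
        if word in SINGLE_LEVELS:
            results.append(Hit(word, SINGLE_LEVELS[word], start, end))
        elif word in union_parts:
            pending.setdefault(word, []).append((start, end))

    # Step 2: a union rule counts only if every one of its components was seen.
    for rule in UNION_RULES.values():
        if rule["parts"].issubset(pending):
            for part in sorted(rule["parts"]):
                for start, end in pending[part]:
                    results.append(Hit(part, rule["level"], start, end))
    return results
```

With these sample rules, a text containing only "free" yields no result, while "free" and "prize" together yield two medium-level hits carrying their own start and end positions.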

Production metrics show that the system loads and matches against a 200k-keyword library using only about 80 MB of RAM (a reduction of more than 70% versus the classic AC automaton), achieves a P99 request latency of 3 ms, and handles billions of characters per day on a single server, meeting the high-throughput demands of news client applications.

The article also discusses broader applicability: the double‑array AC automaton can be extended to spam filtering, intrusion detection, input‑method suggestions, and other high‑speed string‑processing tasks, while large language models are better suited for semantic understanding and generative tasks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
