Text Anti‑Spam Techniques and TextCNN Model for Real‑Time Spam Detection on the Huajiao Platform
This article introduces the Huajiao platform's text anti‑spam architecture, analyzes spam categories and challenges, compares rule‑based and machine‑learning approaches, details traditional NLP methods and the TextCNN deep‑learning model, provides its TensorFlow implementation, and describes the online deployment workflow.
The article presents a concise overview of Huajiao's text anti‑spam system, focusing on the techniques used to intercept malicious textual content such as advertising, pornographic, political, and counterfeit information.
As user volume grows, manual review becomes insufficient; therefore, a combination of rule‑based filters (keyword matching, regular expressions) and machine‑learning models is required to handle obfuscation tactics like pinyin substitution, synonym replacement, and emoji insertion.
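To make the rule-based layer concrete, here is a minimal sketch of keyword matching plus regular expressions with a normalization step that strips the separator characters spammers insert for obfuscation. The blocklist entries and patterns are illustrative examples, not Huajiao's actual rules.

```python
import re

# Illustrative blocklist; the production system maintains far larger, curated lists.
KEYWORDS = ["加微信", "免费福利"]

def normalize(text):
    """Strip separators spammers insert between characters (spaces, dots, dashes)."""
    return re.sub(r"[\s\.\-_*,，。|·]+", "", text)

def rule_filter(text):
    """Return True if the normalized text hits a keyword or a regex pattern."""
    cleaned = normalize(text)
    if any(kw in cleaned for kw in KEYWORDS):
        return True
    # Example regex: a contact number of 6+ digits shortly after a cue word
    return re.search(r"(微信|QQ)\D{0,3}\d{6,}", cleaned) is not None

print(rule_filter("加 微 信 领福利"))  # True: separator-obfuscated keyword
print(rule_filter("今晚八点开播"))      # False: benign chat
```

Normalization defeats simple separator insertion, but synonym and pinyin substitutions still slip through, which is exactly the gap the machine-learning models cover.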
Spam detection is framed as a binary text classification problem, evaluated with accuracy, precision, recall, and F1‑score. Traditional pipelines involve preprocessing (tokenization, stop‑word removal), feature extraction (Bag‑of‑Words, TF‑IDF, Word2Vec), and classifiers such as LR, SVM, MLP, and GBDT.
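To make the feature-extraction step of this pipeline concrete, here is a minimal pure-Python TF-IDF computation over a toy pre-tokenized two-document corpus. The documents and the smoothing convention (tf × (log((1+N)/(1+df)) + 1)) are illustrative choices, not the production setup.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute smoothed TF-IDF weights: tf * (log((1+N)/(1+df)) + 1)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return weights

# Toy corpus: one spam-like message, one benign message
docs = [["buy", "cheap", "vip", "vip"], ["good", "morning", "everyone"]]
w = tf_idf(docs)
# "vip" occurs twice in doc 0 and nowhere else, so it gets the largest weight there
assert max(w[0], key=w[0].get) == "vip"
```

The resulting sparse vectors would then feed a classifier such as LR or SVM; in practice a library implementation (e.g. scikit-learn's TfidfVectorizer) replaces this hand-rolled version.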
Because Chinese spam often contains non‑standard character substitutions known as "火星文" ("Martian script"), traditional word segmentation fails, prompting the need for models that operate at the character level without any segmentation step.
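Working at the character level only requires a mapping from characters to integer ids; there is no segmenter to break. A minimal sketch, with id 0 reserved for padding (matching the zeroed embedding row in the model below) and id 1 for unknown characters:

```python
def build_char_vocab(texts, pad="<PAD>", unk="<UNK>"):
    """Map each distinct character to an integer id; 0 is reserved for padding."""
    vocab = {pad: 0, unk: 1}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(text, vocab, sequence_length):
    """Encode a string as fixed-length character ids, truncating or zero-padding."""
    ids = [vocab.get(ch, vocab["<UNK>"]) for ch in text[:sequence_length]]
    return ids + [0] * (sequence_length - len(ids))

vocab = build_char_vocab(["加V信", "加Q群"])
print(encode("加V信xx", vocab, 6))  # [2, 3, 4, 1, 1, 0]
```

Unseen characters (here "x") degrade gracefully to the unknown id instead of derailing a tokenizer.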
TextCNN Principle
TextCNN adapts convolutional neural networks for NLP by treating each character as a token, using convolution kernels of varying widths to capture n‑gram features, and applying max‑pooling to retain the most salient signals. This architecture avoids tokenization, captures local order, and runs inference in under 50 ms.
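The mechanics above can be illustrated with a toy NumPy sketch: a width-k kernel slides over the character embeddings to score each character n-gram, and max-pooling over time keeps only the strongest response per filter. Dimensions and random weights are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb = 10, 8                      # 10 characters, 8-dim embeddings
x = rng.normal(size=(seq_len, emb))       # embedded character sequence

def conv_max_pool(x, width, num_filters, rng):
    """One TextCNN branch: width-`width` n-gram convolution, ReLU, max-pool over time."""
    kernels = rng.normal(size=(num_filters, width, x.shape[1]))
    n_windows = x.shape[0] - width + 1
    feats = np.empty((n_windows, num_filters))
    for t in range(n_windows):
        window = x[t:t + width]           # one character n-gram
        feats[t] = np.maximum(kernels.reshape(num_filters, -1) @ window.ravel(), 0)
    return feats.max(axis=0)              # max-pool: strongest n-gram response

# Kernels of widths 2, 3 and 4 capture bigram/trigram/4-gram features;
# their pooled outputs are concatenated, as in the full model.
pooled = np.concatenate([conv_max_pool(x, w, 4, rng) for w in (2, 3, 4)])
print(pooled.shape)  # (12,) — 3 filter widths × 4 filters each
```

The concatenated vector is what the softmax layer classifies; varying the kernel width is what lets the model see local character order without any segmentation.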
Model Structure
The model embeds input characters, applies multiple 1‑D convolutions with different filter sizes, performs max‑pooling, concatenates the pooled vectors, and feeds them to a softmax output layer. The loss is computed with cross‑entropy plus L2 regularization.
```python
# coding:utf-8
import tensorflow as tf
import numpy as np


class TextCNN(object):
    def __init__(self, sequence_length, num_classes, vocab_size,
                 embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        l2_loss = tf.constant(0.0)

        # Embedding layer: row 0 is replaced by zeros so padding ids embed to zero
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.get_variable('lookup_table', dtype=tf.float32,
                                     shape=[vocab_size, embedding_size],
                                     initializer=tf.random_uniform_initializer())
            self.W = tf.concat((tf.zeros(shape=[1, embedding_size]), self.W[1:, :]), 0)
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

        # One convolution + max-pooling branch per filter size
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(self.embedded_chars_expanded, W,
                                    strides=[1, 1, 1, 1], padding="VALID", name="conv")
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                pooled = tf.nn.max_pool(h,
                                        ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                        strides=[1, 1, 1, 1], padding='VALID', name="pool")
                pooled_outputs.append(pooled)

        # Concatenate the pooled features from all filter sizes
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Output layer: unnormalized class scores and predicted class
        with tf.name_scope("output"):
            W = tf.get_variable("W", shape=[num_filters_total, num_classes],
                                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Loss: cross-entropy plus L2 regularization on the output layer
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores,
                                                             labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy over the batch
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"),
                                           name="accuracy")
```
The training results show high accuracy and fast inference, making the model suitable for real‑time spam detection.
Online Deployment Process
The service architecture separates online (millisecond‑level inference) and offline (model retraining) layers. TensorFlow Serving is used to host the model, exposing a gRPC interface for client calls. Clients send requests to the serving endpoint, receive predictions, and the system supports hot‑updates without downtime.
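A hot-updatable deployment of this kind is typically driven by a TensorFlow Serving model configuration file; a minimal sketch is shown below. The model name and base path are illustrative placeholders, not Huajiao's actual paths.

```protobuf
model_config_list {
  config {
    name: "textcnn_antispam"        # model name clients address in gRPC requests
    base_path: "/models/textcnn"    # directory containing numbered SavedModel versions
    model_platform: "tensorflow"
    model_version_policy { latest { num_versions: 1 } }
  }
}
```

With this policy the server watches `base_path` for new numbered version directories; exporting a retrained model into a higher-numbered directory swaps it in without restarting the service, which is the hot-update behavior described above.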
References include seminal works on CNNs for sentence classification, Stanford lecture slides, and official TensorFlow Serving documentation.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.