
Text Anti‑Spam Techniques and TextCNN Model for Real‑Time Spam Detection on the Huajiao Platform

This article introduces the Huajiao platform's text anti‑spam architecture, analyzes spam categories and challenges, compares rule‑based and machine‑learning approaches, details traditional NLP methods and the TextCNN deep‑learning model, provides its TensorFlow implementation, and describes the online deployment workflow.

360 Tech Engineering

The article presents a concise overview of Huajiao's text anti‑spam system, focusing on the techniques used to intercept malicious textual content such as advertising, pornographic, political, and counterfeit information.

As user volume grows, manual review becomes insufficient; therefore, a combination of rule‑based filters (keyword matching, regular expressions) and machine‑learning models is required to handle obfuscation tactics like pinyin substitution, synonym replacement, and emoji insertion.
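A rule-based first line of defense can be sketched as keyword matching plus regular expressions targeting common obfuscations. The keyword list and patterns below are hypothetical illustrations; production rule sets are far larger and maintained operationally.

```python
import re

# Hypothetical rules for illustration only; real keyword lists and
# patterns are much larger and updated continuously.
KEYWORDS = {"加微信", "代开发票"}          # direct keyword hits
PATTERNS = [
    re.compile(r"[vV威薇微]\s*[信芯]"),    # obfuscated variants of "微信" (WeChat)
    re.compile(r"\d{6,}"),                # long digit runs (QQ/phone numbers)
]

def rule_filter(text: str) -> bool:
    """Return True if the text trips any keyword or regex rule."""
    if any(k in text for k in KEYWORDS):
        return True
    return any(p.search(text) for p in PATTERNS)
```

Rules like these catch known patterns cheaply, but spammers adapt quickly, which is why the machine-learning layer described next is needed.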

Spam detection is framed as a binary text classification problem, evaluated with accuracy, precision, recall, and F1-score. Traditional pipelines involve preprocessing (tokenization, stop-word removal), feature extraction (Bag-of-Words, TF-IDF, Word2Vec), and classifiers such as logistic regression (LR), support vector machines (SVM), multilayer perceptrons (MLP), and gradient-boosted decision trees (GBDT).
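As an illustration of the feature-extraction step, character-level TF-IDF can be computed in a few lines. This is a minimal pure-Python sketch, not the production feature pipeline:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute character-level TF-IDF vectors for a small corpus.

    docs: list of strings; returns a list of {char: weight} dicts.
    Characters that appear in every document get weight 0 (idf = log 1).
    """
    n = len(docs)
    # document frequency: number of docs each character appears in
    df = Counter(c for d in docs for c in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        total = sum(tf.values())
        vectors.append({
            c: (cnt / total) * math.log(n / df[c])
            for c, cnt in tf.items()
        })
    return vectors
```

The resulting sparse vectors would then be fed to a classifier such as LR or SVM.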

Because Chinese text often contains non-standard characters ("火星文", so-called "Martian script"), traditional tokenization fails, prompting the need for models that operate at the character level without word segmentation.
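Character-level modeling only requires mapping each distinct character to an integer id, with no segmenter in the loop. A minimal sketch (the reserved id 0 for padding matches the zeroed first row of the embedding table in the model code later; mapping unseen characters to 0 is a simplification, as a real vocabulary would reserve a separate unknown-token id):

```python
def build_vocab(texts, max_len):
    """Map each distinct character to an integer id.

    Id 0 is reserved for padding; unseen characters also fall back to 0
    here as a simplification. Returns the vocab and an encode function
    that pads/truncates every text to max_len ids.
    """
    vocab = {"<pad>": 0}
    for t in texts:
        for ch in t:
            vocab.setdefault(ch, len(vocab))

    def encode(t):
        ids = [vocab.get(ch, 0) for ch in t[:max_len]]
        return ids + [0] * (max_len - len(ids))

    return vocab, encode
```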

TextCNN Principle

TextCNN adapts convolutional neural networks for NLP by treating each character as a token, using convolution kernels of varying widths to capture n‑gram features, and applying max‑pooling to retain the most salient signals. This architecture avoids tokenization, captures local order, and runs inference in under 50 ms.
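The convolution-then-max-pool idea can be illustrated without any deep-learning framework. The toy forward pass below slides each filter (a width-w weight matrix over d-dimensional character embeddings) along the sequence, applies ReLU, and keeps only the maximum activation per filter; the concatenation of these maxima is the feature vector. This is a didactic sketch, not the production model:

```python
def textcnn_forward(embeddings, filters):
    """Toy TextCNN forward pass over one sequence.

    embeddings: list of d-dimensional vectors (one per character).
    filters: list of (w, weight) pairs, weight being a w x d matrix.
    Returns one max-pooled ReLU activation per filter.
    """
    pooled = []
    for w, weight in filters:
        activations = []
        for i in range(len(embeddings) - w + 1):
            window = embeddings[i:i + w]
            # dot product of the filter with the n-gram window
            s = sum(weight[r][c] * window[r][c]
                    for r in range(w) for c in range(len(window[0])))
            activations.append(max(0.0, s))    # ReLU
        pooled.append(max(activations))        # max-pooling over positions
    return pooled
```

Filters of different widths w play the role of different n-gram detectors; max-pooling makes the output length-independent.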

Model Structure

The model embeds input characters, applies multiple 1‑D convolutions with different filter sizes, performs max‑pooling, concatenates the pooled vectors, and feeds them to a softmax output layer. The loss is computed with cross‑entropy plus L2 regularization.

```python
# coding:utf-8
import tensorflow as tf


class TextCNN(object):
    def __init__(self, sequence_length, num_classes, vocab_size,
                 embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        l2_loss = tf.constant(0.0)

        # Embedding layer; row 0 is zeroed so padding ids contribute nothing
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.get_variable('lookup_table', dtype=tf.float32,
                                     shape=[vocab_size, embedding_size],
                                     initializer=tf.random_uniform_initializer())
            self.W = tf.concat((tf.zeros(shape=[1, embedding_size]), self.W[1:, :]), 0)
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

        # One convolution + max-pooling branch per filter size (n-gram width)
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(self.embedded_chars_expanded, W,
                                    strides=[1, 1, 1, 1], padding="VALID", name="conv")
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # Max-pool over all valid positions, keeping one value per filter
                pooled = tf.nn.max_pool(h,
                                        ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                        strides=[1, 1, 1, 1],
                                        padding='VALID', name="pool")
                pooled_outputs.append(pooled)

        # Concatenate pooled features from all filter sizes
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Output layer: class scores and argmax predictions
        with tf.name_scope("output"):
            W = tf.get_variable("W", shape=[num_filters_total, num_classes],
                                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Cross-entropy loss plus L2 regularization on the output layer
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores,
                                                             labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy over the batch
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"),
                                           name="accuracy")
```

The training results show high accuracy and fast inference, making the model suitable for real‑time spam detection.

Online Deployment Process

The service architecture separates online (millisecond‑level inference) and offline (model retraining) layers. TensorFlow Serving is used to host the model, exposing a gRPC interface for client calls. Clients send requests to the serving endpoint, receive predictions, and the system supports hot‑updates without downtime.
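Alongside gRPC, TensorFlow Serving also exposes a REST predict endpoint (`/v1/models/<name>:predict`). The sketch below only constructs the request URL and JSON body; the model name `textcnn` and the localhost address are assumptions for illustration:

```python
import json

def build_predict_request(model_name, char_ids):
    """Build the URL and JSON body for a TensorFlow Serving REST
    predict call. char_ids: a batch of padded character-id sequences."""
    url = "http://localhost:8501/v1/models/%s:predict" % model_name
    body = json.dumps({"instances": char_ids})
    return url, body

# The caller would POST `body` to `url` (e.g. with urllib or requests)
# and read class scores from the "predictions" field of the response.
```

Because Serving watches the model directory for new versions, exporting a retrained model to a new version subdirectory is what enables the hot-update without downtime.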

References include seminal works on CNNs for sentence classification, Stanford lecture slides, and official TensorFlow Serving documentation.

Tags: CNN, Machine Learning, TensorFlow, NLP, Text Classification, Spam Detection, Online Deployment
Written by 360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.