Text Anti‑Spam Techniques and TextCNN Model for Real‑Time Spam Detection on the Huajiao Platform
This article introduces the Huajiao platform's text anti‑spam architecture, analyzes spam categories and challenges, compares rule‑based and machine‑learning approaches, details traditional NLP methods and the TextCNN deep‑learning model, provides its TensorFlow implementation, and describes the online deployment workflow.
The article presents a concise overview of Huajiao's text anti‑spam system, focusing on the techniques used to intercept malicious textual content such as advertising, pornographic, political, and counterfeit information.
As user volume grows, manual review becomes insufficient; therefore, a combination of rule‑based filters (keyword matching, regular expressions) and machine‑learning models is required to handle obfuscation tactics like pinyin substitution, synonym replacement, and emoji insertion.
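To make the rule-based layer concrete, here is a minimal sketch of keyword matching plus regular expressions with a normalization step that strips the separator characters spammers insert for obfuscation. The blocklist entries and patterns are illustrative examples, not Huajiao's actual rules.

```python
import re

# Illustrative blocklist; the production system maintains far larger, curated lists.
KEYWORDS = ["加微信", "免费福利"]

def normalize(text):
    """Strip separators spammers insert between characters (spaces, dots, dashes)."""
    return re.sub(r"[\s\.\-_*,，。|·]+", "", text)

def rule_filter(text):
    """Return True if the normalized text hits a keyword or a regex pattern."""
    cleaned = normalize(text)
    if any(kw in cleaned for kw in KEYWORDS):
        return True
    # Example regex: a contact number of 6+ digits shortly after a cue word
    return re.search(r"(微信|QQ)\D{0,3}\d{6,}", cleaned) is not None

print(rule_filter("加 微 信 领福利"))  # True: separator-obfuscated keyword
print(rule_filter("今晚八点开播"))      # False: benign chat
```

Normalization defeats simple separator insertion, but synonym and pinyin substitutions still slip through, which is exactly the gap the machine-learning models cover.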
Spam detection is framed as a binary text classification problem, evaluated with accuracy, precision, recall, and F1‑score. Traditional pipelines involve preprocessing (tokenization, stop‑word removal), feature extraction (Bag‑of‑Words, TF‑IDF, Word2Vec), and classifiers such as LR, SVM, MLP, and GBDT.
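To make the feature-extraction step of this pipeline concrete, here is a minimal pure-Python TF-IDF computation over a toy pre-tokenized two-document corpus. The documents and the smoothing convention (tf × (log((1+N)/(1+df)) + 1)) are illustrative choices, not the production setup.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute smoothed TF-IDF weights: tf * (log((1+N)/(1+df)) + 1)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return weights

# Toy corpus: one spam-like message, one benign message
docs = [["buy", "cheap", "vip", "vip"], ["good", "morning", "everyone"]]
w = tf_idf(docs)
# "vip" occurs twice in doc 0 and nowhere else, so it gets the largest weight there
assert max(w[0], key=w[0].get) == "vip"
```

The resulting sparse vectors would then feed a classifier such as LR or SVM; in practice a library implementation (e.g. scikit-learn's TfidfVectorizer) replaces this hand-rolled version.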
Because Chinese spam often contains non‑standard character substitutions known as "火星文" ("Martian script"), traditional word segmentation fails, prompting the need for models that operate at the character level without any segmentation step.
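Working at the character level only requires a mapping from characters to integer ids; there is no segmenter to break. A minimal sketch, with id 0 reserved for padding (matching the zeroed embedding row in the model below) and id 1 for unknown characters:

```python
def build_char_vocab(texts, pad="<PAD>", unk="<UNK>"):
    """Map each distinct character to an integer id; 0 is reserved for padding."""
    vocab = {pad: 0, unk: 1}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(text, vocab, sequence_length):
    """Encode a string as fixed-length character ids, truncating or zero-padding."""
    ids = [vocab.get(ch, vocab["<UNK>"]) for ch in text[:sequence_length]]
    return ids + [0] * (sequence_length - len(ids))

vocab = build_char_vocab(["加V信", "加Q群"])
print(encode("加V信xx", vocab, 6))  # [2, 3, 4, 1, 1, 0]
```

Unseen characters (here "x") degrade gracefully to the unknown id instead of derailing a tokenizer.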
TextCNN Principle
TextCNN adapts convolutional neural networks for NLP by treating each character as a token, using convolution kernels of varying widths to capture n‑gram features, and applying max‑pooling to retain the most salient signals. This architecture avoids tokenization, captures local order, and runs inference in under 50 ms.
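The mechanics above can be illustrated with a toy NumPy sketch: a width-k kernel slides over the character embeddings to score each character n-gram, and max-pooling over time keeps only the strongest response per filter. Dimensions and random weights are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb = 10, 8                      # 10 characters, 8-dim embeddings
x = rng.normal(size=(seq_len, emb))       # embedded character sequence

def conv_max_pool(x, width, num_filters, rng):
    """One TextCNN branch: width-`width` n-gram convolution, ReLU, max-pool over time."""
    kernels = rng.normal(size=(num_filters, width, x.shape[1]))
    n_windows = x.shape[0] - width + 1
    feats = np.empty((n_windows, num_filters))
    for t in range(n_windows):
        window = x[t:t + width]           # one character n-gram
        feats[t] = np.maximum(kernels.reshape(num_filters, -1) @ window.ravel(), 0)
    return feats.max(axis=0)              # max-pool: strongest n-gram response

# Kernels of widths 2, 3 and 4 capture bigram/trigram/4-gram features;
# their pooled outputs are concatenated, as in the full model.
pooled = np.concatenate([conv_max_pool(x, w, 4, rng) for w in (2, 3, 4)])
print(pooled.shape)  # (12,) — 3 filter widths × 4 filters each
```

The concatenated vector is what the softmax layer classifies; varying the kernel width is what lets the model see local character order without any segmentation.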
Model Structure
The model embeds input characters, applies multiple 1‑D convolutions with different filter sizes, performs max‑pooling, concatenates the pooled vectors, and feeds them to a softmax output layer. The loss is computed with cross‑entropy plus L2 regularization.
```python
# coding:utf-8
import tensorflow as tf
import numpy as np


class TextCNN(object):
    def __init__(self, sequence_length, num_classes, vocab_size,
                 embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        l2_loss = tf.constant(0.0)

        # Embedding layer: row 0 is replaced by zeros so padding ids embed to zero
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.get_variable('lookup_table', dtype=tf.float32,
                                     shape=[vocab_size, embedding_size],
                                     initializer=tf.random_uniform_initializer())
            self.W = tf.concat((tf.zeros(shape=[1, embedding_size]), self.W[1:, :]), 0)
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

        # One convolution + max-pooling branch per filter size
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(self.embedded_chars_expanded, W,
                                    strides=[1, 1, 1, 1], padding="VALID", name="conv")
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                pooled = tf.nn.max_pool(h,
                                        ksize=[1, sequence_length - filter_size + 1, 1, 1],
                                        strides=[1, 1, 1, 1], padding='VALID', name="pool")
                pooled_outputs.append(pooled)

        # Concatenate the pooled features from all filter sizes
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Output layer: unnormalized class scores and predicted class
        with tf.name_scope("output"):
            W = tf.get_variable("W", shape=[num_filters_total, num_classes],
                                initializer=tf.contrib.layers.xavier_initializer())
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
            self.predictions = tf.argmax(self.scores, 1, name="predictions")

        # Loss: cross-entropy plus L2 regularization on the output layer
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores,
                                                             labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy over the batch
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"),
                                           name="accuracy")
```
The training results show high accuracy and fast inference, making the model suitable for real‑time spam detection.
Online Deployment Process
The service architecture separates online (millisecond‑level inference) and offline (model retraining) layers. TensorFlow Serving is used to host the model, exposing a gRPC interface for client calls. Clients send requests to the serving endpoint, receive predictions, and the system supports hot‑updates without downtime.
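A hot-updatable deployment of this kind is typically driven by a TensorFlow Serving model configuration file; a minimal sketch is shown below. The model name and base path are illustrative placeholders, not Huajiao's actual paths.

```protobuf
model_config_list {
  config {
    name: "textcnn_antispam"        # model name clients address in gRPC requests
    base_path: "/models/textcnn"    # directory containing numbered SavedModel versions
    model_platform: "tensorflow"
    model_version_policy { latest { num_versions: 1 } }
  }
}
```

With this policy the server watches `base_path` for new numbered version directories; exporting a retrained model into a higher-numbered directory swaps it in without restarting the service, which is the hot-update behavior described above.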
References include seminal works on CNNs for sentence classification, Stanford lecture slides, and official TensorFlow Serving documentation.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.