Abusive Comment Detection Using TextCNN: A Strategy + Algorithm Approach
The article proposes a hybrid approach: blacklist-word filtering first, then a character-level TextCNN for comments containing suspicious words. The combined system reaches about 89% precision and 87% recall, showing that a simple convolutional network outperforms both keyword filters and RNNs on short, noisy abusive Chinese text.
This article presents a method for automatically detecting abusive comments in news articles using convolutional neural networks. The author addresses the challenge of moderating toxic user-generated content on online platforms, where traditional keyword-based filtering proves inadequate due to the creative ways users circumvent filters (e.g., character substitution, homophones, partial masking).
Problem Analysis: Keyword-based approaches face a trade-off between precision and recall. Selecting only high-precision keywords gives insufficient coverage, while selecting for high recall produces excessive false positives. For instance, "他*的" (a masked curse) and "麻痹" (literally "paralysis", often used as a homophone curse) can be abusive or innocuous depending on context.
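A toy illustration of the trade-off (the comments and word list are invented): a high-precision blacklist catches the literal curse but misses a masked variant, while adding the ambiguous "麻痹" would flag innocuous text.

```python
# High-precision blacklist: only the unambiguous curse word.
blacklist = ["他妈的"]

comments = {                        # text -> annotated as abusive?
    "他妈的写得真差": True,          # abusive, caught by the blacklist
    "他*的写得真差": True,           # abusive, evades via character masking
    "麻痹大意酿成事故": False,        # innocuous use of "麻痹"
}

# Simple substring matching, as a keyword filter would do.
hits = {text: any(w in text for w in blacklist) for text in comments}
```

Here precision is perfect but recall is only 50%; widening the list to catch the masked variant would start flagging the innocuous comment too.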
Data Preparation: The model uses manually annotated data, with abusive words classified into two categories: (1) Blacklist words - terms that indicate abuse when present (e.g., "二*", "妈*"), and (2) Suspicious words - terms that often but not always indicate abuse (e.g., "垃圾", "*痹"). Classification is based on hit accuracy statistics.
Preprocessing: Three text segmentation approaches were compared: character-level, word-level (using jieba), and pinyin-level. Interestingly, character-level segmentation performed best because word segmentation tools often fail on abusive comments containing misspellings and deliberate character substitutions.
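Character-level segmentation needs no dictionary at all, which is why it is robust here. A minimal sketch (function name is illustrative):

```python
def char_tokenize(text):
    # Character-level segmentation: each character, including masking
    # symbols like "*", becomes its own token. Unlike a dictionary-based
    # segmenter such as jieba, this cannot be derailed by misspellings
    # or deliberate character substitutions.
    return [ch for ch in text if not ch.isspace()]
```

For example, `char_tokenize("你 个垃*圾")` yields `['你', '个', '垃', '*', '圾']`, keeping the masked curse intact as adjacent tokens, whereas a word segmenter would typically produce unpredictable fragments around the "*".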
Model Architecture: The TextCNN model by Yoon Kim was adopted. The network uses embedding layers to convert text to matrices, convolutional layers with filters of various sizes (1/2/3/4/5/6/8) and counts (50/100/150/150/200/150/100), max pooling, and softmax output. Batch normalization was added to prevent gradient vanishing.
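The forward pass of this architecture can be sketched with plain NumPy, using the filter sizes and counts from the article. This is a minimal illustration with random (untrained) weights and assumed vocabulary and embedding sizes; it omits batch normalization and training entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def textcnn_forward(token_ids, vocab_size=5000, embed_dim=64,
                    filter_sizes=(1, 2, 3, 4, 5, 6, 8),
                    filter_counts=(50, 100, 150, 150, 200, 150, 100)):
    """Forward pass of a TextCNN (Kim, 2014) for binary classification.

    Weights are random here; vocab_size and embed_dim are assumptions.
    token_ids must be at least as long as the largest filter size.
    """
    # Embedding lookup: id sequence -> (seq_len, embed_dim) matrix.
    E = rng.normal(0, 0.1, (vocab_size, embed_dim))
    x = E[token_ids]
    pooled = []
    for k, n in zip(filter_sizes, filter_counts):
        W = rng.normal(0, 0.1, (n, k * embed_dim))
        # Convolution as a sliding width-k window over the sequence.
        windows = np.stack([x[i:i + k].ravel()
                            for i in range(len(token_ids) - k + 1)])
        feat = np.maximum(windows @ W.T, 0)   # ReLU activations
        pooled.append(feat.max(axis=0))       # max-over-time pooling -> (n,)
    h = np.concatenate(pooled)                # 50+100+...+100 = 900 features
    W_out = rng.normal(0, 0.1, (2, h.size))
    logits = W_out @ h
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                    # softmax over {clean, abusive}
```

Calling `textcnn_forward(list(range(12)))` returns a two-element probability vector; a trained model would of course learn `E`, the filter weights, and `W_out` from the annotated data.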
Experimental Results: Character-level TextCNN alone achieved 85.32% precision, 86.18% recall, and 0.86 F-score. The final model combines the rule-based strategy with TextCNN: blacklist words trigger an immediate abuse verdict, while comments containing suspicious words are passed to the model. This hybrid achieves 89.03% precision, 86.68% recall, and 0.88 F-score.
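The decision flow of the hybrid system can be sketched as below. The function and word lists are illustrative; `model` stands for any callable (such as the trained TextCNN) returning True for abusive text.

```python
def classify_comment(text, blacklist, suspicious, model):
    """Hybrid strategy: rules first, model only when needed (sketch)."""
    if any(w in text for w in blacklist):
        return "abusive"               # blacklist hit: immediate verdict
    if any(w in text for w in suspicious):
        # Only suspicious comments reach the (more expensive) TextCNN.
        return "abusive" if model(text) else "non-abusive"
    return "non-abusive"               # no trigger words: skip the model
```

This ordering is what lifts precision above the standalone model: unambiguous cases never reach the classifier, and the model only arbitrates genuinely ambiguous comments.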
Key Insights: CNN outperforms RNN for this task because comments are short texts where long-term memory is less important. TextCNN outperforms Char-CNN due to its simpler structure and reduced overfitting risk with limited training data.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.