Multimodal Advertisement Detection System for WeChat "KanKan" Articles
This article introduces a multimodal advertisement detection framework for WeChat KanKan that decomposes the problem into text, image, and article‑structure dimensions, presents novel models for ad text and image recognition, and describes how sequence classification and visualisation are used to filter severe ad‑spam articles.
Introduction
This article presents the design of a multimodal advertisement detection system for WeChat KanKan, starting from problem definition and breaking it down into textual, visual, and article‑structure dimensions. It explains how models are used to locate ad regions and how structural features are visualised for a complete filtering solution.
Background
In the WeChat ecosystem, massive amounts of article data are generated daily, including a small proportion of low‑quality or spam articles. Advertising articles constitute the largest spam sub‑type, prompting the construction of a dedicated ad‑recognition pipeline to filter severe ad content.
Problem Challenges
The system must identify both ad text and ad images, locate their positions, and consider overall article structure and proportion. Simple binary classification is insufficient because ads can appear at various positions (top, middle, bottom) with varying sizes and frequencies, and the model must remain explainable and maintainable.
System Framework
The proposed framework combines text, image, and structural features in a multimodal architecture. First, ad‑text detection and ad‑image recognition models identify and locate ad regions; then a sequence model that incorporates article‑structure features makes a final judgment on whether the article is severe ad spam. The overall architecture is illustrated in the figure below.
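The two-stage composition can be sketched as follows. This is a minimal illustration, not the production system: the two scorer functions are hypothetical stand-ins for the real text and image models, and the final proportion rule stands in for the sequence model described later.

```python
from typing import List, Tuple

def text_ad_scores(fragments: List[str]) -> List[float]:
    """Hypothetical ad-text scorer: one ad probability per text fragment."""
    ad_words = {"buy", "discount", "scan"}
    return [min(1.0, sum(w in ad_words for w in f.lower().split()) / 2)
            for f in fragments]

def image_ad_scores(image_ids: List[str]) -> List[float]:
    """Hypothetical ad-image scorer (e.g. a QR-code / product-shot model)."""
    return [0.9 if "qr" in i else 0.1 for i in image_ids]

def is_severe_ad(items: List[Tuple[str, str]], threshold: float = 0.5) -> bool:
    """Stage 2: fuse per-item ad probabilities in reading order.
    A simple proportion rule stands in for the structural sequence model."""
    seq: List[float] = []
    for kind, payload in items:  # items are (kind, content) in article order
        if kind == "text":
            seq.extend(text_ad_scores([payload]))
        else:
            seq.extend(image_ad_scores([payload]))
    return sum(p > threshold for p in seq) / len(seq) > 0.5
```

The key design point survives even in this toy form: stage one produces localized per-item probabilities, and stage two judges the article from their arrangement rather than from a single bag-of-features score.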
Ad Text Detection
Most severe ads consist mainly of ad text, which can be handled by traditional text classification (e.g., LR+TFIDF). However, inserted ads occupy less than 10% of the article and require fine‑grained localization. The authors propose the TADL model, which slides a window over the text, scores each fragment for ad probability, and uses a max‑pooling operation to train with only article‑level labels while still providing fragment‑level predictions.
Challenges in TADL
How to achieve detection with only article‑level annotations? (Solution: max‑pooling over fragment scores.)
How to balance local and global information? (Solution: incorporate Transformer‑style position embeddings.)
How to support ultra‑long texts? (Solution: segment‑wise inference and batch sequence padding.)
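The max-pooling trick for training with only article-level labels is a form of multiple-instance learning, and can be sketched as below. The window size, stride, and the toy linear scorer are illustrative assumptions, not TADL's actual parameters.

```python
import numpy as np

def sliding_fragments(tokens, window=4, stride=2):
    """Split an article into overlapping token windows (fragments)."""
    return [tokens[i:i + window]
            for i in range(0, max(1, len(tokens) - window + 1), stride)]

def fragment_scores(fragments, weights):
    """Toy linear scorer: ad probability per fragment (sigmoid of mean weight)."""
    raw = np.array([np.mean([weights.get(t, 0.0) for t in f]) for f in fragments])
    return 1.0 / (1.0 + np.exp(-raw))

def article_score(tokens, weights):
    """Max-pooling over fragment scores: the article-level prediction is the
    maximum fragment score, so the article label supervises the most ad-like
    fragment even though no fragment-level labels exist. At inference the
    same fragment scores give the localization for free."""
    frags = sliding_fragments(tokens)
    scores = fragment_scores(frags, weights)
    return float(scores.max()), frags[int(scores.argmax())]
```

During training, gradient flows only through the max-scoring fragment, which is exactly what lets article-level supervision teach a fragment-level detector.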
Ad Image Recognition
Advertising images are diverse (models, product shots, QR codes, embedded text). A pure end‑to‑end deep model lacks interpretability, so the authors adopt a wide‑&‑deep architecture with extensive feature engineering: object detection, scene recognition, OCR, and QR‑code detection, followed by multi‑head attention to fuse visual and textual cues.
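The fusion step can be illustrated as follows, assuming each engineered feature source (object detection, scene recognition, OCR, QR-code detector) has already been embedded into a fixed-size vector. A single attention head stands in for the multi-head version used in the article.

```python
import numpy as np

def attention_fuse(query, features):
    """Scaled dot-product attention over feature sources.

    query:    (d,)  image-level query vector
    features: (n, d) one embedding per feature source
    Returns the fused (d,) vector and the (n,) attention weights, which
    also make the model's decision inspectable per source."""
    d = query.shape[0]
    scores = features @ query / np.sqrt(d)    # (n,) relevance of each source
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ features, weights
```

Because the attention weights are explicit per feature source, this style of fusion keeps the interpretability that motivated the wide-&-deep design: one can read off whether the QR-code feature or the OCR text drove a given prediction.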
Ad Article Sequence Classification
By concatenating ad‑text fragment probabilities and ad‑image probabilities in article order, the multimodal problem is transformed into a sequence classification task. Visualisation of the probability sequence reveals typical patterns of severe ads (e.g., large top‑position ads). A BiLSTM + CNN hybrid extracts temporal trends and local spikes, and the combined features are fed to a classifier.
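The intuition behind the two branches can be sketched with simple signal-processing stand-ins: a moving average plays the BiLSTM's role of capturing the global trend, and the deviation from that trend plays the CNN's role of catching local spikes. The thresholds are illustrative assumptions, not the article's tuned values.

```python
import numpy as np

def trend_and_spikes(probs, window=3):
    """Split the in-order ad-probability sequence into a smooth trend
    and local deviations from it."""
    probs = np.asarray(probs, dtype=float)
    kernel = np.ones(window) / window
    trend = np.convolve(probs, kernel, mode="same")  # global trend
    spikes = probs - trend                           # local spikes
    return trend, spikes

def classify_severe_ad(probs, trend_th=0.5, spike_th=0.3):
    """Severe if the trend stays high (ad-dominated article) or a strong
    spike appears in the first half (e.g. a large top-position ad)."""
    trend, spikes = trend_and_spikes(probs)
    return bool(trend.mean() > trend_th
                or spikes[: len(spikes) // 2].max() > spike_th)
```

The real system learns these patterns rather than hard-coding them, but the decomposition is the same: one branch summarizes where in the article ad mass sits, the other reacts to sharp local bursts.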
Further Challenges and Optimisations
The authors discuss two later bottlenecks: (1) inaccurate position features causing false positives, solved by unifying multimodal position encoding; (2) performance degradation on very long sequences, mitigated by replacing BiLSTM with a dense layer for the final weighting, dramatically reducing inference time without sacrificing accuracy.
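The second optimisation amounts to trading recurrence for a single learned weighting over a fixed-length, padded sequence. The sketch below hand-sets the position weights to emphasise early positions purely for illustration; in the real system they would be learned.

```python
import numpy as np

MAX_LEN = 8  # illustrative fixed sequence length

def pad_sequence(probs, max_len=MAX_LEN):
    """Truncate or zero-pad the probability sequence to a fixed length."""
    probs = list(probs)[:max_len]
    return np.array(probs + [0.0] * (max_len - len(probs)))

def dense_score(probs, position_weights):
    """Dense replacement for the BiLSTM pass: one dot product plus a
    sigmoid per article, so inference cost is O(max_len) with no
    sequential dependency between positions."""
    z = pad_sequence(probs) @ position_weights
    return 1.0 / (1.0 + np.exp(-z))
```

The cost drops from a per-step recurrent update to a single vectorized dot product, which is why inference time falls sharply; the position weights still encode where in the article ad mass matters most, so accuracy need not suffer.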
Summary and Reflections
The system evolved from rule‑based methods to feature‑driven models, highlighting the importance of iterative feature engineering, problem decomposition, cross‑domain technique transfer (e.g., image detection tricks applied to NLP), and the trade‑off between model effectiveness and operational performance.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.