BLM‑Guard: Explainable Multimodal Ad Moderation Using Chain‑of‑Thought and Policy‑Aligned RL

The paper introduces BLM‑Guard, an explainable multimodal ad‑moderation framework that combines interleaved‑modal chain‑of‑thought reasoning with a policy‑aligned reinforcement‑learning reward to detect hidden cross‑modal violations in short‑video ads, and presents a new benchmark on which the framework achieves state‑of‑the‑art performance across multiple risk scenarios.

Research Background

Short‑video advertising often hides violations across modalities: the visual stream may appear compliant while the audio or subtitles convey illicit messages. This cross‑modal mismatch demands a moderation system that is accurate, interpretable, and able to keep up with rapidly changing policy rules.

Technical Solution

Two‑Stage Training Paradigm

Stage 1 – Rule‑Anchored ICoT Cold‑Start synthesises training data with adaptive key‑frame sampling (AKS) and patch‑level localisation (InternViT‑6B). AKS selects representative frames by computing the CLIP cosine similarity between each frame and a set of risk keywords (e.g., “false marketing”, “illegal content”), then applies a BIN+TOP strategy to balance temporal coverage and semantic saliency. Patch‑level localisation extracts patch embeddings, uses their L2 norms as saliency scores, and highlights critical regions such as subtitles or product close‑ups. A frozen InternVL‑3‑78B teacher then generates structured Interleaved‑modal Chain‑of‑Thought (ICoT) reasoning chains of the form

visual‑location → risk‑screening → causal‑analysis → final‑decision.

The student model is trained with a KL‑regularised loss that pulls its token distribution toward rule‑keyword priors, thereby embedding explicit rule knowledge into the model’s latent space.
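The mechanics of AKS are easy to picture in code. Below is a minimal sketch that uses a CLIP model from Hugging Face transformers as the scoring backbone (the paper uses InternViT‑6B for patch localisation; CLIP’s vision tower stands in here), with the keyword list, bin count, and top‑k as illustrative assumptions rather than the paper’s actual configuration.

```python
# Minimal sketch of AKS frame scoring, BIN+TOP selection, and patch saliency.
# The CLIP backbone, bin count, top-k, and keyword list are illustrative
# assumptions, not the paper's configuration.
import torch
from transformers import CLIPModel, CLIPProcessor

RISK_KEYWORDS = ["false marketing", "illegal content"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_risk_scores(frames):
    """Score each frame by its max cosine similarity to any risk keyword."""
    inputs = processor(text=RISK_KEYWORDS, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).max(dim=-1).values            # (num_frames,)

def bin_top_select(scores, num_bins=8, top_k=8):
    """BIN+TOP: best frame per temporal bin (coverage) plus global top-k (saliency)."""
    n = scores.shape[0]
    edges = torch.linspace(0, n, num_bins + 1).long()
    per_bin = {int(edges[i] + scores[edges[i]:edges[i + 1]].argmax())
               for i in range(num_bins) if edges[i] < edges[i + 1]}
    global_top = set(scores.topk(min(top_k, n)).indices.tolist())
    return sorted(per_bin | global_top)

def patch_saliency(pixel_values):
    """Patch-level saliency: L2 norm of each patch embedding (CLS token dropped)."""
    with torch.no_grad():
        hidden = model.vision_model(pixel_values=pixel_values).last_hidden_state
    return hidden[:, 1:, :].norm(dim=-1)               # (num_frames, num_patches)
```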
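The cold‑start objective can be sketched the same way: standard next‑token cross‑entropy on the teacher’s ICoT chains plus a KL term toward a rule‑keyword prior. The prior construction here (a uniform logit boost over rule‑keyword token ids) and the weight `lam` are illustrative assumptions, not the paper’s published loss.

```python
# Sketch of the KL-regularised cold-start loss. The prior construction
# (uniform boost over rule-keyword token ids) and the weight `lam` are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def rule_keyword_prior(vocab_size, keyword_token_ids, boost=5.0):
    """Prior distribution that up-weights tokens appearing in rule keywords."""
    logits = torch.zeros(vocab_size)
    logits[keyword_token_ids] = boost
    return F.softmax(logits, dim=-1)

def cold_start_loss(student_logits, target_ids, prior, lam=0.1):
    """Cross-entropy on teacher-generated chains + lam * KL(student || prior)."""
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), target_ids.view(-1))
    p_student = F.softmax(student_logits, dim=-1)
    # F.kl_div(input=log q, target=p) computes KL(p || q); here q is the prior.
    kl = F.kl_div(prior.log().expand_as(p_student), p_student,
                  reduction="batchmean")
    return ce + lam * kl
```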

Stage 2 – Self‑Adaptive Critique Reward (SCA‑R) Reinforcement Learning addresses policy drift. A guide model (e.g., GPT‑4o) dynamically constructs a scoring rubric from the current policy and the input, producing a self‑critique score. Training uses GRPO (Group Relative Policy Optimization) with a reward that combines three components:

Rule correctness (discrete reward: 1.0 if both the scene and the violation type are correct, 0.5 if only the scene matches, otherwise 0).

Structural constraint (hard reward enforcing the presence of <think> and <answer> tags in the output).

SCA‑R (adaptive critique) – weighted by the guide model’s assessment.

The overall reward is a weighted sum of these terms, encouraging the model to produce logically consistent explanations and to remain aligned with evolving policies.
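To make the reward structure concrete, here is a minimal sketch of how the three components might combine, together with GRPO’s group‑relative advantage normalisation. The component weights, the tag regexes, and the scale of the critique score are illustrative assumptions.

```python
# Sketch of the combined GRPO reward and group-relative advantages.
# Component weights, the tag regexes, and the critique-score scale
# are illustrative assumptions.
import re
import statistics

def rule_reward(pred_scene, pred_type, gold_scene, gold_type):
    """Discrete rule correctness: 1.0 for scene+type, 0.5 for scene only, else 0."""
    if pred_scene != gold_scene:
        return 0.0
    return 1.0 if pred_type == gold_type else 0.5

def format_reward(output):
    """Hard structural constraint: output must contain <think> and <answer> blocks."""
    has_think = re.search(r"<think>.*?</think>", output, re.S)
    has_answer = re.search(r"<answer>.*?</answer>", output, re.S)
    return 1.0 if (has_think and has_answer) else 0.0

def total_reward(sample, critique_score, weights=(1.0, 0.5, 1.0)):
    """Weighted sum of rule correctness, structure, and the SCA-R critique."""
    r_rule = rule_reward(sample["pred_scene"], sample["pred_type"],
                         sample["gold_scene"], sample["gold_type"])
    r_fmt = format_reward(sample["output"])
    w_rule, w_fmt, w_critic = weights
    return w_rule * r_rule + w_fmt * r_fmt + w_critic * critique_score

def group_advantages(rewards, eps=1e-6):
    """GRPO: normalise rewards within a group of rollouts for the same input."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]
```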

BLM‑Guard Benchmark

The benchmark is a multimodal ad‑risk dataset with a three‑level taxonomy (risk scene, violation type, severity) covering seven core scenarios (illegal content, false marketing, misleading operations, etc.). It enables fine‑grained evaluation of both binary decisions and reasoning consistency.
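For illustration, a single annotation under this taxonomy might look like the record below; the field names and example values are assumptions based on the description above, not the benchmark’s published schema.

```python
# Illustrative shape of one benchmark annotation; field names and example
# values are assumptions based on the three-level taxonomy described above.
label = {
    "risk_scene": "false marketing",                 # level 1: one of seven core scenarios
    "violation_type": "exaggerated efficacy claim",  # level 2
    "severity": "high",                              # level 3
    "evidence": {
        "modality": "audio",          # where the violation actually surfaces
        "timestamp": [12.4, 18.9],    # offending segment, in seconds
    },
}
```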

Performance

Evaluated on BLM‑Guard and five public datasets (including UCF), BLM‑Guard consistently outperforms strong baselines such as Qwen2.5‑VL and InternVL‑3‑8B. It raises strict accuracy by more than 20 % on high‑risk scenes and achieves a reasoning‑consistency score of 0.845, far above baseline levels of 0.5–0.6.

[Figure: performance comparison chart]

Ablation Study

Combining Rule‑anchored SFT with SCA‑R yields the best performance. Pure SFT suffers from hallucination, while adding SCA‑R makes the model more cautious under uncertainty and improves generalisation.

[Figure: ablation results]

Future Outlook

Planned extensions include:

Developing a unified OneModel that jointly handles understanding and generation.

Releasing the KwaiBLM risk foundation model for broader adoption.

Building a multi‑agent RiskAgent system (RiskMatrix) for collaborative risk mitigation.

Enhancing deep‑fake detection and counter‑measures for AI‑generated content.

Exploring dynamic‑graph algorithms that combine Graph RAG with large‑model reasoning to capture hidden relational risks.
