How Hi‑Guard Delivers Trustworthy Multimodal Content Moderation with Policy‑Aligned Reasoning
The Hi-Guard framework transforms content moderation by aligning multimodal models with policy rules through hierarchical prompting, a structured taxonomy, and soft‑margin reinforcement learning, achieving significant gains in accuracy, precision, recall, and explainability for large‑scale user‑generated content platforms.
Introduction
Content safety is a critical pillar of platform governance, requiring accurate detection of pornographic, violent, and other policy‑violating material. Traditional black‑box models struggle with complex semantics and rule alignment, prompting the need for a policy‑driven, explainable moderation system. The Hi‑Guard framework was proposed to address these challenges and was accepted at KDD 2026.
1. Key Challenges in Existing Moderation Pipelines
Drift from policy standards: Models learn from noisy labels rather than the underlying policy, so they fall out of step as platform rules evolve.
Opaque decision process: Black‑box scores lack verifiable evidence, creating a gap between model outputs and human reviewers.
Difficulty distinguishing similar rules: Models often confuse closely related categories (e.g., “over‑sexualized minors” vs. “inappropriate clothing”), leading to over‑ or under‑moderation.
2. Hi‑Guard Framework
2.1 Learning Rules Instead of Pure Data Fitting
Hi‑Guard employs hierarchical prompting to embed policy logic directly into the model’s reasoning. The model follows explicit prompts that encode rules and accumulated domain knowledge, enabling better generalization to unseen scenarios and rapid adaptation through prompt updates.
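To make this concrete, here is a minimal sketch of what rule‑embedded prompting could look like; the rule text, helper name, and tag format are illustrative assumptions, not Hi‑Guard's actual prompts.

```python
# Minimal sketch of rule-embedded prompting. The rule wording, helper name,
# and tag format are illustrative assumptions, not Hi-Guard's real prompts.

POLICY_RULES = {
    "minor_safety": "Flag content that sexualizes or endangers minors.",
    "violence": "Flag graphic depictions of violence or gore.",
}

def build_moderation_prompt(rules: dict, caption: str) -> str:
    """Embed explicit policy rules in the prompt so the model reasons
    against the rules themselves rather than against noisy labels."""
    rule_block = "\n".join(f"- [{name}] {text}" for name, text in rules.items())
    return (
        "You are a content-safety reviewer. Apply these policy rules:\n"
        f"{rule_block}\n\n"
        "Reason step by step inside <think>...</think>, then output the\n"
        "final category path inside <answer>...</answer>.\n\n"
        f"Content to review: {caption}"
    )

print(build_moderation_prompt(POLICY_RULES, "A child posing with a wine bottle."))
```

Because the rules live in the prompt rather than only in the weights, a policy change becomes a prompt edit, which is what enables the rapid adaptation described above.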
2.2 Hierarchical Taxonomy
The flat classification task is reformulated as a path‑prediction problem: Domain → Topic → Subtype → Behavior. By progressively narrowing the search space at each level, the model focuses on fine‑grained features, moving classification from vague judgment to precise targeting.
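As a sketch of the narrowing idea, consider a nested taxonomy walked one level at a time; every category name below is invented for illustration and is not Hi‑Guard's real label set.

```python
# Hypothetical four-level taxonomy; all category names are invented for
# illustration and do not reflect Hi-Guard's actual label set.
TAXONOMY = {
    "minor_safety": {                          # Domain
        "appearance": {                        # Topic
            "inappropriate_clothing":          # Subtype
                ["posting", "resharing"],      # Behaviors (leaf level)
        },
    },
}

def is_valid_path(path):
    """Walk the tree one level at a time, narrowing the candidate space,
    and report whether the predicted Domain -> Topic -> Subtype -> Behavior
    path actually exists in the taxonomy."""
    node = TAXONOMY
    for level in path:
        if isinstance(node, dict):
            if level not in node:
                return False
            node = node[level]
        else:                      # reached the behavior list at the leaf
            return level in node
    return True

assert is_valid_path(["minor_safety", "appearance",
                      "inappropriate_clothing", "posting"])
```

Constraining predictions to valid paths is also what lets the reward in the next section reason about *where* along the path an error occurred.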
2.3 Soft‑Margin Reward & GRPO
During optimization, Hi‑Guard adopts Group Relative Policy Optimization (GRPO) with a path‑aware soft‑margin reward (a code sketch follows the list below):
Hierarchical penalties: Misclassifications to sibling categories receive lighter penalties, while cross‑domain errors incur heavier penalties.
Depth‑weighted penalties: Errors at deeper, finer levels are penalized more strongly, forcing the model to “think deeply” on difficult cases.
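A minimal sketch of how such a reward could be computed; the depth weights and scoring formula here are assumptions for illustration, not the paper's exact soft‑margin definition.

```python
# Sketch of a path-aware, depth-weighted reward. The weights and formula
# are illustrative assumptions, not the paper's exact soft-margin reward.

DEPTH_WEIGHTS = [1.0, 1.5, 2.0, 2.5]   # Domain, Topic, Subtype, Behavior

def path_reward(pred, gold):
    """Score each level of the predicted path: +weight while the prefix
    matches, -weight from the first mismatch down. A wrong Domain breaks
    every level below it, so cross-domain errors accumulate the heaviest
    penalty, while sibling confusion at one deep level loses only the
    weights from that point on."""
    reward, on_path = 0.0, True
    for depth, (p, g) in enumerate(zip(pred, gold)):
        on_path = on_path and (p == g)
        reward += DEPTH_WEIGHTS[depth] if on_path else -DEPTH_WEIGHTS[depth]
    return reward

gold    = ["minor_safety", "appearance", "inappropriate_clothing", "posting"]
sibling = ["minor_safety", "appearance", "over_sexualized_minors", "posting"]
cross   = ["violence", "gore", "graphic_injury", "posting"]
print(path_reward(sibling, gold))   # -2.0: lighter sibling-level error
print(path_reward(cross, gold))     # -7.0: heavier cross-domain error
```

Inside GRPO, rewards like this are compared across a group of sampled responses, so a near‑miss path still earns relatively more than a cross‑domain one instead of being scored as uniformly wrong.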
3. Experimental Results
3.1 Performance Gains
On zero‑shot tests for long‑tail and unseen categories, Hi‑Guard outperforms traditional supervised fine‑tuning (SFT) variants:
Overall accuracy improves by 12.13%.
Precision on risky content rises by 14.02%, and recall by 10.28%.
3.2 Ablation Study
Injecting structured policy rules yields the largest performance boost, followed by the hierarchical labeling design.
3.3 Explainability via Chain‑of‑Thought
Hi‑Guard generates a structured reasoning trace (<think>) before producing the final decision (<answer>). In a case where a child’s photo contains a wine bottle, the model correctly identifies the bottle but dismisses drinking risk based on context, while still flagging inappropriate clothing, demonstrating nuanced, rule‑consistent judgment.
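As an illustration of this output contract, a structured response can be parsed mechanically; the trace text and category path below are invented, not the paper's verbatim example.

```python
import re

# Hypothetical model output following the <think>/<answer> format described
# above; the trace wording and path are invented for illustration.
raw = """<think>The image shows a child holding a wine bottle. The bottle is
unopened and part of a dinner-table setting, so drinking risk does not apply.
However, the child's clothing matches the inappropriate-clothing rule.</think>
<answer>minor_safety/appearance/inappropriate_clothing/posting</answer>"""

def parse_trace(text):
    """Split a structured response into its reasoning trace and the
    predicted Domain -> Topic -> Subtype -> Behavior path."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL).group(1).strip()
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL).group(1).strip()
    return think, answer.split("/")

reasoning, path = parse_trace(raw)
print(path)  # ['minor_safety', 'appearance', 'inappropriate_clothing', 'posting']
```

Keeping the trace machine‑parsable is what closes the gap with human reviewers: the <think> block gives auditors verifiable evidence, while the <answer> path feeds directly into the taxonomy check sketched earlier.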
Conclusion and Future Work
Hi‑Guard validates a scalable moderation pipeline that combines reinforcement‑driven generative reasoning with policy alignment and hierarchical constraints. Future directions include dynamic “instruction‑tuned” moderation models that allow business teams to update policies instantly via prompt modifications, further advancing transparent and intelligent content governance.