How Hi‑Guard Delivers Trustworthy Multimodal Content Moderation with Policy‑Aligned Reasoning
The Hi-Guard framework transforms content moderation by aligning multimodal models with policy rules through hierarchical prompting, a structured taxonomy, and soft‑margin reinforcement learning, achieving significant gains in accuracy, precision, recall, and explainability for large‑scale user‑generated content platforms.
Introduction
Content safety is a critical pillar of platform governance, requiring accurate detection of pornographic, violent, and other policy‑violating material. Traditional black‑box models struggle with complex semantics and rule alignment, prompting the need for a policy‑driven, explainable moderation system. The Hi‑Guard framework was proposed to address these challenges and was accepted at KDD 2026.
1. Key Challenges in Existing Moderation Pipelines
Drift from policy standards: Models learn from noisy labels rather than the underlying policy, so they fall out of step as platform rules evolve.
Opaque decision process: Black‑box scores lack verifiable evidence, creating a gap between model outputs and human reviewers.
Difficulty distinguishing similar rules: Models often confuse closely related categories (e.g., “over‑sexualized minors” vs. “inappropriate clothing”), leading to over‑ or under‑moderation.
2. Hi‑Guard Framework
2.1 Learning Rules Instead of Pure Data Fitting
Hi‑Guard employs hierarchical prompting to embed policy logic directly into the model’s reasoning. The model follows explicit prompts that encode rules and accumulated domain knowledge, enabling better generalization to unseen scenarios and rapid adaptation through prompt updates.
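To make this concrete, here is a minimal sketch of what rule‑embedded prompting could look like; the rule text, helper name, and tag format are illustrative assumptions, not Hi‑Guard's actual prompts.

```python
# Minimal sketch of rule-embedded prompting. The rule wording, helper name,
# and tag format are illustrative assumptions, not Hi-Guard's real prompts.

POLICY_RULES = {
    "minor_safety": "Flag content that sexualizes or endangers minors.",
    "violence": "Flag graphic depictions of violence or gore.",
}

def build_moderation_prompt(rules: dict, caption: str) -> str:
    """Embed explicit policy rules in the prompt so the model reasons
    against the rules themselves rather than against noisy labels."""
    rule_block = "\n".join(f"- [{name}] {text}" for name, text in rules.items())
    return (
        "You are a content-safety reviewer. Apply these policy rules:\n"
        f"{rule_block}\n\n"
        "Reason step by step inside <think>...</think>, then output the\n"
        "final category path inside <answer>...</answer>.\n\n"
        f"Content to review: {caption}"
    )

print(build_moderation_prompt(POLICY_RULES, "A child posing with a wine bottle."))
```

Because the rules live in the prompt rather than only in the weights, a policy change becomes a prompt edit, which is what enables the rapid adaptation described above.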
2.2 Hierarchical Taxonomy
The flat classification task is reformulated as a path‑prediction problem: Domain → Topic → Subtype → Behavior. By progressively narrowing the search space at each level, the model focuses on fine‑grained features, moving classification from vague judgment to precise targeting.
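As a sketch of the narrowing idea, consider a nested taxonomy walked one level at a time; every category name below is invented for illustration and is not Hi‑Guard's real label set.

```python
# Hypothetical four-level taxonomy; all category names are invented for
# illustration and do not reflect Hi-Guard's actual label set.
TAXONOMY = {
    "minor_safety": {                          # Domain
        "appearance": {                        # Topic
            "inappropriate_clothing":          # Subtype
                ["posting", "resharing"],      # Behaviors (leaf level)
        },
    },
}

def is_valid_path(path):
    """Walk the tree one level at a time, narrowing the candidate space,
    and report whether the predicted Domain -> Topic -> Subtype -> Behavior
    path actually exists in the taxonomy."""
    node = TAXONOMY
    for level in path:
        if isinstance(node, dict):
            if level not in node:
                return False
            node = node[level]
        else:                      # reached the behavior list at the leaf
            return level in node
    return True

assert is_valid_path(["minor_safety", "appearance",
                      "inappropriate_clothing", "posting"])
```

Constraining predictions to valid paths is also what lets the reward in the next section reason about *where* along the path an error occurred.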
2.3 Soft‑Margin Reward & GRPO
During optimization, Hi‑Guard adopts Group Relative Policy Optimization (GRPO) with a path‑aware soft‑margin reward (a code sketch follows the list below):
Hierarchical penalties: Misclassifications to sibling categories receive lighter penalties, while cross‑domain errors incur heavier penalties.
Depth‑weighted penalties: Errors at deeper, finer levels are penalized more strongly, forcing the model to “think deeply” on difficult cases.
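A minimal sketch of how such a reward could be computed; the depth weights and scoring formula here are assumptions for illustration, not the paper's exact soft‑margin definition.

```python
# Sketch of a path-aware, depth-weighted reward. The weights and formula
# are illustrative assumptions, not the paper's exact soft-margin reward.

DEPTH_WEIGHTS = [1.0, 1.5, 2.0, 2.5]   # Domain, Topic, Subtype, Behavior

def path_reward(pred, gold):
    """Score each level of the predicted path: +weight while the prefix
    matches, -weight from the first mismatch down. A wrong Domain breaks
    every level below it, so cross-domain errors accumulate the heaviest
    penalty, while sibling confusion at one deep level loses only the
    weights from that point on."""
    reward, on_path = 0.0, True
    for depth, (p, g) in enumerate(zip(pred, gold)):
        on_path = on_path and (p == g)
        reward += DEPTH_WEIGHTS[depth] if on_path else -DEPTH_WEIGHTS[depth]
    return reward

gold    = ["minor_safety", "appearance", "inappropriate_clothing", "posting"]
sibling = ["minor_safety", "appearance", "over_sexualized_minors", "posting"]
cross   = ["violence", "gore", "graphic_injury", "posting"]
print(path_reward(sibling, gold))   # -2.0: lighter sibling-level error
print(path_reward(cross, gold))     # -7.0: heavier cross-domain error
```

Inside GRPO, rewards like this are compared across a group of sampled responses, so a near‑miss path still earns relatively more than a cross‑domain one instead of being scored as uniformly wrong.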
3. Experimental Results
3.1 Performance Gains
On zero‑shot tests for long‑tail and unseen categories, Hi‑Guard outperforms traditional supervised fine‑tuning (SFT) variants:
Overall accuracy improves by 12.13%.
Precision on risky content rises by 14.02%, and recall by 10.28%.
3.2 Ablation Study
Injecting structured policy rules yields the largest performance boost, followed by the hierarchical labeling design.
3.3 Explainability via Chain‑of‑Thought
Hi‑Guard generates a structured reasoning trace (<think>) before producing the final decision (<answer>). In a case where a child’s photo contains a wine bottle, the model correctly identifies the bottle but dismisses drinking risk based on context, while still flagging inappropriate clothing, demonstrating nuanced, rule‑consistent judgment.
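As an illustration of this output contract, a structured response can be parsed mechanically; the trace text and category path below are invented, not the paper's verbatim example.

```python
import re

# Hypothetical model output following the <think>/<answer> format described
# above; the trace wording and path are invented for illustration.
raw = """<think>The image shows a child holding a wine bottle. The bottle is
unopened and part of a dinner-table setting, so drinking risk does not apply.
However, the child's clothing matches the inappropriate-clothing rule.</think>
<answer>minor_safety/appearance/inappropriate_clothing/posting</answer>"""

def parse_trace(text):
    """Split a structured response into its reasoning trace and the
    predicted Domain -> Topic -> Subtype -> Behavior path."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL).group(1).strip()
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL).group(1).strip()
    return think, answer.split("/")

reasoning, path = parse_trace(raw)
print(path)  # ['minor_safety', 'appearance', 'inappropriate_clothing', 'posting']
```

Keeping the trace machine‑parsable is what closes the gap with human reviewers: the <think> block gives auditors verifiable evidence, while the <answer> path feeds directly into the taxonomy check sketched earlier.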
Conclusion and Future Work
Hi‑Guard validates a scalable moderation pipeline that combines reinforcement‑driven generative reasoning with policy alignment and hierarchical constraints. Future directions include dynamic “instruction‑tuned” moderation models that allow business teams to update policies instantly via prompt modifications, further advancing transparent and intelligent content governance.