How RAVEN Leverages Reinforcement Reasoning for Precise Ad Video Violation Grounding

RAVEN is a reinforcement‑reasoning framework that combines curriculum learning with hierarchical rewards to enable multimodal large language models to accurately locate and classify violation segments in advertisement videos, even under noisy, large‑scale industrial data.

Tencent Advertising Technology

Dec 25, 2025

Background and Challenges

Detecting violations in advertisement videos requires pinpointing the exact time span of each offending segment and correctly classifying its type. Traditional small‑scale supervised models struggle with noisy annotations and poor generalisation, while fine‑tuned multimodal LLMs are sensitive to label noise, suffer catastrophic forgetting, and lack explicit reasoning.

RAVEN Overview

The Tencent Advertising Technology team proposes RAVEN (Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning), a multimodal LLM framework that activates temporal reasoning without relying on manually annotated reasoning data. RAVEN integrates curriculum reinforcement learning and a hierarchical reward mechanism to achieve precise temporal grounding and robust classification.

Core Innovations

Curriculum Reinforcement Learning: A three‑stage training pipeline (precise data → coarse data → full dataset) gradually increases task difficulty, stabilising learning on noisy industrial data.

Hierarchical Rewards: Combines format rewards (enforcing <think> and <answer> structures), accuracy rewards (temporal IoU, boundary alignment, category consistency), and dynamic weighting across stages.

Structured Reasoning Mechanism: The model generates a full reasoning chain inside <think> tags and outputs a structured result inside <answer>, providing interpretability and logical consistency.
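To make the structured output concrete, the following minimal sketch splits a model response into its reasoning chain and its answer payload. The function name `parse_response`, the `ParsedResponse` type, and the sample response text are hypothetical; the article only states that reasoning sits inside <think> and the result inside <answer>, so the payload is returned as a raw string rather than decoded.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    reasoning: str  # text found inside <think>...</think>
    answer: str     # raw payload found inside <answer>...</answer>


# Assumption: the reasoning block comes first and the answer block follows it.
_RESPONSE = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_response(text: str) -> Optional[ParsedResponse]:
    """Return the reasoning and answer parts, or None if the structure is missing."""
    match = _RESPONSE.search(text)
    if match is None:
        return None
    return ParsedResponse(match.group(1).strip(), match.group(2).strip())


if __name__ == "__main__":
    sample = (
        "<think>Frames 3-5 show X.</think>\n"
        "<answer>{category: A, interval: [3, 5]}</answer>"
    )
    print(parse_response(sample))
```

Keeping the two parts separate lets a reviewer read the reasoning while downstream code consumes only the answer payload.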

Reward Design Details

Format Reward

Reasoning must be enclosed in <think> tags.

Final answer must follow the pattern <answer>{category:..., interval:...}</answer>.

Temporal keywords "temporal start" and "temporal end" are required.
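The three format rules above could be checked with a few regular expressions. This is a sketch under stated assumptions, not RAVEN's actual implementation: the all‑or‑nothing score of 1.0 or 0.0, the case‑insensitive keyword match anywhere in the response, and the name `format_reward` are all illustrative choices.

```python
import re

# Rule 1: a non-empty reasoning block.
_THINK = re.compile(r"<think>.+?</think>", re.DOTALL)
# Rule 2: an answer block containing category and interval fields, in that order.
_ANSWER = re.compile(
    r"<answer>\s*\{.*?category\s*:.*?interval\s*:.*?\}\s*</answer>", re.DOTALL
)
# Rule 3: the required temporal keywords (assumed to be allowed anywhere).
_KEYWORDS = ("temporal start", "temporal end")


def format_reward(response: str) -> float:
    """Return 1.0 only if all three format rules hold, otherwise 0.0 (assumed scoring)."""
    has_think = _THINK.search(response) is not None
    has_answer = _ANSWER.search(response) is not None
    lowered = response.lower()
    has_keywords = all(keyword in lowered for keyword in _KEYWORDS)
    return 1.0 if (has_think and has_answer and has_keywords) else 0.0


if __name__ == "__main__":
    good = (
        "<think>The temporal start is 3s and the temporal end is 5s.</think>"
        "<answer>{category: misleading, interval: [3, 5]}</answer>"
    )
    print(format_reward(good))
```

A graded variant that gives partial credit per rule would also fit the description; the source does not say which is used.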

Accuracy Rewards

Temporal IoU Reward: Measures overlap between predicted and ground‑truth intervals.

Boundary Alignment Reward: Encourages exact start/end matching.

Category Consistency Reward: Ensures predicted violation categories match the ground truth.
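The three accuracy rewards can be illustrated as small scoring functions. Temporal IoU is the standard intersection‑over‑union of two time intervals. The boundary‑alignment formula (linear decay to zero at a `tolerance` in seconds) and the case‑insensitive category comparison are assumptions made for this sketch, since the article does not give exact definitions.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, truth: Interval) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def boundary_alignment(pred: Interval, truth: Interval, tolerance: float = 1.0) -> float:
    """Assumed form: mean closeness of start and end, each falling
    linearly from 1 (exact match) to 0 at `tolerance` seconds of error."""
    def closeness(a: float, b: float) -> float:
        return max(0.0, 1.0 - abs(a - b) / tolerance)

    return 0.5 * (closeness(pred[0], truth[0]) + closeness(pred[1], truth[1]))


def category_consistency(pred: str, truth: str) -> float:
    """1.0 when the category labels match (ignoring case and padding), else 0.0."""
    return 1.0 if pred.strip().lower() == truth.strip().lower() else 0.0


if __name__ == "__main__":
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
    print(boundary_alignment((3.0, 5.5), (3.0, 5.0)))
    print(category_consistency("Misleading ", "misleading"))
```

IoU alone can score two predictions equally even when one has a much better start time, which is one reason a separate boundary term is useful.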

Curriculum Training Stages

Stage 1 – Precise Annotations: Train on a small, accurately labelled subset. Rewards use all three accuracy components.

Stage 2 – Coarse Annotations: Train on large, noisy data. Rewards are simplified to overall temporal position and boundary alignment.

Stage 3 – Full Dataset Fine‑tuning: Combine precise and coarse data, balancing all reward components for robust performance.
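One way to picture the stage‑dependent weighting is a per‑stage weight table applied to the reward components. Everything numeric here is a placeholder: the article does not publish weights, and treating temporal IoU as the "overall position" signal in Stage 2 is an assumption. `StageConfig`, `CURRICULUM`, and `total_reward` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StageConfig:
    name: str
    data: str                  # which annotation subset the stage trains on
    weights: Dict[str, float]  # illustrative reward weights, not published values


CURRICULUM = [
    # Stage 1: small precise subset, all three accuracy components active.
    StageConfig("stage1", "precise",
                {"format": 1.0, "iou": 1.0, "boundary": 1.0, "category": 1.0}),
    # Stage 2: large noisy subset, only position (IoU, assumed) and boundary.
    StageConfig("stage2", "coarse",
                {"format": 1.0, "iou": 1.0, "boundary": 1.0, "category": 0.0}),
    # Stage 3: both subsets, all components balanced.
    StageConfig("stage3", "precise+coarse",
                {"format": 1.0, "iou": 1.0, "boundary": 1.0, "category": 1.0}),
]


def total_reward(components: Dict[str, float], stage: StageConfig) -> float:
    """Weighted average of the reward components under the stage's weights."""
    weight_sum = sum(stage.weights.values())
    weighted = sum(w * components.get(name, 0.0) for name, w in stage.weights.items())
    return weighted / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    scores = {"format": 1.0, "iou": 0.5, "boundary": 0.5, "category": 0.0}
    for stage in CURRICULUM:
        print(stage.name, round(total_reward(scores, stage), 3))
```

With these placeholder weights, a response with a wrong category is penalised in Stages 1 and 3 but not in Stage 2, which mirrors the idea of relaxing the reward when labels are coarse.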

Experimental Validation

Offline Tests

RAVEN was compared against baselines LLaVA‑v1.5, Qwen2‑VL‑7B, and Qwen2.5‑VL‑7B (including their supervised‑fine‑tuned versions). It achieved superior violation‑category accuracy and temporal‑grounding precision, demonstrating the effectiveness of curriculum RL for robustness.
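An offline comparison of this kind typically reports a category metric and a temporal metric per model. The sketch below computes category accuracy, mean temporal IoU, and the share of samples whose IoU reaches a threshold. The exact metrics and the 0.5 threshold are assumptions, since the article does not specify them, and `evaluate` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)
Sample = Tuple[str, Interval]   # (category, interval)


def _iou(a: Interval, b: Interval) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(preds: List[Sample], truths: List[Sample],
             iou_threshold: float = 0.5) -> Dict[str, float]:
    """Aggregate category and temporal metrics over paired predictions."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and the same length")
    n = len(preds)
    ious = [_iou(p[1], t[1]) for p, t in zip(preds, truths)]
    correct = sum(p[0] == t[0] for p, t in zip(preds, truths))
    return {
        "category_accuracy": correct / n,
        "mean_iou": sum(ious) / n,
        # Share of samples whose interval overlaps the truth well enough.
        "iou_hit_rate": sum(i >= iou_threshold for i in ious) / n,
    }


if __name__ == "__main__":
    preds = [("a", (0.0, 10.0)), ("b", (0.0, 4.0))]
    truths = [("a", (0.0, 10.0)), ("c", (2.0, 6.0))]
    print(evaluate(preds, truths))
```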

Online A/B Tests

RAVEN was deployed on Tencent’s ad‑review platform on 20% of traffic. It outperformed both a smaller model and Qwen2.5‑VL‑7B‑SFT, improving category precision and recall and achieving 8.5% higher interval accuracy.

Generalisation Study

RL‑trained RAVEN retained broader capabilities compared to SFT models, achieving higher accuracy on out‑of‑domain violation categories (e.g., low‑quality content, prohibited goods).

Ablation of Rewards and Curriculum

Removing either the format reward or the boundary‑alignment reward reduced performance, confirming that both contribute. Training without the curriculum caused a 4.7% drop in temporal IoU, which indicates that progressive learning is necessary.

Conclusion

RAVEN demonstrates that combining curriculum reinforcement learning with hierarchical rewards enables multimodal LLMs to perform robust temporal grounding of ad‑video violations without extensive human‑annotated reasoning data. The framework achieves state‑of‑the‑art accuracy, mitigates catastrophic forgetting, and offers a scalable solution for real‑world ad compliance systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
