How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality
This article analyzes KuaiMod, a multimodal large‑model solution developed by Kuaishou for short‑video content quality assessment, detailing its benchmark dataset, chain‑of‑thought data construction, offline SFT + DPO training, online reinforcement‑learning updates, evaluation results, and large‑scale deployment impact.
Background and Motivation
Short‑video platforms now serve billions of daily users, making content moderation and quality‑aware recommendation critical challenges. Traditional rule‑based or static language‑model approaches struggle to keep pace with the rapidly evolving nature of low‑quality content, resulting in high labeling costs and poor accuracy.
KuaiMod Benchmark and Dataset
Kuaishou built the first short‑video content‑quality benchmark, collecting 1,000 real videos from its platform and labeling them across four major and fifteen fine‑grained low‑quality categories. The dataset is fully human‑annotated, cleaned, and publicly released for research.
Model Architecture
KuaiMod uses Kuaishou’s YuanQi multimodal foundation model as the base. The model processes video metadata (title, cover, frames, OCR/ASR text, comments) and generates quality judgments via chain‑of‑thought reasoning.
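The article does not publish the exact input format, but the metadata assembly can be sketched as follows. This is a minimal illustration with hypothetical field names; the title, OCR/ASR text, and comments are flattened into a text prompt, while the cover and frames would go to the model's vision encoder separately:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoMeta:
    """Hypothetical container for the metadata KuaiMod consumes."""
    title: str
    ocr_text: str                 # text extracted from frames
    asr_text: str                 # speech transcript
    comments: List[str] = field(default_factory=list)

def build_prompt(meta: VideoMeta, max_comments: int = 5) -> str:
    """Flatten video metadata into a single text prompt; cover and
    frames are passed to the vision encoder, not shown here."""
    parts = [
        f"Title: {meta.title}",
        f"OCR: {meta.ocr_text}",
        f"ASR: {meta.asr_text}",
        "Top comments:",
        *[f"- {c}" for c in meta.comments[:max_comments]],
        "Task: assess whether this video is low quality and explain why.",
    ]
    return "\n".join(parts)
```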
Chain‑of‑Thought Data Construction
Tag2CoT: For each video, the multimodal model receives the video data and the human‑assigned low‑quality tag, then produces a detailed reasoning chain that explains the tag.
CoT2Tag: The reasoning chain is structured into five stages—content extraction, analysis, intermediate check, user‑feedback analysis, and final judgment—providing a systematic format for training.
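The two stages above can be sketched as a small data‑construction pipeline. `generate` stands in for a call to the multimodal model, and the prompts are illustrative assumptions, not the paper's actual templates:

```python
# The five fixed reasoning stages named in the CoT2Tag step.
STAGES = [
    "content extraction",
    "analysis",
    "intermediate check",
    "user-feedback analysis",
    "final judgment",
]

def tag2cot(generate, video_prompt: str, human_tag: str) -> str:
    """Tag2CoT: given the video and its human-assigned tag, ask the
    model for a reasoning chain that justifies the tag."""
    return generate(
        f"{video_prompt}\nThis video was labeled '{human_tag}'. "
        "Explain step by step why this label applies."
    )

def cot2tag(generate, video_prompt: str, rationale: str) -> dict:
    """CoT2Tag: restructure the free-form rationale into the five
    fixed stages used as the training format."""
    return {
        stage: generate(
            f"{video_prompt}\nRationale: {rationale}\n"
            f"Rewrite the part of the rationale covering: {stage}."
        )
        for stage in STAGES
    }
```

The resulting structured rationales become the targets for supervised fine‑tuning described next.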
Offline Training: SFT + DPO
Training proceeds in two phases. In the Supervised Fine‑Tuning (SFT) stage, the model learns next‑token prediction on the constructed data, aligning video inputs with reasoning chains and judgments. In the Direct Preference Optimization (DPO) stage, the SFT model generates predictions on the training set; predictions that disagree with human feedback are used as negative examples, while correct predictions serve as positives, refining the model's decision boundary.
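The DPO pair construction described above can be sketched roughly as follows. This is a simplified assumption about the data layout (field names are hypothetical): where the SFT model's prediction contradicts the human‑verified judgment, that prediction becomes the rejected response and the verified judgment the chosen one:

```python
def build_dpo_pairs(samples):
    """Build preference pairs for DPO from SFT-model predictions.

    `samples` is an iterable of dicts with keys:
      prompt      - the video prompt
      prediction  - judgment generated by the SFT model
      human_label - the human/feedback-verified judgment
    """
    pairs = []
    for s in samples:
        if s["prediction"] == s["human_label"]:
            continue  # agreement: no preference signal to extract here
        pairs.append({
            "prompt": s["prompt"],
            "chosen": s["human_label"],   # preferred response
            "rejected": s["prediction"],  # dispreferred response
        })
    return pairs
```

The pairs would then feed a standard DPO trainer, which increases the likelihood margin of chosen over rejected responses relative to the SFT reference model.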
Online Update: Reinforcement Learning from User Feedback (RLUF)
The online loop treats the platform as an environment and KuaiMod as an agent. User actions (reports, dislikes, likes) generate reward signals. Misaligned cases are collected in real time, re‑labeled, and fed back into the training pipeline using the same SFT + DPO process, enabling daily model updates that adapt to emerging low‑quality content.
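A toy sketch of the online loop's feedback handling, under loud assumptions: the reward weighting and the mismatch threshold below are illustrative, not values from the paper. The idea is to turn user actions into a scalar signal and flag videos where the model's judgment contradicts that signal, so they can be re‑labeled and fed back into the SFT + DPO pipeline:

```python
def feedback_reward(reports: int, dislikes: int, likes: int, views: int) -> float:
    """Toy reward from user feedback: reports and dislikes penalize,
    likes reward, normalized by views. Weights are illustrative."""
    if views == 0:
        return 0.0
    return (likes - 2.0 * reports - dislikes) / views

def collect_misaligned(cases, threshold: float = -0.01):
    """Flag cases where the model's judgment disagrees with the
    feedback signal (e.g. judged 'normal' but heavily reported);
    these go back into the daily re-labeling and training loop."""
    flagged = []
    for c in cases:
        r = feedback_reward(c["reports"], c["dislikes"], c["likes"], c["views"])
        judged_low_quality = c["judgment"] != "normal"
        if (r < threshold) != judged_low_quality:
            flagged.append({**c, "reward": r})
    return flagged
```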
Evaluation Results
On the KuaiMod benchmark, the KuaiMod‑7B model achieved 92.4% overall accuracy, surpassing competing methods (RoBERTa, Intern‑VL, GPT‑4o, Perspective API) by up to 10%. Multimodal models consistently outperformed text‑only baselines, highlighting the importance of visual understanding for video quality tasks.
Large‑Scale Deployment
KuaiMod is deployed across Kuaishou's main app, its lite (fast) version, and curated feed scenarios. A/B tests show a reduction of more than 20% in user‑report rates with no loss in active user counts or watch time, along with modest gains in user engagement on the main app.
Future Directions: Three‑Layer Multimodal Strategy
The roadmap consists of:
Foundation Layer: Unified multimodal representation, adapter‑based visual tuning, streaming context modeling, and supervised fine‑tuning.
Advanced Cognition Layer: Retrieval‑augmented generation with knowledge graphs, complex reasoning over actions and emotions, and causal modeling of social signals.
Application Layer: Deployments for video tag structuring, caption generation, interest modeling, e‑commerce recommendation, and comment sentiment analysis.
These stages aim to move from academic prototypes to production‑ready AI capabilities that close the loop between model improvement and business value.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
