How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality
This article analyzes KuaiMod, a multimodal large‑model solution developed by Kuaishou for short‑video content quality assessment, detailing its benchmark dataset, chain‑of‑thought data construction, offline SFT + DPO training, online reinforcement‑learning updates, evaluation results, and large‑scale deployment impact.
Background and Motivation
Short‑video platforms now serve billions of daily users, making content moderation and quality‑aware recommendation critical challenges. Traditional rule‑based or static language‑model approaches struggle to keep pace with the rapidly evolving nature of low‑quality content, resulting in high labeling costs and poor accuracy.
KuaiMod Benchmark and Dataset
Kuaishou built the first short‑video content‑quality benchmark, collecting 1,000 real videos from its platform and labeling them across four major and fifteen fine‑grained low‑quality categories. The dataset is fully human‑annotated, cleaned, and publicly released for research.
Model Architecture
KuaiMod uses Kuaishou’s YuanQi multimodal foundation model as the base. The model processes video metadata (title, cover, frames, OCR/ASR text, comments) and generates quality judgments via chain‑of‑thought reasoning.
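The article does not publish the exact input format, but the metadata assembly can be sketched as follows. This is a minimal illustration with hypothetical field names; the title, OCR/ASR text, and comments are flattened into a text prompt, while the cover and frames would go to the model's vision encoder separately:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoMeta:
    """Hypothetical container for the metadata KuaiMod consumes."""
    title: str
    ocr_text: str                 # text extracted from frames
    asr_text: str                 # speech transcript
    comments: List[str] = field(default_factory=list)

def build_prompt(meta: VideoMeta, max_comments: int = 5) -> str:
    """Flatten video metadata into a single text prompt; cover and
    frames are passed to the vision encoder, not shown here."""
    parts = [
        f"Title: {meta.title}",
        f"OCR: {meta.ocr_text}",
        f"ASR: {meta.asr_text}",
        "Top comments:",
        *[f"- {c}" for c in meta.comments[:max_comments]],
        "Task: assess whether this video is low quality and explain why.",
    ]
    return "\n".join(parts)
```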
Chain‑of‑Thought Data Construction
Tag2CoT: For each video, the multimodal model receives the video data and the human‑assigned low‑quality tag, then produces a detailed reasoning chain that explains the tag.
CoT2Tag: The reasoning chain is structured into five stages—content extraction, analysis, intermediate check, user‑feedback analysis, and final judgment—providing a systematic format for training.
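The two stages above can be sketched as a small data‑construction pipeline. `generate` stands in for a call to the multimodal model, and the prompts are illustrative assumptions, not the paper's actual templates:

```python
# The five fixed reasoning stages named in the CoT2Tag step.
STAGES = [
    "content extraction",
    "analysis",
    "intermediate check",
    "user-feedback analysis",
    "final judgment",
]

def tag2cot(generate, video_prompt: str, human_tag: str) -> str:
    """Tag2CoT: given the video and its human-assigned tag, ask the
    model for a reasoning chain that justifies the tag."""
    return generate(
        f"{video_prompt}\nThis video was labeled '{human_tag}'. "
        "Explain step by step why this label applies."
    )

def cot2tag(generate, video_prompt: str, rationale: str) -> dict:
    """CoT2Tag: restructure the free-form rationale into the five
    fixed stages used as the training format."""
    return {
        stage: generate(
            f"{video_prompt}\nRationale: {rationale}\n"
            f"Rewrite the part of the rationale covering: {stage}."
        )
        for stage in STAGES
    }
```

The resulting structured rationales become the targets for supervised fine‑tuning described next.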
Offline Training: SFT + DPO
Training proceeds in two phases. In the Supervised Fine‑Tuning (SFT) stage, the model learns next‑token prediction on the constructed data, aligning video inputs with reasoning chains and judgments. In the Direct Preference Optimization (DPO) stage, the SFT model generates predictions on the training set; predictions that disagree with human feedback are used as negative examples, while correct predictions serve as positives, refining the model's decision boundary.
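The DPO pair construction described above can be sketched roughly as follows. This is a simplified assumption about the data layout (field names are hypothetical): where the SFT model's prediction contradicts the human‑verified judgment, that prediction becomes the rejected response and the verified judgment the chosen one:

```python
def build_dpo_pairs(samples):
    """Build preference pairs for DPO from SFT-model predictions.

    `samples` is an iterable of dicts with keys:
      prompt      - the video prompt
      prediction  - judgment generated by the SFT model
      human_label - the human/feedback-verified judgment
    """
    pairs = []
    for s in samples:
        if s["prediction"] == s["human_label"]:
            continue  # agreement: no preference signal to extract here
        pairs.append({
            "prompt": s["prompt"],
            "chosen": s["human_label"],   # preferred response
            "rejected": s["prediction"],  # dispreferred response
        })
    return pairs
```

The pairs would then feed a standard DPO trainer, which increases the likelihood margin of chosen over rejected responses relative to the SFT reference model.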
Online Update: Reinforcement Learning from User Feedback (RLUF)
The online loop treats the platform as an environment and KuaiMod as an agent. User actions (reports, dislikes, likes) generate reward signals. Misaligned cases are collected in real time, re‑labeled, and fed back into the training pipeline using the same SFT + DPO process, enabling daily model updates that adapt to emerging low‑quality content.
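A toy sketch of the online loop's feedback handling, under loud assumptions: the reward weighting and the mismatch threshold below are illustrative, not values from the paper. The idea is to turn user actions into a scalar signal and flag videos where the model's judgment contradicts that signal, so they can be re‑labeled and fed back into the SFT + DPO pipeline:

```python
def feedback_reward(reports: int, dislikes: int, likes: int, views: int) -> float:
    """Toy reward from user feedback: reports and dislikes penalize,
    likes reward, normalized by views. Weights are illustrative."""
    if views == 0:
        return 0.0
    return (likes - 2.0 * reports - dislikes) / views

def collect_misaligned(cases, threshold: float = -0.01):
    """Flag cases where the model's judgment disagrees with the
    feedback signal (e.g. judged 'normal' but heavily reported);
    these go back into the daily re-labeling and training loop."""
    flagged = []
    for c in cases:
        r = feedback_reward(c["reports"], c["dislikes"], c["likes"], c["views"])
        judged_low_quality = c["judgment"] != "normal"
        if (r < threshold) != judged_low_quality:
            flagged.append({**c, "reward": r})
    return flagged
```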
Evaluation Results
On the KuaiMod benchmark, the KuaiMod‑7B model achieved 92.4% overall accuracy, surpassing competing methods (RoBERTa, Intern‑VL, GPT‑4o, Perspective API) by up to 10%. Multimodal models consistently outperformed text‑only baselines, highlighting the importance of visual understanding for video quality tasks.
Large‑Scale Deployment
KuaiMod is deployed across Kuaishou's main app, its lite (fast) version, and curated feed scenarios. A/B tests show a reduction of more than 20% in user‑report rates with no loss in active user counts or watch time, along with modest gains in user engagement on the main app.
Future Directions: Three‑Layer Multimodal Strategy
The roadmap consists of:
Foundation Layer: Unified multimodal representation, adapter‑based visual tuning, streaming context modeling, and supervised fine‑tuning.
Advanced Cognition Layer: Retrieval‑augmented generation with knowledge graphs, complex reasoning over actions and emotions, and causal modeling of social signals.
Application Layer: Deployments for video tag structuring, caption generation, interest modeling, e‑commerce recommendation, and comment sentiment analysis.
These stages aim to move from academic prototypes to production‑ready AI capabilities that close the loop between model improvement and business value.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
