How KuaiMM Conversation Revolutionizes Multimodal Dialogue on Short‑Video Platforms
The KuaiMM Conversation project introduces a dialogue system for Kuaishou driven by multimodal large models. It features what the team describes as the first short‑video mixed‑dialogue dataset, a Chain‑of‑Thought interaction framework, and large‑scale industrial deployments that improve live‑stream comments and intelligent customer service.
Project Overview
Interest in conversational interaction has grown rapidly in recent years across human‑computer interaction (HCI) and intelligent assistants. With the rise of multimodal large language models (MLLMs), dialogue systems are evolving from text‑only exchanges to ones that integrate image, audio, and video. Kuaishou’s short‑video ecosystem generates massive amounts of multimodal data, which makes multimodal conversational AI especially valuable for the platform.
Core Contributions
Multimodal Mixed‑Dialogue Dataset: The first short‑video‑driven dataset (KwaiChat) contains over 90,000 video samples, 4 dialogue types, 13 core topics, 6 content domains, and 30 verticals, plus the derived SeriesVideoQA and GODBench datasets.
CoT‑Driven Multimodal Interaction Framework: A Chain‑of‑Thought (CoT) mechanism introduces explicit reasoning chains, boosting cross‑modal knowledge fusion, high‑quality response generation, topic‑drift control, and long‑context modeling.
Industrial Deployment: The technology has been integrated into Kuaishou live‑stream, intelligent customer service, and other core business scenarios, delivering significant performance gains.
Dataset Details
The dataset covers knowledge‑grounded dialogue, chitchat, Q&A, and emotional dialogue, with a fine‑grained task‑label hierarchy. It supports multiple languages (e.g., Portuguese, Indonesian, Spanish) and multi‑video scenarios such as short dramas and continuous live‑stream slices.
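To make the label hierarchy concrete, here is a minimal sketch of how one dialogue sample might be represented. The class and field names (DialogueSample, Turn, video_ids, and so on) are illustrative assumptions, not the published KwaiChat schema; only the four dialogue types and the topic/domain/vertical levels come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DialogueType(Enum):
    # The four dialogue types named in the dataset description.
    KNOWLEDGE_GROUNDED = "knowledge_grounded"
    CHITCHAT = "chitchat"
    QA = "qa"
    EMOTIONAL = "emotional"


@dataclass
class Turn:
    speaker: str  # hypothetical field: e.g. "user" or "assistant"
    text: str


@dataclass
class DialogueSample:
    # All field names are assumptions made for illustration.
    video_ids: List[str]         # one video, or several for short dramas / live-stream slices
    language: str                # e.g. "pt", "id", "es"
    dialogue_type: DialogueType
    topic: str                   # one of the core topics
    domain: str                  # one of the content domains
    vertical: str                # one of the verticals
    turns: List[Turn] = field(default_factory=list)

    def is_multi_video(self) -> bool:
        """True when the dialogue is grounded in more than one video."""
        return len(self.video_ids) > 1


sample = DialogueSample(
    video_ids=["ep01", "ep02"],
    language="pt",
    dialogue_type=DialogueType.QA,
    topic="plot",
    domain="entertainment",
    vertical="short_drama",
    turns=[Turn("user", "Who sent the letter in episode 2?")],
)
```

Keeping video_ids as a list lets the same record type cover both single‑video and multi‑video dialogues.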
Technical Innovations
Dual Chain of Thought (DCoT): A dual‑chain architecture (Event Chain + Temporal Chain) extracts key events and temporal relations across videos, reducing redundant computation and enhancing multi‑video reasoning.
Ripple‑of‑Thought (RoT) Reply Generation: Structured semantic reasoning produces replies that are accurate, helpful, and engaging, markedly improving comment relevance and interaction.
Contextformer for Multimodal Memory: Models long‑dialogue context by dynamically activating multimodal memory, mitigating information loss and hallucination in extended conversations.
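The dual‑chain idea can be illustrated with a toy sketch. It is not the authors' implementation: it assumes events have already been extracted from each video and scored, and the names Event, salience, build_event_chain, build_temporal_chain, and render_context are hypothetical. The event chain keeps only salient events, and the temporal chain records which kept event precedes which across videos.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    video_index: int   # position of the source video in the series (assumed known)
    start_sec: float   # start time inside that video
    description: str
    salience: float    # assumed importance score in [0, 1] from an upstream model


def build_event_chain(events: List[Event], min_salience: float = 0.5) -> List[Event]:
    """Keep only key events, ordered by video and then by time within the video."""
    kept = [e for e in events if e.salience >= min_salience]
    return sorted(kept, key=lambda e: (e.video_index, e.start_sec))


def build_temporal_chain(event_chain: List[Event]) -> List[Tuple[Event, Event]]:
    """Pair each key event with the next one, giving 'happens before' relations."""
    return list(zip(event_chain, event_chain[1:]))


def render_context(events: List[Event], min_salience: float = 0.5) -> str:
    """Turn both chains into a compact text context for a downstream model."""
    chain = build_event_chain(events, min_salience)
    lines = ["Key events:"]
    lines += [f"- [video {e.video_index} @ {e.start_sec:.0f}s] {e.description}" for e in chain]
    lines.append("Temporal relations:")
    lines += [f'- "{a.description}" happens before "{b.description}"'
              for a, b in build_temporal_chain(chain)]
    return "\n".join(lines)


events = [
    Event(1, 5.0, "B opens the box", 0.9),
    Event(0, 12.0, "A mails a box", 0.8),
    Event(0, 3.0, "background music", 0.1),
]
context = render_context(events)
```

Dropping low‑salience events before building the temporal chain is one plausible way to get the reduction in redundant computation described above, since the downstream model sees a short structured summary in place of every frame‑level detail.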
Business Applications
Live‑Stream Comments: AI‑generated “quick comments” and context‑aware replies increase comment trigger rates and boost e‑commerce conversion.
Intelligent Customer Service: Multimodal knowledge‑aware Q&A handles product‑detail images and videos, delivering high‑information replies and improving resolution rates.
Multimodal Knowledge Q&A: Supports both text and image inputs, enabling accurate answers in complex shopping scenarios and increasing conversion.
Future Plans
While KuaiMM Conversation has achieved strong results, challenges remain for complex merchant queries. The team is building an MMAgent system that leverages large‑model planning and execution to create more intelligent task‑handling pipelines.
Related Papers
NAACL 2025 Findings
arXiv 2504.21435
arXiv 2505.11436
arXiv 2505.23121
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
