How KuaiMM Conversation Revolutionizes Multimodal Dialogue on Short‑Video Platforms

The KuaiMM Conversation project introduces a dialogue system for Kuaishou driven by multimodal large models. It features what the team describes as the world's first short‑video mixed‑dialogue dataset, a Chain‑of‑Thought interaction framework, and large‑scale industrial deployments that the team reports improve live‑stream comments and intelligent customer service.

Kuaishou Tech

Project Overview

Recent years have seen rapidly growing interest in conversational interaction technologies across human‑computer interaction (HCI) and intelligent assistants. With the rise of multimodal large language models (MLLMs), dialogue systems are evolving from text‑only exchanges to ones that integrate image, audio, and video. Kuaishou’s short‑video ecosystem provides massive amounts of multimodal data, which makes multimodal conversational AI especially valuable for the platform.

Core Contributions

Multimodal Mixed‑Dialogue Dataset: KwaiChat, described as the first short‑video‑driven dialogue dataset, contains over 90,000 video samples spanning 4 dialogue types, 13 core topics, 6 content domains, and 30 verticals. Two derived datasets, SeriesVideoQA and GODBench, accompany it.

CoT‑Driven Multimodal Interaction Framework: A Chain‑of‑Thought (CoT) mechanism introduces explicit reasoning chains to improve cross‑modal knowledge fusion, response quality, topic‑drift control, and long‑context modeling.

Industrial Deployment: The technology has been integrated into Kuaishou live‑streaming, intelligent customer service, and other core business scenarios, where the team reports significant performance gains.

Dataset Details

The dataset covers knowledge‑grounded dialogue, chitchat, Q&A, and emotional dialogue, with a fine‑grained task‑label hierarchy. It supports multiple languages (e.g., Portuguese, Indonesian, Spanish) and multi‑video scenarios such as short dramas and continuous live‑stream slices.
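To make the label hierarchy described above concrete, here is a minimal sketch of how one annotated sample might be represented. The four dialogue types come from the article; every class name, field name, and example value below is a hypothetical illustration and does not reflect the released KwaiChat schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DialogueType(Enum):
    # The four dialogue types named in the article.
    KNOWLEDGE_GROUNDED = "knowledge_grounded"
    CHITCHAT = "chitchat"
    QA = "qa"
    EMOTIONAL = "emotional"


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class DialogueSample:
    # Hypothetical record layout: one id for a single video, several ids
    # for multi-video cases such as short dramas or live-stream slices.
    video_ids: List[str]
    language: str
    dialogue_type: DialogueType
    topic: str
    domain: str
    vertical: str
    turns: List[Turn] = field(default_factory=list)

    @property
    def is_multi_video(self) -> bool:
        return len(self.video_ids) > 1


def filter_samples(
    samples: List[DialogueSample],
    dialogue_type: DialogueType,
    language: Optional[str] = None,
) -> List[DialogueSample]:
    """Select samples by dialogue type and, optionally, by language code."""
    return [
        s
        for s in samples
        if s.dialogue_type is dialogue_type
        and (language is None or s.language == language)
    ]


# Made-up example records, for illustration only.
samples = [
    DialogueSample(["v1"], "pt", DialogueType.CHITCHAT, "food", "lifestyle", "cooking",
                   [Turn("user", "Que receita boa!")]),
    DialogueSample(["v2a", "v2b"], "es", DialogueType.QA, "plot", "entertainment", "short_drama",
                   [Turn("user", "¿Qué pasó en el episodio anterior?")]),
    DialogueSample(["v3"], "id", DialogueType.QA, "price", "e-commerce", "electronics"),
]

spanish_qa = filter_samples(samples, DialogueType.QA, language="es")
```

Keeping `video_ids` as a list lets single‑video and multi‑video samples share one record type, which is one simple way to accommodate the short‑drama and live‑stream‑slice scenarios mentioned above.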

Technical Innovations

Dual Chain of Thought (DCoT): A dual‑chain architecture (Event Chain + Temporal Chain) extracts key events and temporal relations across videos, reducing redundant computation and enhancing multi‑video reasoning.

Ripple‑of‑Thought (RoT) Reply Generation: Structured semantic reasoning produces replies that are accurate, helpful, and engaging, markedly improving comment relevance and interaction.

Contextformer for Multimodal Memory: Models long‑dialogue context by dynamically activating multimodal memory, mitigating information loss and hallucination in extended conversations.
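The article does not describe how Contextformer activates memory internally. As a rough intuition for "dynamically activating" only the relevant parts of a long multimodal history, the sketch below swaps in a generic similarity‑based retrieval: each past turn is stored with an embedding, and the turns most similar to the current query are selected. The `MemoryItem` class, the `activate_memory` function, and the toy 2‑dimensional embeddings are all assumptions made for illustration, not the published method.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MemoryItem:
    turn_index: int
    modality: str              # e.g. "text", "image", "video"
    content: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def activate_memory(
    query_embedding: Sequence[float],
    memory: List[MemoryItem],
    top_k: int = 2,
    min_score: float = 0.0,
) -> List[MemoryItem]:
    """Return up to top_k memory items most similar to the query.

    Items scoring below min_score are dropped. The result is re-sorted
    by turn index so the selected context keeps its dialogue order.
    """
    scored = [(cosine(query_embedding, m.embedding), m) for m in memory]
    kept = [(score, m) for score, m in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    selected = [m for _, m in kept[:top_k]]
    return sorted(selected, key=lambda m: m.turn_index)


# Toy history with hand-written 2-d embeddings (illustrative values only).
history = [
    MemoryItem(0, "image", "product photo of red sneakers", [1.0, 0.0]),
    MemoryItem(1, "text", "user asks about shipping time", [0.0, 1.0]),
    MemoryItem(2, "video", "clip showing the sneakers worn outdoors", [0.9, 0.1]),
]

active = activate_memory([1.0, 0.0], history, top_k=2)
```

Passing only the selected items to the model, rather than the full history, is one common way to limit context length in long conversations; whether Contextformer works this way is not stated in the source.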

Business Applications

Live‑Stream Comments: AI‑generated “quick comments” and context‑aware replies increase comment trigger rates and boost e‑commerce conversion.

Intelligent Customer Service: Multimodal knowledge‑aware Q&A handles product‑detail images and videos, delivering high‑information replies and improving resolution rates.

Multimodal Knowledge Q&A: Supports both text and image inputs, enabling accurate answers in complex shopping scenarios and increasing conversion.

Future Plans

While KuaiMM Conversation has achieved strong results, challenges remain for complex merchant queries. The team is building an MMAgent system that leverages large‑model planning and execution to create more intelligent task‑handling pipelines.

Related Papers

NAACL 2025 Findings

arXiv 2504.21435

arXiv 2505.11436

arXiv 2505.23121

[Figure: Multimodal Interaction Overview]