How KuaiMM Conversation Revolutionizes Multimodal Dialogue on Short‑Video Platforms

The KuaiMM Conversation project introduces a dialogue system for Kuaishou driven by multimodal large models. It features what the team describes as the world's first short‑video mixed‑dialogue dataset, a Chain‑of‑Thought interaction framework, and large‑scale industrial deployments that improve live‑stream comments and intelligent customer service.

Kuaishou Tech

Jul 16, 2025

Project Overview

Interest in conversational interaction has grown rapidly in recent years, across both human‑computer interaction (HCI) research and intelligent assistants. With the rise of multimodal large language models (MLLMs), dialogue systems are evolving from text‑only exchanges to conversations that integrate image, audio, and video. Kuaishou’s short‑video ecosystem provides massive multimodal data, which makes multimodal conversational AI especially valuable for the platform.

Core Contributions

Multimodal Mixed‑Dialogue Dataset : KwaiChat, described as the first dataset driven by short videos, contains more than 90,000 video samples spanning 4 dialogue types, 13 core topics, 6 content domains, and 30 verticals. It is accompanied by the derived SeriesVideoQA and GODBench datasets.

CoT‑Driven Multimodal Interaction Framework : A Chain‑of‑Thought (CoT) mechanism introduces explicit reasoning chains, boosting cross‑modal knowledge fusion, high‑quality response generation, topic‑drift control, and long‑context modeling.

Industrial Deployment : The technology has been integrated into Kuaishou’s live streaming, intelligent customer service, and other core business scenarios, where the team reports significant performance gains.

Dataset Details

The dataset covers knowledge‑grounded dialogue, chitchat, Q&A, and emotional dialogue, with a fine‑grained task‑label hierarchy. It supports multiple languages (e.g., Portuguese, Indonesian, Spanish) and multi‑video scenarios such as short dramas and continuous live‑stream slices.
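To make the label hierarchy concrete, here is a minimal sketch of how one sample might be represented. The article does not publish the actual schema, so every class and field name below (`DialogueSample`, `video_ids`, `vertical`, and so on) is an assumption chosen for illustration; only the four dialogue types come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class DialogueType(Enum):
    # The four dialogue types named in the article.
    KNOWLEDGE_GROUNDED = "knowledge_grounded"
    CHITCHAT = "chitchat"
    QA = "qa"
    EMOTIONAL = "emotional"


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class DialogueSample:
    # Hypothetical field names; the real schema is not given in the article.
    video_ids: List[str]         # several ids for short dramas or live-stream slices
    dialogue_type: DialogueType
    topic: str                   # one of the core topics
    domain: str                  # one of the content domains
    vertical: str                # one of the verticals
    language: str                # e.g. "pt", "id", "es"
    turns: List[Turn] = field(default_factory=list)

    def is_multi_video(self) -> bool:
        """True when the dialogue is grounded in more than one video."""
        return len(self.video_ids) > 1


def count_by_type(samples: List[DialogueSample]) -> Dict[DialogueType, int]:
    """Count samples per dialogue type, including types with zero samples."""
    counts = {t: 0 for t in DialogueType}
    for sample in samples:
        counts[sample.dialogue_type] += 1
    return counts
```

Keeping `video_ids` as a list lets single‑video and multi‑video samples share one record type, which is one simple way to cover both scenarios the dataset describes.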

Technical Innovations

Dual Chain of Thought (DCoT) : A dual‑chain architecture (Event Chain + Temporal Chain) extracts key events and temporal relations across videos, reducing redundant computation and enhancing multi‑video reasoning.
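The two chains can be illustrated with a small sketch. This is not the DCoT implementation, which the article does not detail; it only shows the general idea of first filtering key events (an event chain) and then linking them in time order across videos (a temporal chain) before asking a model to reason. The `Event` type, the `salience` score, and the prompt wording are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    video_index: int   # position of the video within the series
    start_sec: float   # start time inside that video
    description: str
    salience: float    # hypothetical importance score in [0, 1]


def build_event_chain(events: List[Event], min_salience: float = 0.5) -> List[Event]:
    """Keep only key events, so later steps reason over fewer items."""
    return [e for e in events if e.salience >= min_salience]


def build_temporal_chain(chain: List[Event]) -> List[Tuple[Event, Event]]:
    """Order key events by (video, time) and link each one to its successor."""
    ordered = sorted(chain, key=lambda e: (e.video_index, e.start_sec))
    return list(zip(ordered, ordered[1:]))


def render_prompt(question: str, events: List[Event]) -> str:
    """Assemble both chains into a plain-text reasoning prompt."""
    key_events = build_event_chain(events)
    links = build_temporal_chain(key_events)
    lines = ["Event chain:"]
    for e in key_events:
        lines.append(f"- video {e.video_index}, {e.start_sec:.0f}s: {e.description}")
    lines.append("Temporal chain:")
    for before, after in links:
        lines.append(f"- '{before.description}' happens before '{after.description}'")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Filtering before linking is what keeps the number of temporal relations small: the relations are built only over key events rather than over every frame or caption.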

Ripple‑of‑Thought (RoT) Reply Generation : Structured semantic reasoning produces replies that are accurate, helpful, and engaging, markedly improving comment relevance and interaction.

Contextformer for Multimodal Memory : Models long‑dialogue context by dynamically activating multimodal memory, mitigating information loss and hallucination in extended conversations.
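The idea of activating only the relevant parts of a long multimodal history can be sketched with a generic similarity‑based retrieval step. This is a stand‑in technique, not the Contextformer architecture, whose internals the article does not describe; `MemoryItem`, `activate_memory`, and the toy embeddings are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MemoryItem:
    turn_index: int
    modality: str               # e.g. "text", "image", "video"
    content: str
    embedding: Sequence[float]  # assumed to come from some multimodal encoder


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def activate_memory(query: Sequence[float],
                    memory: List[MemoryItem],
                    top_k: int = 2) -> List[MemoryItem]:
    """Select the top_k items most similar to the query, returned in dialogue order."""
    ranked = sorted(memory, key=lambda m: cosine(query, m.embedding), reverse=True)
    return sorted(ranked[:top_k], key=lambda m: m.turn_index)
```

Returning the selected items in their original turn order keeps the activated context readable as a conversation, while the rest of the history stays out of the model input.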

Business Applications

Live‑Stream Comments : AI‑generated “quick comments” and context‑aware replies increase comment trigger rates and boost e‑commerce conversion.

Intelligent Customer Service : Multimodal knowledge‑aware Q&A handles product‑detail images and videos, delivering high‑information replies and improving resolution rates.

Multimodal Knowledge Q&A : Supports both text and image inputs, enabling accurate answers in complex shopping scenarios and increasing conversion.

Future Plans

While KuaiMM Conversation has achieved strong results, challenges remain for complex merchant queries. The team is building an MMAgent system that leverages large‑model planning and execution to create more intelligent task‑handling pipelines.
