
Kuaishou Showcases 12 Cutting-Edge CVPR 2025 Papers on Video Generation and AI

Kuaishou presented twelve peer‑reviewed papers at CVPR 2025 covering video quality assessment, large‑scale video datasets, dynamic 3D avatar reconstruction, 4D scene simulation, controllable video generation, scaling laws for diffusion transformers, multimodal foundation models, and more, highlighting the company's leading research in computer vision and AI.


CVPR (IEEE Conference on Computer Vision and Pattern Recognition) is one of the top international conferences in computer vision and pattern recognition. CVPR 2025 was held from June 11 to June 15 in Nashville, Tennessee, USA, receiving 13,008 valid paper submissions and accepting 2,878, for an overall acceptance rate of about 22.1%.

Kuaishou contributed twelve of the accepted papers, covering video quality assessment, multimodal dataset construction and benchmarking, dynamic 3D avatar reconstruction, dynamic 4D scene simulation, video generation and enhancement, and controllable video generation, among other directions.

On June 11 (U.S. time), Wan Pengfei, head of the Visual Generation and Interaction Center of Kuaishou's Keling AI Division, gave an invited talk titled “From Video Generation to World Models” at a CVPR tutorial session, presenting the latest breakthroughs in Kuaishou's large‑scale video generation models.


Paper 01: Koala‑36M – A Large‑scale Video Dataset Improving Consistency between Fine‑grained Conditions and Video Content

Project: https://koala36m.github.io/

Paper: https://arxiv.org/pdf/2410.08260

Abstract: As visual generation technology advances, video datasets grow exponentially, and dataset quality is crucial for model performance. Koala‑36M introduces a large‑scale, high‑quality video dataset with accurate temporal segmentation, detailed captions, and superior video quality. By using a linear classifier for probability‑distribution analysis, the method improves transition detection accuracy, provides structured captions (averaging about 200 words per clip), and introduces a Video Training Suitability Score (VTSS) to filter high‑quality videos. Experiments show the pipeline significantly enhances dataset quality and validates Koala‑36M's superiority.
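
To make the filtering idea concrete, here is a minimal sketch of threshold‑based selection with a fused quality score. The sub‑metrics, weights, and threshold are illustrative placeholders; the actual VTSS is produced by a trained scoring model.

```python
import numpy as np

# Hypothetical per-clip sub-metric scores (e.g. clarity, aesthetics,
# motion); the real VTSS inputs and scoring model differ.
rng = np.random.default_rng(0)
sub_scores = rng.uniform(0.0, 1.0, size=(1000, 3))  # 1000 clips, 3 metrics

# Fuse sub-metrics into one suitability score with assumed weights.
weights = np.array([0.5, 0.3, 0.2])
vtss = sub_scores @ weights

# Keep only clips whose score clears an assumed threshold.
threshold = 0.6
kept = np.flatnonzero(vtss >= threshold)
print(f"kept {kept.size} / {len(vtss)} clips")
```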

Paper 02: KVQ – Boosting Video Quality Assessment via Saliency‑guided Local Perception

Paper: https://arxiv.org/abs/2503.10259

Abstract: Video Quality Assessment (VQA) predicts perceived video quality, but local quality varies due to motion blur and specific distortions. KVQ introduces a saliency‑guided local perception framework that extracts visual saliency with window attention and applies local perception constraints to reduce reliance on neighboring information. It outperforms state‑of‑the‑art methods on five VQA benchmarks and includes a new region‑level annotated dataset (LPVQ) for evaluating local perception.
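
A rough illustration of the core intuition, that salient regions should dominate the frame‑level score: pool hypothetical per‑window quality predictions with saliency‑derived weights. The grid size, score ranges, and weighting scheme below are assumptions; in the paper, saliency is extracted inside the network via window attention.

```python
import numpy as np

# Hypothetical per-window quality predictions and saliency weights for
# one frame split into an 8x8 grid of windows.
rng = np.random.default_rng(1)
local_quality = rng.uniform(1.0, 5.0, size=(8, 8))  # MOS-like scores
saliency = rng.uniform(0.0, 1.0, size=(8, 8))

# Salient regions should dominate perceived quality, so pool local
# scores with saliency-normalized weights instead of a plain average.
w = saliency / saliency.sum()
frame_quality = float((w * local_quality).sum())
print(f"saliency-weighted frame quality: {frame_quality:.2f}")
```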

Paper 03: StyleMaster – Stylize Your Video with Artistic Generation and Translation

Paper: https://arxiv.org/pdf/2412.07744

Abstract: Existing video stylization methods often leak content from the style reference and capture the target style only coarsely. StyleMaster emphasizes fine‑grained style extraction: it filters out content‑related image patches while preserving style patches, and uses contrastive learning on a synthetic paired‑style dataset to enhance global style consistency. A lightweight motion adapter trained on static videos bridges the image‑to‑video gap, enabling high‑quality, temporally consistent stylized videos that surpass competing methods.
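
The patch‑filtering step can be sketched as dropping the patches most similar to a content embedding and keeping the rest as style cues. The embeddings, similarity measure, and 70% keep ratio below are placeholders, not the paper's actual settings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between rows of a and a single vector b."""
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

# Hypothetical patch embeddings of a style image plus an embedding of
# its content description (e.g. from a CLIP-like encoder).
rng = np.random.default_rng(2)
patch_emb = rng.normal(size=(196, 512))  # 14x14 patches
content_emb = rng.normal(size=512)

# Patches most similar to the content embedding carry subject
# information; drop them and keep the rest as style cues.
sim = cosine(patch_emb, content_emb)
keep = np.argsort(sim)[: int(0.7 * len(sim))]  # least content-like 70%
style_patches = patch_emb[keep]
print(style_patches.shape)  # (137, 512)
```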

Paper 04: Towards Precise Scaling Laws for Video Diffusion Transformers

Paper: https://arxiv.org/pdf/2411.17470

Abstract: Training video diffusion transformers is costly; accurate scaling laws are needed to predict optimal model size and hyper‑parameters under limited budgets. This work systematically analyzes scaling laws for video diffusion transformers, revealing higher sensitivity to learning rate and batch size compared to language models. A new scaling law predicts optimal hyper‑parameters for any model size and compute budget, achieving a 40.1% reduction in inference cost while maintaining performance.
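
The core mechanics of a scaling law, reduced to one variable: fit a power law to (model size, best learning rate) pairs from small pilot runs and extrapolate. The data points, functional form, and target size are invented for illustration; the paper's law is richer and jointly covers batch size and compute budget.

```python
import numpy as np

# Hypothetical (model size, best learning rate) pairs from small pilot
# runs; the paper's law also involves batch size and compute budget.
params = np.array([5e7, 1e8, 3e8, 6e8])        # model sizes N
best_lr = np.array([3e-4, 2.2e-4, 1.4e-4, 1.1e-4])

# Fit best_lr ~ a * N**b by linear regression in log-log space.
b, log_a = np.polyfit(np.log(params), np.log(best_lr), 1)
a = np.exp(log_a)

# Extrapolate the tuned learning rate to a larger, untrained model.
n_target = 3e9
print(f"predicted lr at {n_target:.0e} params: {a * n_target ** b:.2e}")
```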

Paper 05: Unleashing the Potential of Multi‑modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Paper: https://arxiv.org/pdf/2411.14423

Abstract: To simulate realistic 4D dynamic scenes, PhysFlow combines multimodal foundation models with video diffusion. It identifies material types via multimodal models, initializes parameters through image queries, and predicts 3D Gaussian splats for fine‑grained scene representation. Differentiable MPM and flow‑guided video diffusion optimize material parameters without relying on rendering or SDS losses, achieving accurate dynamic interaction modeling.
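
The optimization loop can be illustrated with a toy differentiable simulator: recover a material parameter by gradient descent on a trajectory‑matching loss. The damped spring, finite‑difference gradients, and learning rate are stand‑ins; PhysFlow uses differentiable MPM, with guidance from video diffusion rather than a ground‑truth trajectory.

```python
import numpy as np

def simulate(stiffness, steps=100, dt=0.01):
    """Toy damped spring standing in for a differentiable simulator."""
    x, v, xs = 1.0, 0.0, []
    for _ in range(steps):
        v += (-stiffness * x - 0.5 * v) * dt
        x += v * dt
        xs.append(x)
    return np.array(xs)

# "Observed" trajectory the material parameter should reproduce; in the
# paper the guidance signal comes from flow-guided video diffusion.
target = simulate(4.0)

# Gradient descent with finite-difference gradients (differentiable MPM
# would supply analytic gradients instead).
k, lr, eps = 1.0, 2.0, 1e-4
for _ in range(200):
    loss = np.mean((simulate(k) - target) ** 2)
    grad = (np.mean((simulate(k + eps) - target) ** 2) - loss) / eps
    k -= lr * grad
print(f"recovered stiffness: {k:.2f} (true value 4.0)")
```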

Paper 06: CoMM – A Coherent Interleaved Image‑Text Dataset for Multimodal Understanding and Generation

Paper: https://arxiv.org/abs/2406.10462

Abstract: Interleaved image‑text generation is a key multimodal task, yet existing datasets lack narrative coherence and style consistency. CoMM curates high‑quality interleaved data from diverse sources, focusing on educational and visual storytelling content. A multi‑view filtering pipeline using pretrained models ensures textual progression, image‑text alignment, and semantic consistency. Experiments show CoMM markedly improves multimodal LLMs’ few‑shot performance on coherence, consistency, and alignment tasks.
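
The multi‑view filtering pipeline reduces, in spirit, to a chain of quality predicates that every sample must pass. The predicates and thresholds below are invented; in CoMM each judgment comes from a pretrained model.

```python
# Invented predicate filters; in CoMM each judgment comes from a
# pretrained model (text quality, image-text alignment, and so on).
def text_coherent(doc):
    return doc["text_score"] >= 0.7

def images_aligned(doc):
    return doc["align_score"] >= 0.6

filters = [text_coherent, images_aligned]

docs = [
    {"id": 1, "text_score": 0.9, "align_score": 0.8},
    {"id": 2, "text_score": 0.5, "align_score": 0.9},
]
kept = [d for d in docs if all(f(d) for f in filters)]
print([d["id"] for d in kept])  # -> [1]
```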

Paper 07: Libra‑Merging – Importance‑Redundancy and Pruning‑Merging Trade‑off for Acceleration Plug‑in in Large Vision‑Language Models

Paper: https://cvpr.thecvf.com/virtual/2025/poster/34817

Abstract: Large vision‑language models (LVLMs) are computationally expensive. Libra‑Merging introduces a position‑driven token importance‑redundancy balance and an importance‑guided grouping‑merge strategy, reducing FLOPs to 37% of the original with negligible performance loss. It also cuts GPU training time by 57% on video understanding tasks, serving as a plug‑and‑play acceleration module for various LVLMs.
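
A minimal sketch of the keep‑then‑merge idea: retain the most important tokens as anchors and fold each remaining token into its most similar anchor. The importance scores, anchor count, and mean‑pooling merge are assumptions; the paper's position‑driven scoring and grouping‑merge strategy differ in detail.

```python
import numpy as np

rng = np.random.default_rng(3)
tokens = rng.normal(size=(256, 64))   # visual tokens entering a layer
importance = rng.uniform(size=256)    # hypothetical importance scores

# Keep the most important tokens as anchors...
k = 64
order = np.argsort(importance)[::-1]
anchors, rest = order[:k], order[k:]

# ...and fold each remaining token into its most similar anchor
# (mean pooling is an assumption; the paper's merge rule differs).
unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
assign = np.argmax(unit[rest] @ unit[anchors].T, axis=1)
merged = tokens[anchors].copy()
for a in range(k):
    group = rest[assign == a]
    if group.size:
        merged[a] = np.vstack([tokens[anchors[a]][None], tokens[group]]).mean(axis=0)
print(merged.shape)  # 256 tokens compressed to (64, 64)
```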

Paper 08: GPAvatar – High‑fidelity Head Avatars by Learning Efficient Gaussian Projections

Paper: https://openaccess.thecvf.com/.../GPAvatar_CVPR_2025_paper.pdf

Abstract: Existing radiance‑field avatar methods rely on explicit priors or heavy neural implicit representations, limiting efficiency. GPAvatar learns linear projections from high‑dimensional Gaussians to 3D space, enabling point‑based rasterization without heavy networks. An adaptive densification strategy allocates more Gaussians to high‑motion regions, improving facial detail. Experiments show superior rendering quality, speed, and memory usage.
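
The projection idea itself is simple to sketch: map high‑dimensional per‑Gaussian features to renderable 3D attributes with learned linear layers. The dimensions, random weights, and activation choices below are placeholders; in the method these projections are trained end‑to‑end through rasterization.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10_000, 128
latent = rng.normal(size=(n, d))  # high-dimensional per-Gaussian features

# One linear projection per attribute; random weights stand in for
# parameters the method would learn from rendering losses.
w_pos, w_scale, w_color = (rng.normal(size=(d, 3)) * 0.01 for _ in range(3))

positions = latent @ w_pos                          # 3D means
scales = np.exp(latent @ w_scale)                   # positive scales
colors = 1.0 / (1.0 + np.exp(-(latent @ w_color)))  # RGB in (0, 1)
print(positions.shape, scales.shape, colors.shape)
```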

Paper 09: PatchVSR – Breaking Video Diffusion Resolution Limits with Patch‑wise Video Super‑Resolution

Paper: https://openaccess.thecvf.com/.../PatchVSR_CVPR_2025_paper.pdf

Abstract: Video diffusion models excel at generation but face high computation and fixed resolution limits for video super‑resolution (VSR). PatchVSR introduces a dual‑stream adapter that processes local patches and global context, combined with patch position encoding and multi‑patch joint modulation. It achieves competitive 4K VSR from a 512×512 base model with significant efficiency gains.
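
Tiling is the easiest part to sketch: split a frame into base‑resolution patches, super‑resolve each, and stitch the results. The nearest‑neighbor upsampler stands in for the diffusion base model, and this sketch omits the global‑context stream, position encoding, and joint modulation that make PatchVSR's seams coherent.

```python
import numpy as np

def upscale(patch, s=4):
    """Nearest-neighbor x4 stand-in for the 512x512 diffusion model."""
    return patch.repeat(s, axis=0).repeat(s, axis=1)

frame = np.random.rand(1024, 1024, 3).astype(np.float32)
p = 512  # base-model working resolution

# Super-resolve patch by patch, then stitch the tiles back together.
rows = []
for y in range(0, frame.shape[0], p):
    row = [upscale(frame[y:y + p, x:x + p]) for x in range(0, frame.shape[1], p)]
    rows.append(np.concatenate(row, axis=1))
sr = np.concatenate(rows, axis=0)
print(sr.shape)  # (4096, 4096, 3)
```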

Paper 10: SeriesBench – A Benchmark for Narrative‑Driven Drama Series Understanding

Paper: https://stan-lei.github.io/KwaiMM-Dialogue/paper2-seriesbench.html

Abstract: Existing VideoQA tasks focus on isolated clips. SeriesBench introduces 105 narrative‑driven series videos covering 28 professional tasks that require deep story understanding. The proposed PC‑DCoT framework achieves significant gains, highlighting the need for multi‑video, narrative reasoning.

Paper 11: SketchVideo – Sketch‑based Video Generation and Editing

Project: http://geometrylearning.com/SketchVideo/

Paper: https://arxiv.org/pdf/2503.23284

Abstract: SketchVideo enables spatial and motion control of video generation via user‑drawn sketches on one or two keyframes. A cross‑frame attention module propagates sparse sketch conditions to all frames, while a video insertion module ensures seamless editing. Experiments demonstrate superior controllability and quality.
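
How sparse keyframe conditions can reach every frame is easiest to see as attention: each frame queries the sketched keyframes and receives a blended condition. The feature sizes and random values are placeholders; the paper's cross‑frame attention module is integrated into the generation network itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
T, d = 16, 64
frame_feats = rng.normal(size=(T, d))  # queries: every video frame
key_feats = rng.normal(size=(2, d))    # keys/values: 2 sketched keyframes

# Each frame attends to the sparse keyframe conditions, spreading the
# sketch signal through the whole clip.
attn = softmax(frame_feats @ key_feats.T / np.sqrt(d))
propagated = attn @ key_feats          # per-frame condition, shape (T, d)
print(propagated.shape)
```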

Paper 12: STDD – Spatio‑Temporal Dual Diffusion for Video Generation

Paper: https://cvpr.thecvf.com/virtual/2025/poster/35022

Abstract: Existing video diffusion models focus on spatial diffusion only. STDD proposes an explicit spatio‑temporal dual diffusion process that jointly propagates information across space and time. The analytically tractable formulation enables accelerated sampling and improves temporal consistency, achieving state‑of‑the‑art results on video generation, prediction, and text‑to‑video tasks.
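
One illustrative forward step of a joint space‑time corruption: Gaussian noise in space combined with a smoothing that leaks information between neighboring frames. The mixing coefficients and temporal kernel are invented for intuition only and are not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(6)
T, H, W = 8, 32, 32
video = rng.random((T, H, W)).astype(np.float32)

# Gaussian noise in space combined with a temporal smoothing that leaks
# information between neighboring frames (coefficients are invented).
alpha, beta = 0.98, 0.1
noise = rng.normal(size=video.shape).astype(np.float32)
temporal_mix = (np.roll(video, 1, axis=0) + np.roll(video, -1, axis=0)) / 2
x_t = alpha * ((1 - beta) * video + beta * temporal_mix) + np.sqrt(1 - alpha**2) * noise
print(x_t.shape)  # (8, 32, 32)
```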

Tags: computer vision, deep learning, video generation, multimodal, AI research, CVPR 2025
Written by Kuaishou Audio & Video Technology