Kuaishou's Accepted Papers at ICLR 2025 and Their Summaries
The article highlights Kuaishou's eleven papers accepted at ICLR 2025, covering advances in streaming video understanding, 3D trajectory control, multimodal talking‑face animation, transformer indexing, efficient video generation, industrial recommendation datasets, token gradient conflict in MoE, stable segmentation, multi‑camera video synthesis, large‑scale multimodal instruction tuning, and hallucination detection in retrieval‑augmented generation.
ICLR (International Conference on Learning Representations) is a top AI conference focusing on deep learning and representation learning. The 2025 edition will be held in Singapore from April 24‑28; it received 11,672 submissions and accepted 3,706 papers (a 31.75% acceptance rate).
Through sustained research in deep learning, Kuaishou has had eleven academic papers accepted at ICLR 2025, spanning large‑scale vision‑language models, controllable video generation, 3D face animation, and more, demonstrating the company's international competitiveness in AI research.
Paper 01: SVBench: A Benchmark with Temporal Multi‑Turn Dialogues for Streaming Video Understanding
Project URL: https://github.com/yzy-bupt/SVBench
Abstract: Existing LVLM benchmarks focus on isolated text inputs and lack evaluation for continuous temporal reasoning over streaming video. SVBench introduces a temporal multi‑turn dialogue benchmark built from 1,353 streaming videos (49,979 Q&A pairs) to assess LVLMs on long‑context video understanding. Experiments on 14 models show GPT‑4o performs best, while most open‑source LVLMs struggle. The proposed StreamingChat model outperforms open‑source LVLMs on SVBench and achieves comparable performance on other vision‑language benchmarks.
Paper 02: 3DTrajMaster: Mastering 3D Trajectory for Multi‑Entity Motion in Video Generation
Project URL: https://github.com/KwaiVGI/3DTrajMaster
Abstract: Current controllable video generation methods rely on 2D control signals that cannot fully express 3D motion. 3DTrajMaster proposes a plug‑and‑play 3D motion injection module using gated self‑attention to fuse multi‑entity trajectories, along with a domain adapter and annealed sampling to preserve diffusion model priors. A new 360‑Motion dataset is constructed, and extensive experiments demonstrate state‑of‑the‑art accuracy and generalization for multi‑entity 3D motion control.
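To make the injection idea concrete, here is a minimal numpy sketch of gating a self‑attention update over concatenated video and trajectory tokens. All names and the identity projections are hypothetical simplifications; the actual 3DTrajMaster module uses learned projections inside a diffusion Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(video_tokens, traj_tokens, gate_logit):
    """Fuse per-entity trajectory tokens into video tokens via self-attention
    over the concatenated sequence, scaled by a gate that is initialized near
    zero so the pretrained diffusion prior is preserved at the start."""
    x = np.concatenate([video_tokens, traj_tokens], axis=0)  # (Nv+Nt, d)
    d = x.shape[-1]
    # Identity Q/K/V projections for brevity; a real module learns them.
    attn = softmax(x @ x.T / np.sqrt(d), axis=-1) @ x
    gate = np.tanh(gate_logit)  # scalar tanh-squashed gate
    n_video = video_tokens.shape[0]
    # Only video-token positions are updated; trajectory tokens are dropped.
    return video_tokens + gate * attn[:n_video]

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))  # 8 video tokens, dim 16
traj = rng.standard_normal((2, 16))   # 2 entity-trajectory tokens
out = gated_self_attention(video, traj, gate_logit=0.0)
# With gate_logit = 0, tanh(0) = 0, so the output equals the input tokens.
print(np.allclose(out, video))  # True
```

Starting the gate at zero is what lets the plug‑in train without disturbing the pretrained model's behavior on day one.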
Paper 03: Cafe‑Talk: Generating 3D Talking Face Animation with Multimodal Coarse‑ and Fine‑grained Control
Project URL: https://harryxd2018.github.io/cafe-talk/
Abstract: Existing speech‑driven 3D face animation methods use coarse emotion labels, limiting fine‑grained control. Cafe‑Talk adopts a diffusion‑Transformer architecture with two‑stage training: pre‑training on speech and coarse conditions, then introducing a fine‑grained control adapter based on Action Units. A swap‑label training mechanism and mask‑based classifier‑free guidance enable decoupled control, achieving state‑of‑the‑art lip‑sync and expression quality.
Paper 04: Making Transformer Decoders Better Differentiable Indexers
Project URL: https://openreview.net/pdf?id=bePaRx0otZ
Abstract: Retrieval aims to select top‑k items from massive datasets. Traditional retrieval models embed queries and items and use ANN search. Recent generative retrieval treats items as token sequences decoded autoregressively, but suffers from a disjoint index‑construction and retrieval pipeline. The proposed URI framework unifies indexing and retrieval by ensuring strong consistency between index construction and the Transformer decoder, enabling end‑to‑end optimization and moving indexing to the interaction space rather than Euclidean space. Experiments on three real datasets show significant performance gains.
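The generative‑retrieval side of this can be illustrated with trie‑constrained decoding: items are token sequences, and each decoding step is restricted to tokens that extend some valid item id. The toy scorer and item ids below are hypothetical, not from the URI paper.

```python
from collections import defaultdict

def build_trie(item_ids):
    """Map each valid token-sequence prefix to its allowed next tokens."""
    nxt = defaultdict(set)
    for seq in item_ids:
        for i in range(len(seq)):
            nxt[tuple(seq[:i])].add(seq[i])
    return nxt

def constrained_greedy_decode(score_fn, trie, max_len):
    """Greedily decode an item token sequence, restricting every step to
    tokens that extend a valid item id, so the output is always a real item."""
    seq = []
    for _ in range(max_len):
        allowed = trie.get(tuple(seq), set())
        if not allowed:
            break
        seq.append(max(allowed, key=lambda t: score_fn(seq, t)))
    return seq

# Toy index: three items, each identified by a 3-token sequence.
items = [(1, 4, 2), (1, 4, 7), (3, 0, 5)]
trie = build_trie(items)
# Hypothetical scorer that simply prefers larger token ids.
decoded = constrained_greedy_decode(lambda prefix, tok: tok, trie, max_len=3)
print(decoded)  # [3, 0, 5]
```

URI's contribution is upstream of this step: learning the index (the token sequences themselves) jointly with the decoder, instead of fixing it with a separate clustering pass.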
Paper 05: Pyramidal Flow Matching for Efficient Video Generative Modeling
Project URL: https://pyramid-flow.github.io/
Abstract: Video generation requires modeling large spatio‑temporal spaces, leading to high computational cost. Existing cascade architectures lack knowledge sharing across stages. The proposed pyramidal flow matching reinterprets the denoising trajectory as multi‑level pyramid stages, running only the final stage at full resolution. The design links flows across pyramid levels and introduces a temporal pyramid autoregressive framework, enabling end‑to‑end training with a single diffusion Transformer. Trained within 20.7k A100 GPU‑hours, the model generates high‑quality 768p, 24‑fps videos.
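The pyramid idea can be sketched as a stage schedule: the denoising timeline is partitioned into windows, and each earlier window runs at a halved resolution. The equal‑width windows and power‑of‑two scales below are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def pyramid_schedule(full_res, n_stages):
    """Split the denoising trajectory [0, 1] into pyramid stages: stage k
    runs at 1 / 2**(n_stages - 1 - k) of full resolution, so only the last
    stage touches full-resolution latents."""
    stages = []
    bounds = np.linspace(0.0, 1.0, n_stages + 1)
    for k in range(n_stages):
        scale = 2 ** (n_stages - 1 - k)
        stages.append({
            "t_start": float(bounds[k]),
            "t_end": float(bounds[k + 1]),
            "res": (full_res[0] // scale, full_res[1] // scale),
        })
    return stages

for s in pyramid_schedule((768, 1280), 3):
    print(s)
```

With three stages, two thirds of the trajectory runs on latents with 1/16 or 1/4 of the full pixel count, which is where the compute savings come from.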
Paper 06: RecFlow: An Industrial Full Flow Recommendation Dataset
Project URL: https://github.com/RecFlow-ICLR/RecFlow
Abstract: Industrial recommendation systems operate in multi‑stage pipelines, yet existing benchmarks focus only on the exposure space. RecFlow introduces the first industrial‑grade full‑process recommendation dataset, containing 38 M interactions from 42 K users on ~9 M items, plus 1.9 B stage samples across six pipeline stages collected over 37 days. Experiments show that incorporating unexposed samples from earlier stages significantly improves algorithm performance, and several models have been deployed on Kuaishou with measurable gains.
Paper 07: Solving Token Gradient Conflict in Mixture‑of‑Experts for Large Vision‑Language Model
Project URL: https://github.com/longrongyang/STGC
Abstract: MoE models in LVLMs activate sparse experts per token, but routing modules ignore intra‑expert token gradient conflicts, causing interference. The proposed STGC method identifies conflicting tokens via token‑level gradients and applies a custom regularization loss to reroute them to other experts. The plug‑and‑play module improves performance across various LVLMs.
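A toy version of the conflict test: inside one expert, flag tokens whose gradient points against the expert's average gradient direction. The cosine threshold and the per‑token gradient vectors below are illustrative assumptions.

```python
import numpy as np

def conflicting_tokens(token_grads, threshold=0.0):
    """Flag tokens whose gradient has negative cosine similarity with the
    expert's average gradient direction: a toy stand-in for token-level
    conflict detection inside one MoE expert."""
    avg = token_grads.mean(axis=0)
    avg = avg / (np.linalg.norm(avg) + 1e-8)
    norms = np.linalg.norm(token_grads, axis=1, keepdims=True) + 1e-8
    cos = (token_grads / norms) @ avg
    return cos < threshold  # True = conflicting, a candidate for rerouting

grads = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [-1.0, 0.0]])  # last token opposes the consensus direction
print(conflicting_tokens(grads))  # [False False  True]
```

STGC then adds a regularization loss that encourages the router to send such flagged tokens to other experts, rather than editing the routing by hand as this sketch might suggest.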
Paper 08: Stable Segment Anything Model
Project URL: https://github.com/fanq15/Stable-SAM?tab=readme-ov-file
Abstract: SAM achieves high‑quality segmentation with strong prompts but requires precise annotations. This work analyzes SAM’s stability under low‑quality prompts and finds bias toward background or local features. A deformable sampling plugin adjusts feature sampling locations, and a dynamic routing plugin switches between deformable and regular grid sampling based on prompt quality. The solution improves segmentation stability across diverse prompts while preserving SAM’s efficiency, using only 0.08 M learnable parameters.
Paper 09: SynCamMaster: Synchronizing Multi‑Camera Video Generation from Diverse Viewpoints
Project URL: https://jianhongbai.github.io/SynCamMaster/
Abstract: Video diffusion models excel at realistic dynamics but struggle with multi‑camera consistency. SynCamMaster proposes a plug‑in module that fine‑tunes pretrained text‑to‑video models to generate synchronized videos from arbitrary camera poses using six‑DoF pose inputs. A multi‑view synchronization module enforces content and geometry consistency, and a progressive training scheme leverages multi‑camera renders and monocular videos to compensate for limited data. Experiments demonstrate superior multi‑view consistency and introduce the SynCamVideo dataset.
Paper 10: TaskGalaxy: Scaling Multi‑modal Instruction Fine‑tuning with Tens of Thousands Vision Task Types
Project URL: https://github.com/Kwai-YuanQi/TaskGalaxy
Abstract: Multimodal vision‑language models suffer from limited task diversity. TaskGalaxy builds a 19,227‑type hierarchical instruction fine‑tuning dataset (413,648 samples) by expanding a small set of human‑defined tasks with GPT‑4o, filtering with CLIP and GPT‑4o, and ensuring quality via multi‑model consensus. Applying TaskGalaxy to LLaVA‑v1.5 and InternVL‑Chat‑v1.0 yields significant improvements on 16 benchmarks, confirming the value of massive task diversity.
Paper 11: ReDeEP: Detecting Hallucination in Retrieval‑Augmented Generation via Mechanistic Interpretability
Project URL: https://github.com/Jeryi-Sun/ReDEeP-ICLR
Abstract: Retrieval‑augmented generation (RAG) still produces hallucinations even with accurate retrieved context. ReDeEP analyzes LLM internals using mechanistic interpretability, identifying “Copying Heads” that fail to copy external knowledge and “Knowledge FFNs” that drown external context. The method decouples external context and parametric knowledge via regression, offering token‑level and chunk‑level hallucination scores, and introduces AARF (AddAttentionReduceFFN) to dynamically re‑weight attention and FFN outputs during inference, reducing hallucinations without additional training.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.