Kuaishou’s Nine Accepted Papers at ACM MM 2023: Summaries and Links
This article presents concise English summaries of nine Kuaishou research papers accepted at ACM MM 2023, covering no-reference video quality assessment, adaptive video quality models, blind image super-resolution, audio-visual-language transfer learning, motion-aware video diffusion, large-scale e-commerce retrieval, human-scene interaction, and interactive segmentation of images and radiance fields.
The ACM International Conference on Multimedia (ACM MM) 2023 in Ottawa accepted nine papers from Kuaishou, spanning video quality assessment, cross‑modal learning, image generation, super‑resolution, segmentation, and large‑scale retrieval. Below are English summaries, authors, and download links for each work.
Paper 01: Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment (Oral)
Download: https://arxiv.org/abs/2307.16813
Authors: Yuan Kun (Kuaishou), Kong Zishang (Peking University), Zheng Chuanchuan (Kuaishou), Sun Ming (Kuaishou)
Abstract: Kuaishou produces massive daily video content. To improve Quality-of-Experience, the paper proposes a Visual Quality Transformer (VQT) for no-reference assessment of user-generated videos. VQT uses sparse-sampling self-attention to efficiently select low-quality keyframes and models multiple co-existing distortions (block artifacts, noise, motion blur) by processing video frame sequences of varying sparsity in parallel. It achieves state-of-the-art performance on public benchmarks, surpassing Netflix's VMAF and Apple's AVQT, and is deployed in Kuaishou's internal quality monitoring, adaptive processing, and recommendation pipelines.
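The sparse-sampling idea can be pictured as cheaply scoring every frame, keeping only the most degraded ones, and running full self-attention over that small set. Below is a minimal PyTorch sketch of this pattern, not VQT's actual architecture; the class name, the per-frame scoring head, and the `keep` parameter are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseQualityAttention(nn.Module):
    """Toy sparse keyframe selection + self-attention (illustrative only)."""
    def __init__(self, dim=256, heads=4, keep=8):
        super().__init__()
        self.keep = keep                      # number of keyframes to keep
        self.score = nn.Linear(dim, 1)        # cheap per-frame quality score
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames):                # frames: (B, T, dim) frame features
        s = self.score(frames).squeeze(-1)    # (B, T) predicted quality
        # keep the k frames with the LOWEST scores (most degraded)
        idx = s.topk(self.keep, dim=1, largest=False).indices
        key = torch.gather(frames, 1, idx.unsqueeze(-1).expand(-1, -1, frames.size(-1)))
        out, _ = self.attn(key, key, key)     # attention over the sparse keyframes
        return out.mean(dim=1)                # pooled clip-level representation

feat = torch.randn(2, 64, 256)                # 2 clips, 64 frame features each
print(SparseQualityAttention()(feat).shape)   # torch.Size([2, 256])
```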
Paper 02: Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment (Oral)
Download: https://arxiv.org/abs/2308.00729
Authors: Liu Hongbo (Tsinghua), Wu Mingda (Kuaishou), Yuan Kun (Kuaishou), Sun Ming (Kuaishou), Tang Yansong (Tsinghua), Zheng Chuanchuan (Kuaishou), Li Xiu (Tsinghua)
Abstract: High-cost dense subjective labeling limits deep-learning-based video quality assessment. Ada-DQA leverages diverse pretrained foundation models to acquire quality-relevant features (content, degradation, motion) and employs knowledge distillation for efficient inference. It improves prior best results by 2-3% on public datasets and provides white-box quality analysis and actionable improvement suggestions within Kuaishou's visual quality system.
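A rough way to picture Ada-DQA's feature acquisition is a set of frozen pretrained backbones whose features are fused with learned weights, with a lightweight student distilled from the fused representation. The sketch below is a toy version under those assumptions; `AdaptiveFeatureAcquisition`, the gating scheme, and the stand-in `nn.Linear` teachers are invented for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureAcquisition(nn.Module):
    """Weight features from several frozen pretrained models (illustrative)."""
    def __init__(self, teachers, dim=128):
        super().__init__()
        self.teachers = nn.ModuleList(teachers)
        for t in self.teachers:               # pretrained teachers stay frozen
            t.requires_grad_(False)
        self.gate = nn.Parameter(torch.zeros(len(teachers)))
        self.head = nn.Linear(dim, 1)         # regress a quality score

    def forward(self, x):
        feats = torch.stack([t(x) for t in self.teachers], dim=0)  # (N, B, dim)
        w = self.gate.softmax(dim=0).view(-1, 1, 1)
        fused = (w * feats).sum(dim=0)        # adaptively weighted fusion
        return self.head(fused), fused        # score + feature for distillation

teachers = [nn.Linear(64, 128) for _ in range(3)]   # stand-ins for real backbones
teacher_model = AdaptiveFeatureAcquisition(teachers)
student = nn.Linear(64, 128)                        # lightweight student network
x = torch.randn(4, 64)
score, fused = teacher_model(x)
distill_loss = nn.functional.mse_loss(student(x), fused.detach())
```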
Paper 03: Blind Image Super-resolution with Rich Texture-Aware Codebook (Oral)
Download: https://arxiv.org/abs/2310.17188
Authors: Qin Rui (Tsinghua), Sun Ming (Kuaishou), Zhang Fangyuan (Tsinghua), Wang Bin (Tsinghua)
Abstract: Existing blind SR methods rely on codebooks learned only from high-resolution images, which struggle with diverse low-resolution degradations. The proposed RTCNet introduces a Rich Texture-aware Codebook, combining a Degradation-robust Texture Prior Module (DTPM) that incorporates low-resolution data into codebook learning with a Patch-aware Texture Prior Module (PTPM) that uses patch-level semantic pre-training to correct texture mis-perception. RTCNet achieves the best PSNR gains (0.16-0.46 dB) on multiple benchmarks and is integrated into Kuaishou's enhancement processing pipeline.
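The core codebook mechanism resembles the lookup used in VQ-style models: encode a degraded input, snap each feature to its nearest codebook entry, and decode from the matched texture priors. A minimal sketch of that lookup follows, assuming a simple nearest-neighbour match with a straight-through gradient; it is not RTCNet's DTPM/PTPM design.

```python
import torch
import torch.nn as nn

class TextureCodebook(nn.Module):
    """Nearest-neighbour codebook lookup, as in VQ-style SR priors (toy)."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, z):                       # z: (B, N, dim) degraded features
        # squared distance from every feature to every codebook entry
        d = (z.unsqueeze(-2) - self.codes.weight).pow(2).sum(-1)  # (B, N, K)
        idx = d.argmin(dim=-1)                  # index of the closest texture code
        zq = self.codes(idx)                    # quantized texture prior
        # straight-through estimator so gradients still reach the encoder
        return z + (zq - z).detach()

z = torch.randn(2, 16, 64)
print(TextureCodebook()(z).shape)               # torch.Size([2, 16, 64])
```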
Paper 04: Parameter-Efficient Transfer Learning for Audio-Visual-Language Tasks (Oral)
Download: https://arxiv.org/abs/2308.14274
Authors: Liu Hongye (China Metrology University), Xie Xianhai (Kuaishou), Gao Yang (Kuaishou), Li Si Ze (Kuaishou), Yu Zhou (Hangzhou Dianzi University)
Abstract: Fine-tuning all parameters of large pretrained models becomes infeasible as model size grows. The paper introduces a Long-Short-Term Three-Modal Adapter (LSTTA) that inserts lightweight adapter modules between frozen pretrained audio, visual, and language blocks. LSTTA contains a long-term gating module for overall video semantics and a short-term interaction module for local dynamics, achieving >1.6% improvements on multiple three-modal benchmarks with far fewer trainable parameters.
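The adapter pattern itself is standard: insert a small bottleneck with a learnable gate after each frozen block, so only a few parameters train. The sketch below shows that generic pattern, not LSTTA's specific long-term gating or short-term interaction modules; the dimensions and the `nn.Linear` stand-in for a pretrained block are assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a learnable gate; the backbone stays frozen."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, dim)     # project back up
        self.gate = nn.Parameter(torch.zeros(1)) # starts as the identity mapping

    def forward(self, x):
        return x + self.gate * self.up(torch.relu(self.down(x)))

frozen_block = nn.Linear(768, 768).requires_grad_(False)  # stand-in pretrained layer
adapter = Adapter()
x = torch.randn(2, 10, 768)
y = adapter(frozen_block(x))        # only the adapter (and its gate) are trainable
```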
Paper 05: MV-Diffusion: Motion-aware Video Diffusion Model
Download: https://hexiangteng.github.io/papers/ACM%20MM%202023%20MV%20diffusion.pdf
Authors: Deng Zijun (Peking University), He Xiangteng (Peking University), Peng Yuxin (Peking University), Zhu Xiongwei (Kuaishou), Cheng Lele (Kuaishou)
Abstract: The work proposes a motion-aware video diffusion model that explicitly models local motion trends using global trajectory information and motion-trend attention, improving temporal coherence over existing autoregressive diffusion methods. Experiments on three video generation tasks demonstrate the effectiveness of the trajectory modeling and motion-trend modules.
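One way to read "motion-trend attention" is as cross-attention from the latent tokens of a frame being denoised to embedded trajectory displacements, so each diffusion step sees where tracked points are heading. The following is a hypothetical sketch of that reading; the `(dx, dy)` trajectory format and module names are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class MotionTrendAttention(nn.Module):
    """Cross-attention from frame tokens to trajectory tokens (illustrative)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.traj_embed = nn.Linear(2, dim)    # (dx, dy) displacement -> token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, trajectories):
        # tokens: (B, N, dim) latent tokens of the frame being denoised
        # trajectories: (B, T, 2) per-step displacements of tracked points
        traj = self.traj_embed(trajectories)   # (B, T, dim)
        out, _ = self.attn(tokens, traj, traj) # inject motion trend into tokens
        return tokens + out

tokens = torch.randn(2, 64, 128)
traj = torch.randn(2, 16, 2)
print(MotionTrendAttention()(tokens, traj).shape)   # torch.Size([2, 64, 128])
```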
Paper 06: Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval
Download: https://hexiangteng.github.io/papers/ACM%20MM%202023%20Real20M.pdf
Authors: Chen Yanzhe (Peking University), Zhong Huasong (Kuaishou), He Xiangteng (Peking University), Peng Yuxin (Peking University), Cheng Lele (Kuaishou)
Abstract: Real20M is a 20-million-item multimodal dataset containing e-commerce products and short videos, designed for cross-domain retrieval. The dataset is collected via a query-driven pipeline, includes rich multimodal signals, and is paired with a three-stage entity-prompt learning framework and a query-driven cross-domain retrieval (QCD) method that aligns product and video modalities effectively.
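Cross-domain alignment of this kind is commonly trained with a symmetric contrastive loss between paired embeddings from the two domains. The snippet below shows that generic recipe as a stand-in for QCD's alignment objective; the function name, temperature, and embedding sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment(product_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired product/video embeddings (toy)."""
    p = F.normalize(product_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = p @ v.t() / temperature          # (B, B) similarity matrix
    target = torch.arange(p.size(0))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.t(), target)) / 2

loss = contrastive_alignment(torch.randn(8, 256), torch.randn(8, 256))
```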
Paper 07: Automatic Human Scene Interaction through Contact Estimation and Motion Adaptation
Download: https://dl.acm.org/doi/10.1145/3581783.3612218
Authors: Zhang Mingrui (Tsinghua), Chen Ming (Kuaishou), Zhou Yan (Kuaishou), Jian Weihua (Kuaishou), Wan Pengfei (Kuaishou), Chen Li (Tsinghua)
Abstract: The paper tackles natural interaction generation between virtual characters and environments. It first estimates human-environment contact by fusing 2D image cues and 3D pose estimation, then transfers motions using a kinematics-constrained enhanced Laplacian semantic descriptor. The method achieves state-of-the-art contact accuracy and is deployed in Kuaishou's digital-human production pipeline.
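As a toy geometric stand-in for the learned contact estimation, one can label a joint as "in contact" whenever a scene surface point lies within a small distance of it. The sketch below does exactly that; the threshold, joint count, and random scene points are placeholders, and the paper's actual 2D/3D fusion is far richer.

```python
import torch

def estimate_contacts(joints, scene_points, threshold=0.05):
    """Mark a joint as in contact if a scene point is within `threshold` metres.
    A toy geometric stand-in for the paper's learned 2D/3D fusion."""
    # joints: (J, 3) posed 3D joints; scene_points: (P, 3) sampled scene surface
    d = torch.cdist(joints, scene_points)     # (J, P) pairwise distances
    nearest = d.min(dim=1).values             # distance to closest surface point
    return nearest < threshold                # (J,) boolean contact labels

joints = torch.rand(24, 3)
scene = torch.rand(1000, 3)
print(estimate_contacts(joints, scene))
```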
Paper 08: Feature Decoupling-Recycling Network for Fast Interactive Segmentation
Download: https://arxiv.org/abs/2308.03529
Authors: Zeng Huimin (USTC), Wang Weinong (Xiaohongshu), Tao Xin (Kuaishou), Xiong Zhiwei (USTC), Dai Yurong (Dartmouth), Pei Wenjie (Harbin Institute of Technology)
Abstract: Existing interactive segmentation methods repeatedly extract features from the source image at every interaction, causing redundant computation. FDRN decouples source-image semantics from user guidance, separates high- and low-level features, and isolates current from historical guidance, enabling feature reuse across interactions. Experiments on six datasets show up to 4.25x speedup with competitive segmentation quality and strong cross-task generalization.
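The recycling idea can be demonstrated with a model that runs its heavy image encoder once, caches the result, and re-fuses it with cheap guidance features on every new click. The sketch below is a minimal illustration under that assumption; the convolutional stand-ins and the `_cache` mechanism are invented, not FDRN's modules.

```python
import torch
import torch.nn as nn

class RecyclingSegmenter(nn.Module):
    """Extract image features once, then reuse them for every new click (toy)."""
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 3, padding=1)   # stand-in image encoder
        self.guidance = nn.Conv2d(2, dim, 3, padding=1)   # encodes click maps
        self.head = nn.Conv2d(dim, 1, 1)
        self._cache = None                                # recycled image features

    def forward(self, image, clicks, new_image=False):
        if self._cache is None or new_image:
            self._cache = self.backbone(image)            # heavy pass, done once
        fused = self._cache + self.guidance(clicks)       # cheap per-click update
        return torch.sigmoid(self.head(fused))

model = RecyclingSegmenter()
img = torch.randn(1, 3, 64, 64)
clicks = torch.zeros(1, 2, 64, 64)             # positive/negative click channels
mask1 = model(img, clicks, new_image=True)     # full feature extraction
clicks[0, 0, 32, 32] = 1.0                     # user adds a positive click
mask2 = model(img, clicks)                     # image features are recycled
```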
Paper 09: Scene-Generalizable Interactive Segmentation of Radiance Fields
Download: https://arxiv.org/abs/2308.05104
Authors: Tang Songlin (Harbin Institute of Technology Shenzhen), Pei Wenjie (Harbin Institute of Technology Shenzhen), Tao Xin (Kuaishou), Jia Tanghui (Harbin Institute of Technology Shenzhen), Lu Guangming (Harbin Institute of Technology Shenzhen), Dai Yurong (Dartmouth)
Abstract: The authors introduce SGISRF, a method that enables interactive 3D object segmentation in unseen radiance-field scenes using only a few 2D clicks from multi-view images. Key contributions include cross-dimension guidance propagation, a 3D segmentation module that reduces uncertainty, and a hidden-exposed supervision scheme to correct 3D errors caused by 2D mask supervision. The approach outperforms scene-specific baselines on two challenging real-world benchmarks.
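Propagating a 2D click into 3D starts, at minimum, with lifting the clicked pixel to a camera ray and sampling points along it inside the radiance field. The following is a toy version of that first step, assuming known intrinsics `K` and a camera-to-world pose; SGISRF's cross-dimension guidance propagation goes well beyond this.

```python
import torch

def click_to_ray_points(u, v, K, cam_to_world, n_samples=32, near=0.1, far=4.0):
    """Lift a 2D click (u, v) into 3D sample points along its camera ray.
    A toy version of propagating 2D guidance into a radiance field."""
    K_inv = torch.linalg.inv(K)
    pix = torch.tensor([u, v, 1.0])
    d_cam = K_inv @ pix                          # ray direction in camera space
    d_world = cam_to_world[:3, :3] @ d_cam       # rotate into world space
    d_world = d_world / d_world.norm()
    origin = cam_to_world[:3, 3]                 # camera centre in world space
    t = torch.linspace(near, far, n_samples)     # sample depths along the ray
    return origin + t[:, None] * d_world         # (n_samples, 3) 3D points

K = torch.tensor([[500., 0., 32.], [0., 500., 32.], [0., 0., 1.]])
c2w = torch.eye(4)
pts = click_to_ray_points(20, 40, K, c2w)
print(pts.shape)                                 # torch.Size([32, 3])
```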
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.