Eight Kwai Papers Accepted at CVPR 2024 – Text-to-Image, Video Quality & 3D Generation
Kwai (Kuaishou) has eight papers accepted at CVPR 2024 covering multi‑dimensional human preference for text‑to‑image generation, short‑video quality assessment, efficient video quality assessment, compressed video enhancement, conditional unsigned distance fields, universal cross‑domain retrieval, perception‑oriented frame interpolation, and test‑time energy adaptation.
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is a top‑tier international venue for computer vision research. This year a record 11,532 papers were submitted, of which 2,719 were accepted (a 23.6% acceptance rate). Kwai has eight papers selected for CVPR 2024, covering text‑to‑image evaluation, video quality assessment, 3D generation, and more.
Paper 01: Learning Multi‑dimensional Human Preference for Text‑to‑Image Generation
Paper URL: https://arxiv.org/pdf/2405.14705
Current text‑to‑image models are evaluated with statistical metrics that do not fully capture human preferences. Existing work reduces rich human preferences to a single overall score, ignoring the fact that preferences vary across different aspects. To address this, the authors propose Multi‑dimensional Preference Score (MPS), the first multi‑dimensional human‑preference model for evaluating text‑to‑image generation. MPS adds a preference‑conditioning module to CLIP and is trained on a newly collected Multi‑dimensional Human Preference (MHP) dataset containing 918,315 human choices over 607,541 images across four dimensions: aesthetics, semantic alignment, detail quality, and overall assessment. Experiments on three evaluation datasets and four preference dimensions show MPS outperforms existing scoring methods.
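To make the idea of dimension‑conditioned scoring concrete, below is a minimal sketch of a CLIP‑based preference scorer in the spirit of MPS, assuming the open_clip package. The learnable per‑dimension condition embedding and the simple additive fusion with the text feature are illustrative assumptions, not the paper's exact conditioning module; in practice the model would be trained on pairwise human choices from the MHP dataset.

```python
# Hedged sketch: dimension-conditioned preference scoring on top of CLIP.
# The condition embedding and fusion strategy are assumptions for illustration.
import torch
import torch.nn as nn
import open_clip


class PreferenceConditionedScorer(nn.Module):
    def __init__(self, dims=("aesthetics", "alignment", "detail", "overall")):
        super().__init__()
        self.clip, _, self.preprocess = open_clip.create_model_and_transforms(
            "ViT-B-32", pretrained="openai")
        self.tokenizer = open_clip.get_tokenizer("ViT-B-32")
        # One learnable condition embedding per preference dimension (assumption).
        self.condition = nn.Embedding(len(dims), 512)
        self.dims = {d: i for i, d in enumerate(dims)}

    def forward(self, image, prompt, dim):
        img_feat = self.clip.encode_image(image)                    # (B, 512)
        txt_feat = self.clip.encode_text(self.tokenizer([prompt]))  # (1, 512)
        idx = torch.tensor([self.dims[dim]], device=image.device)
        cond_txt = txt_feat + self.condition(idx)                   # fuse text + condition
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        cond_txt = cond_txt / cond_txt.norm(dim=-1, keepdim=True)
        return (img_feat * cond_txt).sum(dim=-1)                    # per-dimension score
```

With such a scorer, the same image–prompt pair yields a separate score for each preference dimension simply by switching the condition index.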
Paper 02: KVQ – Kwai Video Quality Assessment for Short‑form Videos
Paper URL: https://arxiv.org/abs/2402.07220
Short‑video UGC platforms (e.g., Kwai, Douyin) have become mainstream media, but special effects and complex processing pipelines make quality assessment difficult: (1) special effects obscure the regions that actually determine perceived quality; (2) multiple mixed distortions are hard to tell apart. The authors build KVQ, the first large‑scale short‑video quality dataset, with 600 user‑uploaded videos and 3,600 processed variants, each annotated by visual experts with MOS and with rankings among otherwise indistinguishable samples. They propose KSVQE, a quality assessor that uses a large vision‑language model to identify quality‑determining semantics and a distortion‑understanding module to disentangle distortions. Experiments demonstrate KSVQE’s effectiveness on KVQ and standard VQA benchmarks.
Paper 03: PTM‑VQA – Efficient Video Quality Assessment Leveraging Diverse Pre‑Trained Models from the Wild
Paper URL: https://arxiv.org/abs/2405.17765
Video quality assessment (VQA) is challenging due to factors such as content appeal, distortion types, and motion. Annotating MOS is costly, which limits dataset scale and hinders deep‑learning‑based methods. PTM‑VQA transfers knowledge from diverse pre‑trained models to improve VQA accuracy while reducing reliance on large labeled datasets. Features from multiple frozen pre‑trained models are projected into a unified quality‑perception space and aggregated, guided by two loss terms: Intra‑sample Consistency (IC), which keeps features of the same sample consistent across backbones, and Inter‑sample Discriminability (ID), which separates samples with different quality labels (a form of pseudo‑clustering). A strategy for selecting candidate pre‑trained models is also proposed. Extensive experiments validate the method’s effectiveness.
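A hedged sketch of the two loss terms as described above: features of the same video from different backbones are pulled toward each other (IC), while videos with clearly different MOS are pushed apart (ID). The exact formulations, margins, and weighting used by PTM‑VQA may differ; this is only meant to illustrate the roles of the two terms.

```python
# Illustrative IC/ID losses over projected features from K frozen backbones.
import torch
import torch.nn.functional as F


def ic_id_losses(features, mos, margin=0.5):
    """features: (B, K, D) projected features from K pre-trained models.
    mos: (B,) mean opinion scores used to form pseudo-clusters."""
    feats = F.normalize(features, dim=-1)

    # IC: features of the same sample from different backbones should agree.
    mean_feat = feats.mean(dim=1, keepdim=True)                  # (B, 1, D)
    ic = (1 - (feats * mean_feat).sum(-1)).mean()

    # ID: samples with clearly different MOS should be separable in feature space.
    pooled = F.normalize(feats.mean(dim=1), dim=-1)              # (B, D)
    sim = pooled @ pooled.t()                                    # (B, B)
    dissimilar = ((mos[:, None] - mos[None, :]).abs() > margin).float()
    id_loss = (sim.clamp(min=0) * dissimilar).sum() / dissimilar.sum().clamp(min=1)

    return ic, id_loss
```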
Paper 04: CPGA – Coding Priors‑Guided Aggregation Network for Compressed Video Quality Enhancement
Paper URL: https://arxiv.org/abs/2403.10362
Recent compressed‑video quality enhancement (VQE) methods overlook coding priors (motion vectors, residual frames) that contain rich temporal and spatial cues. CPGA introduces a coding‑prior‑guided aggregation network with two key modules: (1) Inter‑frame Temporal Aggregation to fuse temporal information from consecutive frames and coding priors; (2) Multi‑scale Non‑local Aggregation guided by residual frames to aggregate global spatial information. A new Coding‑Prior dataset (VCP) with 300 raw videos and various HEVC configurations is also released. CPGA surpasses state‑of‑the‑art methods in PSNR while running about 10% faster.
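As one way to picture how coding priors can guide temporal aggregation, the sketch below warps neighbouring‑frame features onto the current frame using bitstream motion vectors before fusion. The function names and the simple concatenate‑then‑convolve fusion are illustrative assumptions, not CPGA's actual module design.

```python
# Hedged sketch: motion-vector-guided warping and fusion of neighbour-frame features.
import torch
import torch.nn.functional as F


def warp_with_motion_vectors(feat, mv):
    """feat: (B, C, H, W) neighbour-frame features; mv: (B, 2, H, W) motion
    vectors in pixels (dx, dy) mapping current-frame positions to the neighbour."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)      # (2, H, W)
    # Normalize sampled pixel coordinates to [-1, 1] for grid_sample.
    x = 2.0 * (base[0] + mv[:, 0]) / (w - 1) - 1.0
    y = 2.0 * (base[1] + mv[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)                               # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


def aggregate(current_feat, neighbour_feats, motion_vectors, fuse_conv):
    """Warp each neighbour to the current frame, then fuse with a learned conv."""
    warped = [warp_with_motion_vectors(f, mv)
              for f, mv in zip(neighbour_feats, motion_vectors)]
    return fuse_conv(torch.cat([current_feat] + warped, dim=1))
```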
Paper 05: UDiFF – Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion
Paper URL: https://arxiv.org/abs/2404.06851
Diffusion models excel at image generation, editing, and inpainting, but existing neural implicit approaches (e.g., signed distance functions) only generate closed‑surface shapes. UDiFF proposes a 3D diffusion model based on unsigned distance fields (UDF) that can generate textured 3D shapes with open surfaces, conditioned on text or unconditionally. It introduces an optimized wavelet transform to create a compact representation space for UDFs and demonstrates superior performance on standard benchmarks.
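To illustrate the idea of a compact wavelet‑domain representation for UDF volumes, the snippet below decomposes a stand‑in UDF grid with a fixed Haar wavelet from PyWavelets. This is only an assumption for illustration: UDiFF learns an optimized wavelet transform for UDFs and runs diffusion over the resulting coefficients, rather than using an off‑the‑shelf filter.

```python
# Hedged sketch: a UDF volume compressed into coarse wavelet coefficients.
import numpy as np
import pywt

# Stand-in UDF grid; in practice this would be |distance| to a surface on a 3D grid.
udf = np.abs(np.random.randn(64, 64, 64)).astype(np.float32)

coeffs = pywt.dwtn(udf, wavelet="haar")   # 8 sub-bands for a 3D signal
coarse = coeffs["aaa"]                     # low-frequency band: the compact code
print(udf.shape, "->", coarse.shape)       # (64, 64, 64) -> (32, 32, 32)

# The full coefficient set reconstructs the volume (up to float precision).
recon = pywt.idwtn(coeffs, wavelet="haar")
print(np.allclose(recon, udf, atol=1e-5))
```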
Paper 06: ProS – Prompting‑to‑Simulate Generalized Knowledge for Universal Cross‑Domain Retrieval
Paper URL: https://arxiv.org/abs/2312.12478
Universal Cross‑Domain Retrieval (UCDR) aims for robust retrieval under domain and semantic shifts. Existing prompt‑tuned pretrained models struggle with simultaneous domain and semantic transfer. ProS introduces a two‑stage prompting framework: (1) Prompt Unit Learning captures domain and semantic knowledge via masked and aligned units; (2) Context‑aware Simulator learns dynamic content‑aware prompts (CaDP) that generate universal features for UCDR. Experiments on three benchmarks show ProS achieves state‑of‑the‑art results with minimal extra parameters.
Paper 07: Perception‑Oriented Video Frame Interpolation via Asymmetric Blending
Paper URL: https://arxiv.org/pdf/2404.06692
Existing video frame interpolation (VFI) methods suffer from blur and ghosting under large motion due to motion estimation errors and misaligned supervision. The proposed PerVFI introduces an Asymmetric Synergistic Blending (ASB) module that fuses features from both reference frames, emphasizing primary content from one and complementary information from the other. A self‑learned sparse quasi‑binary mask reduces ghosting, and a normalizing‑flow‑based generator trained with negative log‑likelihood improves detail fidelity. Experiments show PerVFI markedly improves perceptual quality over prior methods.
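Below is an illustrative sketch of asymmetric blending with a sparse quasi‑binary mask, following the description above: one reference provides the primary content and the other only fills in complementary regions selected by the mask. The module shapes, the sigmoid‑with‑temperature mask parameterisation, and the L1 sparsity term are assumptions, not PerVFI's exact ASB design.

```python
# Hedged sketch of asymmetric blending with a sparse, quasi-binary mask.
import torch
import torch.nn as nn


class AsymmetricBlend(nn.Module):
    def __init__(self, channels, temperature=0.1):
        super().__init__()
        self.mask_head = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)
        self.temperature = temperature

    def forward(self, f_primary, f_secondary):
        # Predict a soft mask and sharpen it towards {0, 1} ("quasi-binary").
        logits = self.mask_head(torch.cat([f_primary, f_secondary], dim=1))
        mask = torch.sigmoid(logits / self.temperature)
        # Primary features dominate; the secondary frame only fills masked regions.
        blended = (1 - mask) * f_primary + mask * f_secondary
        # An L1 penalty on the mask encourages sparsity (training-time regulariser).
        sparsity = mask.mean()
        return blended, sparsity
```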
Paper 08: TEA – Test‑time Energy Adaptation
Paper URL: https://arxiv.org/abs/2311.14402
Test‑time adaptation (TTA) improves model generalization when test data shift away from the training distribution, without access to the training data. Existing TTA methods do not fundamentally resolve this covariate shift, which can degrade calibration and introduce bias. TEA reframes the problem from an energy‑based perspective: it converts the trained classifier into an energy model and adapts it so that its internal distribution aligns with the test data distribution. Extensive experiments across tasks, benchmarks, and architectures demonstrate TEA’s superior generalization and calibration.
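A hedged sketch of the energy‑based view described above: a classifier f defines a marginal energy E(x) = −logsumexp(f(x)), and adaptation lowers the energy of test samples while raising the energy of model samples drawn with SGLD, a contrastive‑divergence‑style objective. The step sizes, sample initialisation, and which parameters are updated are illustrative assumptions, not TEA's exact training recipe.

```python
# Hedged sketch: energy-based test-time adaptation of a trained classifier.
import torch


def energy(model, x):
    # Marginal energy of x under the classifier-as-energy-model view.
    return -torch.logsumexp(model(x), dim=1)


def sgld_sample(model, shape, steps=20, step_size=1.0, noise=0.01, device="cpu"):
    # Draw approximate model samples with stochastic gradient Langevin dynamics.
    x = torch.rand(shape, device=device, requires_grad=True)
    for _ in range(steps):
        grad = torch.autograd.grad(energy(model, x).sum(), x)[0]
        x = (x - step_size * grad + noise * torch.randn_like(x)
             ).detach().requires_grad_(True)
    return x.detach()


def tea_adaptation_step(model, optimizer, x_test):
    x_model = sgld_sample(model, x_test.shape, device=x_test.device)
    # Lower energy on test data, raise it on model samples: aligns the model's
    # internal distribution with the test distribution.
    loss = energy(model, x_test).mean() - energy(model, x_model).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```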
Kuaishou Large Model
Official Kuaishou Account