
Eight Kwai Papers Accepted at CVPR 2024 – Text-to-Image, Video Quality & 3D Generation

Kwai (Kuaishou) has eight papers accepted at CVPR 2024 covering multi‑dimensional human preference for text‑to‑image generation, short‑video quality assessment, efficient video quality assessment, compressed video enhancement, conditional unsigned distance fields, universal cross‑domain retrieval, perception‑oriented frame interpolation, and test‑time energy adaptation.


The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) is a top‑tier international venue for computer vision research. This year saw a record 11,532 submissions, of which 2,719 papers were accepted (a 23.6% acceptance rate). Kwai had eight papers accepted at CVPR 2024, spanning text‑to‑image evaluation, video quality assessment, 3D generation, and more.

Paper 01: Learning Multi‑dimensional Human Preference for Text‑to‑Image Generation

Paper URL: https://arxiv.org/pdf/2405.14705

Current text‑to‑image models are evaluated with statistical metrics that do not fully capture human preferences. Existing work reduces rich human preferences to a single overall score, ignoring the fact that preferences vary across different aspects. To address this, the authors propose Multi‑dimensional Preference Score (MPS), the first multi‑dimensional human‑preference model for evaluating text‑to‑image generation. MPS adds a preference‑conditioning module to CLIP and is trained on a newly collected Multi‑dimensional Human Preference (MHP) dataset containing 918,315 human choices over 607,541 images across four dimensions: aesthetics, semantic alignment, detail quality, and overall assessment. Experiments on three evaluation datasets and four preference dimensions show MPS outperforms existing scoring methods.
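The core idea can be sketched in a few lines: instead of one global score, each preference dimension gates the text features before the image–text similarity is computed. The function and mask names below are hypothetical illustrations, not the paper's actual MPS architecture (which learns the conditioning module end‑to‑end on top of CLIP):

```python
import numpy as np

def mps_score(img_feat, txt_feat, condition_mask):
    """Hypothetical sketch of dimension-conditioned preference scoring.

    img_feat, txt_feat: unit-normalized embeddings (e.g., from CLIP).
    condition_mask: 0/1 weights selecting the feature channels that one
    preference dimension (aesthetics, alignment, detail, ...) attends to.
    """
    conditioned = txt_feat * condition_mask            # gate text features
    conditioned /= np.linalg.norm(conditioned) + 1e-8  # re-normalize
    return float(img_feat @ conditioned)               # cosine-style similarity

rng = np.random.default_rng(0)
img = rng.normal(size=64); img /= np.linalg.norm(img)
txt = rng.normal(size=64); txt /= np.linalg.norm(txt)
aesthetics_mask = (rng.random(64) > 0.5).astype(float)  # toy condition
score = mps_score(img, txt, aesthetics_mask)
```

Running the same image–prompt pair through four different masks yields four scores, one per preference dimension, rather than a single collapsed number.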

Paper 02: KVQ – Kwai Video Quality Assessment for Short‑form Videos

Paper URL: https://arxiv.org/abs/2402.07220

Short‑video UGC platforms (e.g., Kwai, Douyin) have become mainstream media, but special effects and complex processing pipelines pose challenges for quality assessment: (1) effects and distortions confuse quality‑determining regions; (2) multiple mixed distortions are hard to distinguish. The authors build KVQ, the first large short‑video quality dataset with 600 user‑uploaded videos and 3,600 processed variants, each annotated with MOS and indistinguishable‑sample rankings by visual experts. They propose KSVQE, a quality assessor that leverages large visual‑language models to recognize quality‑determining semantics and a distortion‑understanding module to separate distortions. Experiments demonstrate KSVQE’s effectiveness on KVQ and standard VQA benchmarks.

Paper 03: PTM‑VQA – Efficient Video Quality Assessment Leveraging Diverse Pre‑Trained Models from the Wild

Paper URL: https://arxiv.org/abs/2405.17765

Video quality assessment (VQA) is challenging due to factors such as content appeal, distortion types, and motion. Annotating MOS is costly, limiting dataset scale and hindering deep‑learning methods. PTM‑VQA transfers knowledge from diverse pre‑trained models to improve VQA accuracy while reducing reliance on massive labeled data. Features from multiple pre‑trained models are aggregated, and two loss terms—Intra‑sample Consistency (IC) and Inter‑sample Discriminability (ID)—are introduced to align features in a unified quality‑perception space and enforce pseudo‑clustering based on sample labels. A selection strategy for candidate pre‑trained models is also proposed. Extensive experiments validate the method’s effectiveness.
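The two loss terms have a simple shape: IC pulls the features that different pre‑trained models produce for the *same* video toward one another, while ID pushes apart features of videos whose quality labels differ. The following is a minimal sketch under assumed inputs (per‑sample feature matrices and MOS label gaps), not the paper's exact formulation:

```python
import numpy as np

def intra_sample_consistency(feats):
    """IC sketch. feats: (M, D) — one video's features from M pre-trained
    models, projected to a shared space. Penalizes disagreement with the
    mean so all backbones describe the sample consistently."""
    center = feats.mean(axis=0)
    return float(((feats - center) ** 2).mean())

def inter_sample_discriminability(feat_a, feat_b, label_gap, margin=1.0):
    """ID sketch, hinge-style pseudo-clustering: two videos whose MOS
    labels differ by label_gap should sit at least margin * label_gap
    apart in feature space."""
    dist = np.linalg.norm(feat_a - feat_b)
    return float(max(0.0, margin * label_gap - dist))
```

Summing the two terms over a batch gives a training signal that shapes the aggregated features into a unified quality‑perception space without requiring more MOS annotations.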

Paper 04: CPGA – Coding Priors‑Guided Aggregation Network for Compressed Video Quality Enhancement

Paper URL: https://arxiv.org/abs/2403.10362

Recent VQE methods overlook coding priors (motion vectors, residual frames) that contain rich temporal and spatial cues. CPGA introduces a coding‑prior‑guided aggregation network with two key modules: (1) Inter‑frame Temporal Aggregation to fuse temporal information from consecutive frames and coding priors; (2) Multi‑scale Non‑local Aggregation guided by residual frames to aggregate global spatial information. A new Coding‑Prior dataset (VCP) with 300 raw videos and various HEVC configurations is also released. CPGA surpasses state‑of‑the‑art methods in PSNR and runs 10% faster.
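To make the "coding priors are free temporal cues" point concrete: a decoder already carries per‑block motion vectors, so the previous frame can be aligned without running an optical‑flow network. This toy warp is an assumed illustration of that idea, not CPGA's actual aggregation module:

```python
import numpy as np

def warp_with_motion_vectors(prev, mvs, block=4):
    """Align the previous frame using per-block integer motion vectors
    (dy, dx) taken from the codec — a coding prior obtained for free.

    prev: (H, W) frame; mvs: (H//block, W//block, 2) integer offsets.
    """
    h, w = prev.shape
    out = np.zeros_like(prev)
    for by in range(h // block):
        for bx in range(w // block):
            dy, dx = mvs[by, bx]
            ys = np.clip(np.arange(by * block, (by + 1) * block) + dy, 0, h - 1)
            xs = np.clip(np.arange(bx * block, (bx + 1) * block) + dx, 0, w - 1)
            out[by * block:(by + 1) * block,
                bx * block:(bx + 1) * block] = prev[np.ix_(ys, xs)]
    return out
```

With zero motion the warp is an identity; with real motion vectors it pre‑aligns temporal neighbors so the aggregation network only has to fuse, not estimate, motion.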

Paper 05: UDiFF – Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion

Paper URL: https://arxiv.org/abs/2404.06851

Diffusion models excel at image generation, editing, and inpainting, but existing neural implicit approaches (e.g., signed distance functions) only generate closed‑surface shapes. UDiFF proposes a 3D diffusion model based on unsigned distance fields (UDF) that can generate textured 3D shapes with open surfaces, conditioned on text or unconditionally. It introduces an optimized wavelet transform to create a compact representation space for UDFs and demonstrates superior performance on standard benchmarks.
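The reason UDFs admit open surfaces is easy to see in miniature: a signed distance needs a well‑defined inside and outside, which an open sheet or curve does not have, whereas an unsigned distance is defined everywhere. A 2D sketch with a line segment (the 2D analogue of an open surface):

```python
import numpy as np

def udf_segment(p, a, b):
    """Unsigned distance from point p to the open segment a-b. This is
    well-defined even though 'inside/outside' (a sign) is not."""
    ab, ap = b - a, p - a
    t = np.clip(ap @ ab / (ab @ ab), 0.0, 1.0)  # project p onto the segment
    return float(np.linalg.norm(p - (a + t * ab)))

a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])
d_on = udf_segment(np.array([0.5, 0.0]), a, b)   # on the "surface"
d_off = udf_segment(np.array([2.0, 0.0]), a, b)  # beyond the endpoint
```

The generated shape is recovered as the zero level set of the field; UDiFF's contribution is generating such fields efficiently by diffusing in an optimized wavelet coefficient space rather than on the raw volume.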

Paper 06: ProS – Prompting‑to‑Simulate Generalized Knowledge for Universal Cross‑Domain Retrieval

Paper URL: https://arxiv.org/abs/2312.12478

Universal Cross‑Domain Retrieval (UCDR) aims for robust retrieval under domain and semantic shifts. Existing prompt‑tuned pretrained models struggle with simultaneous domain and semantic transfer. ProS introduces a two‑stage prompting framework: (1) Prompt Unit Learning captures domain and semantic knowledge via masked and aligned units; (2) Context‑aware Simulator learns dynamic content‑aware prompts (CaDP) that generate universal features for UCDR. Experiments on three benchmarks show ProS achieves state‑of‑the‑art results with minimal extra parameters.

Paper 07: Perception‑Oriented Video Frame Interpolation via Asymmetric Blending

Paper URL: https://arxiv.org/pdf/2404.06692

Existing video frame interpolation (VFI) methods suffer from blur and ghosting under large motion due to motion estimation errors and misaligned supervision. The proposed PerVFI introduces an Asymmetric Synergistic Blending (ASB) module that fuses features from both reference frames, emphasizing primary content from one and complementary information from the other. A self‑learned sparse quasi‑binary mask reduces ghosting, and a normalizing‑flow‑based generator trained with negative log‑likelihood improves detail fidelity. Experiments show PerVFI markedly improves perceptual quality over prior methods.
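The asymmetric blend itself reduces to a simple pattern: a near‑binary mask keeps the primary reference's content and admits the secondary frame only where it genuinely complements (e.g., disocclusions), which is what curbs ghosting. A minimal sketch under assumed feature inputs, not PerVFI's learned module:

```python
import numpy as np

def asymmetric_blend(feat_primary, feat_secondary, logits, tau=10.0):
    """Asymmetric blending sketch. A steep sigmoid turns learned logits
    into a quasi-binary mask: most positions commit fully to the primary
    frame instead of averaging the two (averaging is what causes ghosts)."""
    mask = 1.0 / (1.0 + np.exp(-tau * logits))  # quasi-binary in (0, 1)
    return mask * feat_primary + (1.0 - mask) * feat_secondary
```

In the paper the mask is additionally encouraged to be sparse, so the secondary frame contributes only in the few regions the primary frame cannot explain.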

Paper 08: TEA – Test‑time Energy Adaptation

Paper URL: https://arxiv.org/abs/2311.14402

Test‑time adaptation (TTA) enhances model generalization when test data shifts from training distribution, without accessing training data. Existing TTA methods ignore covariate shift, which can degrade calibration and introduce bias. TEA reframes the problem from an energy‑based perspective, converting a trained classifier into an energy model and aligning its internal distribution with the test data distribution. Extensive experiments across tasks, benchmarks, and architectures demonstrate TEA’s superior generalization and calibration performance.
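The classifier‑to‑energy conversion at TEA's core is the standard joint‑energy view: the free energy of an input is the negative log‑sum‑exp of its logits, so lower energy means the model assigns the input higher unnormalized density. A numerically stable sketch:

```python
import numpy as np

def energy_from_logits(logits):
    """Free energy of an input under a classifier viewed as an energy
    model: E(x) = -logsumexp(f(x)). Computed with the max-shift trick
    for numerical stability."""
    m = logits.max()
    return float(-(m + np.log(np.exp(logits - m).sum())))
```

Test‑time adaptation then takes gradient steps that lower the energy of test inputs (typically updating only a small set of parameters, such as normalization statistics), pulling the model's implicit density toward the shifted test distribution.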

Tags: Artificial Intelligence · text-to-image · video quality assessment · 3D generation · CVPR 2024
Written by Kuaishou Large Model (Official Kuaishou Account)