
Highlights of Five Selected AAAI 2024 Papers on Recommendation, Retrieval, and Video Generation

This article presents concise overviews of five AAAI 2024 accepted papers covering multi‑stage reinforcement‑learning recommendation, error‑adaptive watch‑time prediction, coarse‑to‑fine text‑to‑video retrieval, enhanced fashion image retrieval, and conditional image‑to‑video generation, each with authors, download links, and reported performance gains.


The AAAI Conference on Artificial Intelligence is a top‑tier international venue that received 12,100 submissions in 2024, accepting 2,342 papers (23.75% acceptance). Below are brief introductions to five selected papers spanning reinforcement‑learning recommendation, watch‑time prediction, text‑to‑video retrieval, fashion image retrieval, and conditional image‑to‑video generation.

Paper 01: UNEX‑RL: Reinforcing Long‑Term Rewards in Multi‑Stage Recommender Systems with UNidirectional EXecution
Download: https://arxiv.org/abs/2401.06470
Authors: Gengrui Zhang, Yao Wang, Xiaoshuang Chen, Hongyi Qian, Kaiqiao Zhan, Ben Wang (Kuaishou)
Abstract: Recent interest in using reinforcement learning (RL) to optimize long‑term rewards in recommender systems runs into difficulty when the system is multi‑stage (e.g., coarse ranking, fine ranking, re‑ranking): a single agent cannot model the distinct observation space of each stage. The authors propose UNEX‑RL, a multi‑agent RL framework with unidirectional execution that optimizes long‑term user rewards across stages. They identify observation dependency and cascade effects as the key challenges and introduce a Cascading Information Chain (CIC) to separate each stage's independent observations from its action‑dependent ones, enabling effective training of UNEX‑RL; variance‑reduction techniques are also discussed. Offline experiments on public datasets and online A/B tests on a platform with over 100 million users show a 0.558% increase in user watch time compared to single‑agent RL.
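To make the "unidirectional execution" idea concrete, here is a hypothetical sketch (the paper's learned agents and CIC training machinery are not reproduced): stages run in a fixed order, and each downstream agent observes the upstream stage's action, which is exactly the observation dependency UNEX‑RL has to handle. The `make_agent` helper and the `(name, agent, keep)` stage tuples are illustrative inventions, not the paper's API.

```python
# Hypothetical sketch of unidirectional multi-stage execution: each stage's
# agent receives its candidates plus the upstream stage's action (selection).

def make_agent(score_fn):
    """A toy per-stage 'agent' that ranks candidates by a score function;
    a real agent would also condition on the upstream action/context."""
    return lambda candidates, upstream_action: sorted(
        candidates, key=score_fn, reverse=True)

def run_cascade(items, stages):
    """stages: list of (name, agent, keep) applied unidirectionally;
    each stage truncates to its own candidate budget."""
    selected, upstream = list(items), None
    trace = []
    for name, agent, keep in stages:
        selected = agent(selected, upstream)[:keep]
        upstream = list(selected)      # the next stage observes this action
        trace.append((name, list(selected)))
    return selected, trace
```

In a real system the budgets shrink sharply per stage (thousands to dozens to a handful), and it is this funnel that creates the cascade effects the paper analyzes.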

Paper 02: CREAD: A Classification‑Restoration Framework with Error Adaptive Discretization for Watch Time Prediction in Video Recommender Systems
Download: http://arxiv.org/abs/2401.07521
Authors: Jie Sun, Zhaoying Ding, Xiaoshuang Chen, Qi Chen, Yincheng Wang, Kaiqiao Zhan, Ben Wang (Kuaishou)
Abstract: Predicting watch time is crucial but difficult due to its highly imbalanced distribution. Existing methods discretize watch time into multiple binary targets, incurring large learning or restoration errors. The proposed CREAD framework consists of Discretization, Classification, and Restoration modules. It introduces Error‑Adaptive Discretization (EAD) to balance learning and restoration errors, achieving better performance than traditional discretization. Offline evaluations on public and industrial datasets, as well as A/B tests on Kuaishou, demonstrate a 0.29% increase in average watch time.
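The classification‑restoration pattern is easy to sketch. The snippet below is illustrative only: it uses plain equal‑frequency bins as a stand‑in for the paper's Error‑Adaptive Discretization (which places edges to balance learning error against restoration error), and assumes a classifier elsewhere predicts the per‑threshold exceedance probabilities.

```python
import numpy as np

def make_bins(durations, num_bins):
    """Equal-frequency discretization of observed watch times -- a simple
    stand-in for the paper's Error-Adaptive Discretization (EAD)."""
    return np.quantile(durations, np.linspace(0.0, 1.0, num_bins + 1))

def restore_watch_time(p_exceed, bin_edges):
    """Restoration step: given classifier outputs P(t > t_k) for each left
    bin edge t_k, recover a scalar estimate via the discrete form of
    E[t] = integral of P(t > s) ds:
        E[t] ~= t_0 + sum_k P(t > t_k) * (t_{k+1} - t_k)."""
    widths = np.diff(bin_edges)                       # one width per bin
    return float(bin_edges[0] + np.sum(np.asarray(p_exceed) * widths))
```

For example, with edges [0, 10, 20] and predicted exceedance probabilities [1.0, 0.5], the restored estimate is 15.0 seconds. The imbalance problem the paper targets shows up here as the choice of edges: uniform bins waste resolution in the long tail, which is why an error‑adaptive placement helps.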

Paper 03: Towards Efficient and Effective Text‑to‑Video Retrieval with Coarse‑to‑Fine Visual Representation Learning
Download: https://arxiv.org/abs/2401.00701
Code: https://github.com/adxcreative/EERCF
Authors: Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li (Kuaishou)
Abstract: Large‑scale image‑text pre‑training (e.g., CLIP) has spurred text‑to‑video retrieval research. Existing state‑of‑the‑art methods either fuse text and visual features (single‑tower) or use fine‑grained alignments, both incurring high computational cost. The authors propose a two‑stage recall‑rerank architecture (EERCF). During training, a parameter‑free Text‑Gate Interaction Block (TIB) learns fine‑grained video representations under a Pearson constraint. At inference, a coarse‑grained representation quickly recalls the top‑k candidates, which are then re‑ranked using fine‑grained features. EERCF achieves comparable or superior accuracy while reducing FLOPs by 14‑126× on benchmarks (MSRVTT‑1K‑Test, MSRVTT‑3K‑Test, VATEX, ActivityNet).
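The recall‑rerank split is where the FLOPs savings come from: one cheap dot product per video in stage 1, expensive fine‑grained scoring only for the surviving k candidates. A minimal sketch, assuming precomputed embeddings and using max frame similarity as a generic fine‑grained score (the paper's TIB‑learned representations are not reproduced here):

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def recall_then_rerank(text_emb, coarse_embs, fine_embs, k=50, top=5):
    """Stage 1: rank all N videos by a single coarse embedding (N dot
    products). Stage 2: re-score only the top-k with per-frame embeddings
    of shape (N, F, D), taking the best-matching frame per video."""
    t = l2norm(text_emb)
    scores = l2norm(coarse_embs) @ t               # (N,) cheap pass over everything
    cand = np.argsort(-scores)[:k]                 # recall stage
    fine = l2norm(fine_embs[cand])                 # (k, F, D) costly pass on k only
    fine_scores = (fine @ t).max(axis=1)           # max over frames per candidate
    return cand[np.argsort(-fine_scores)][:top]    # rerank stage
```

With N videos, F frames, and D dims, stage 2 costs O(k·F·D) instead of O(N·F·D), which is the source of the reported 14‑126× reduction when k ≪ N.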

Paper 04: FashionERN: Enhance‑and‑Refine Network for Composed Fashion Image Retrieval
Download: http://39.108.48.32/mipl/download_paper.php?fileId=202403
Authors: Yanzhe Chen (Peking University), Huasong Zhong (Kuaishou), Xiangteng He (Peking University), Yuxin Peng (Peking University), Jiahuan Zhou (Peking University), Lele Cheng (Kuaishou)
Abstract: Composed fashion retrieval combines a reference image with a short textual modification to find target items. Existing methods use symmetric encoders pretrained on non‑e‑commerce data, leading to a "visual‑dominant" bias where the reference image overwhelms the textual cue. The proposed FashionERN introduces a three‑branch text‑semantic strengthening module and a two‑stage visual‑semantic optimization module that progressively filters irrelevant visual details while enriching text semantics, achieving state‑of‑the‑art performance on four e‑commerce datasets.
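The task setup itself is simple to sketch. The toy composition below is not FashionERN's method (its strengthening and refinement modules are learned); it just shows the composed‑retrieval interface, with `text_weight` as a hypothetical knob that crudely counters the "visual‑dominant" bias by upweighting the text cue:

```python
import numpy as np

def compose_query(ref_img_emb, text_emb, text_weight=0.7):
    """Toy composed query: a weighted sum of the reference-image and
    modification-text embeddings, biased toward the text so the reference
    image does not dominate. Illustrative only."""
    q = ((1.0 - text_weight) * np.asarray(ref_img_emb)
         + text_weight * np.asarray(text_emb))
    return q / np.linalg.norm(q)

def retrieve(query, gallery, top=5):
    """Rank gallery items by cosine similarity with the composed query."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ query))[:top]
```

A learned composer replaces the fixed weighted sum in practice; the point of the sketch is that retrieval quality hinges on how much of the text's "change this attribute" signal survives the fusion.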

Paper 05: Decouple Content and Motion for Conditional Image‑to‑Video Generation
Download: https://arxiv.org/abs/2311.14294
Authors: Cuifeng Shen, Yulu Gan (Peking University), Chen Chen (Institute of Automation, CAS), Xiongwei Zhu, Lele Cheng, Tingting Gao, Jinzhi Wang (Kuaishou)
Abstract: The paper presents a video diffusion model that explicitly models motion to improve temporal coherence. It introduces a trajectory modeling module that extracts global motion trajectories and a motion‑trend attention mechanism that infers motion trends from optical flow rather than implicit RGB cues. Experiments on three video generation tasks demonstrate superior quality and efficiency compared with existing autoregressive diffusion approaches.

All five papers report significant empirical gains on public benchmarks and large‑scale industrial deployments, highlighting the practical impact of recent advances in reinforcement learning for recommendation, adaptive discretization for watch‑time prediction, efficient cross‑modal retrieval, fashion‑specific composed retrieval, and motion‑aware video generation.

Artificial Intelligence, Recommendation Systems, Generative Models, Reinforcement Learning, Video Retrieval, AAAI 2024
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
