Top 12 Kuaishou Papers Accepted at AAAI 2026: Breakthroughs in Recommendation, Video Generation, and LLM Research
Kuaishou had 12 papers accepted at AAAI 2026, covering advances in search and recommendation systems, multi‑shot video generation, multimodal understanding, generative model fundamentals, video large language models, experimental design, and LLM latent‑space reasoning; three of the papers were selected as oral presentations.
AAAI 2026 Overview
AAAI 2026 was held in Singapore (Jan 20‑27), receiving 23,680 submissions and accepting 4,167 papers for a 17.6% acceptance rate. Kuaishou contributed 12 accepted papers across the areas listed above, three of which were selected for oral presentations.
Paper 01 – Align³GR: Unified Multi‑Level Alignment for LLM‑Based Generative Recommendation (Oral)
Link: https://arxiv.org/abs/2511.11255 Key contributions: (1) Introduces a Semantic‑Collaborative ID (SCID) that fuses semantic and collaborative signals at the representation level. (2) Multi‑task supervised fine‑tuning with bidirectional alignment objectives equips the LLM with core recommendation capabilities. (3) Progressive preference alignment combines self‑play reinforcement learning with real‑world feedback to adapt to user preferences under sparse signals. On public benchmarks, Align³GR improves Recall@10 by 17.8% and NDCG@10 by 20.2% over the strongest baselines.
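To make the SCID idea concrete, here is a minimal sketch of fusing a semantic item embedding with a collaborative embedding and quantizing the result into a discrete ID token. The module names, dimensions, and single-level codebook are illustrative assumptions, not the paper's implementation (which may use multi-level or residual codes).

```python
# Minimal sketch (not the paper's implementation): fuse a semantic embedding
# (e.g., from item text) with a collaborative embedding (e.g., from a CF model)
# and quantize the fused vector against a codebook to get a discrete
# "semantic-collaborative" ID token. All names here are hypothetical.
import torch
import torch.nn as nn

class FusedIDTokenizer(nn.Module):
    def __init__(self, sem_dim=768, collab_dim=64, hidden=256, codebook_size=1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sem_dim + collab_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # One codebook for simplicity; a real ID scheme may use several levels.
        self.codebook = nn.Embedding(codebook_size, hidden)

    def forward(self, sem_emb, collab_emb):
        z = self.fuse(torch.cat([sem_emb, collab_emb], dim=-1))   # (B, hidden)
        dist = torch.cdist(z, self.codebook.weight)               # (B, codebook_size)
        token_id = dist.argmin(dim=-1)                            # discrete ID per item
        return token_id, self.codebook(token_id)                  # ID + quantized vector

tokenizer = FusedIDTokenizer()
ids, quantized = tokenizer(torch.randn(4, 768), torch.randn(4, 64))
print(ids.shape, quantized.shape)  # torch.Size([4]) torch.Size([4, 256])
```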
Paper 02 – CroPS: Improving Dense Retrieval with Cross‑Perspective Positive Samples in Short‑Video Search (Oral)
Link: https://arxiv.org/pdf/2511.15443 Method: Builds a multi‑perspective data engine that enriches training signals through (i) query‑level augmentation via query rewrites, (ii) system‑level expansion by injecting high‑confidence interactions from recommendation streams, and (iii) world‑knowledge injection using LLM‑generated synthetic samples. Introduces Hierarchical Label Assignment (HLA) with an H‑InfoNCE loss to rank strong, weak, and negative relevance levels separately. Deployed at scale, CroPS improves click‑through rate and video watch time while reducing the query‑switch rate.
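The hierarchical labels suggest a tiered contrastive objective. Below is a sketch of one plausible form, where strong positives are contrasted against weak positives plus negatives, and weak positives only against negatives; this is an assumption about the structure of H‑InfoNCE, not the paper's exact loss.

```python
# Illustrative tiered InfoNCE objective in the spirit of hierarchical label
# assignment. Temperatures, weighting, and the exact denominator sets are
# assumptions, not the paper's H-InfoNCE definition.
import torch
import torch.nn.functional as F

def tiered_infonce(q, strong, weak, neg, tau=0.05):
    """q: (d,) query embedding; strong/weak/neg: (n, d) document embeddings."""
    sim = lambda x: (F.normalize(x, dim=-1) @ F.normalize(q, dim=-1)) / tau
    s, w, n = sim(strong), sim(weak), sim(neg)
    # Level 1: each strong positive vs. {weak positives + negatives}
    l1 = -(s - torch.logsumexp(torch.cat([s, w, n]), dim=0)).mean()
    # Level 2: each weak positive vs. {negatives}
    l2 = -(w - torch.logsumexp(torch.cat([w, n]), dim=0)).mean()
    return l1 + l2

loss = tiered_infonce(torch.randn(128), torch.randn(2, 128),
                      torch.randn(3, 128), torch.randn(16, 128))
```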
Paper 03 – Fairness‑Aware Design for Contextual Experiments (Oral)
Key idea: Proposes the F‑CTSD algorithm, which minimizes the required sample size while enforcing fairness constraints across heterogeneous subgroups. The paper derives exact sample‑complexity bounds under fairness constraints and proves that treatment‑effect estimation remains unbiased. Empirically, F‑CTSD reduces fairness‑violation rates by 4.95% compared with baselines.
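As a rough illustration of the kind of quantity a fairness constraint could bound, the toy snippet below measures how unevenly treatment‑effect estimation precision is distributed across subgroups; the metric, data, and subgroup handling are assumptions and not the F‑CTSD algorithm.

```python
# Toy sketch (not F-CTSD): per-subgroup variance of the difference-in-means
# treatment-effect estimator, and the max gap across subgroups as a crude
# "unfairness" measure. All choices here are illustrative assumptions.
import numpy as np

def subgroup_precision_gap(y, treated, groups):
    variances = {}
    for g in np.unique(groups):
        m = groups == g
        y_t, y_c = y[m & treated], y[m & ~treated]
        variances[g] = y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c)
    vals = np.array(list(variances.values()))
    return variances, vals.max() - vals.min()

rng = np.random.default_rng(0)
y = rng.normal(size=1000)
treated = rng.random(1000) < 0.5
groups = rng.integers(0, 3, size=1000)
per_group, gap = subgroup_precision_gap(y, treated, groups)
```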
Paper 04 – Beyond Tokens: Dynamic Latent Reasoning via Semantic Residual Refinement
Innovation: Introduces DyLaR, which uses a Semantic Residual Refinement (SRR) module to iteratively merge hidden‑state residuals with token‑embedding projections, creating latent representations beyond the convex hull of word embeddings. A dynamic switching mechanism driven by output entropy toggles between discrete token generation and latent‑space reasoning. Achieves up to 4.95% accuracy gain and 17.52% token‑efficiency improvement on reasoning benchmarks.
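A minimal sketch of the entropy‑gated switching idea: when the next‑token distribution is confident, emit a discrete token; when it is uncertain, stay in latent space and refine the hidden state. The refinement step and threshold below are illustrative stand‑ins for the SRR module, not its actual design.

```python
# Sketch of entropy-driven switching between discrete decoding and latent-space
# reasoning steps. The blending/refinement and the threshold are assumptions.
import torch
import torch.nn.functional as F

def decode_step(logits, hidden, embed, refine, entropy_threshold=2.0):
    """logits: (V,), hidden: (d,) last hidden state, embed: (V, d) token embeddings."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    if entropy < entropy_threshold:
        # Confident: emit a discrete token and feed its embedding back.
        token = probs.argmax()
        return token.item(), embed[token]
    # Uncertain: stay in latent space; blend the hidden state with the
    # probability-weighted ("soft") token embedding and refine the residual.
    expected_emb = probs @ embed
    return None, refine(hidden + expected_emb)

refine = torch.nn.Linear(512, 512)
token, next_input = decode_step(torch.randn(32000), torch.randn(512),
                                torch.randn(32000, 512), refine)
```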
Paper 05 – BLM‑Guard: Explainable Multimodal Ad Moderation
Approach: Combines Interleaved‑modal Chain‑of‑Thought (ICoT) reasoning, policy‑aligned reward modeling, and rule‑driven supervision. The GRPO‑SCAR algorithm provides self‑consistency and adaptive rewards, integrating policy rules with model outputs for dynamic alignment. Experiments show superior accuracy, cross‑scene consistency, and rule generalization.
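As a rough sketch of how rule‑driven and self‑consistency signals might combine into a scalar reward for a GRPO‑style update, consider the snippet below; both reward terms and the weighting are assumptions rather than the GRPO‑SCAR definition.

```python
# Illustrative reward combining a rule-citation check with agreement against the
# group majority verdict. Names, terms, and weights are hypothetical.
from collections import Counter

def moderation_reward(verdict, rationale_cites_rule, group_verdicts,
                      rule_weight=0.5, consistency_weight=0.5):
    # Rule-driven term: did the explanation cite an applicable policy rule?
    rule_reward = 1.0 if rationale_cites_rule else 0.0
    # Self-consistency term: agreement with the majority verdict in the sampled group.
    majority, _ = Counter(group_verdicts).most_common(1)[0]
    consistency_reward = 1.0 if verdict == majority else 0.0
    return rule_weight * rule_reward + consistency_weight * consistency_reward

r = moderation_reward("reject", True, ["reject", "reject", "approve", "reject"])
```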
Paper 06 – Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings
Link: https://arxiv.org/abs/2503.18719 Technique: Proposes 2‑D Random Positional Encoding (RPE‑2D) that samples coordinates without replacement on a larger grid during training, statistically covering high‑resolution positions. At inference, deterministic near‑uniform grids interpolate these encodings. Combined with random scaling/cropping augmentations and attention‑time‑step offsets, RPE‑2D achieves state‑of‑the‑art resolution‑generalization on ImageNet, enabling low‑resolution training to generate high‑resolution images efficiently.
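The core sampling trick can be sketched in a few lines: during training, row and column indices for a small latent grid are drawn without replacement from a larger virtual grid, while inference uses near‑uniform spacing. The interpolation, augmentations, and attention‑time‑step offsets from the paper are not reproduced here, and the grid size below is an assumption.

```python
# Sketch of RPE-2D-style positional index sampling: training draws sorted
# row/column indices without replacement from a larger max_size grid so that
# high-resolution positions are covered statistically; inference spaces indices
# near-uniformly over the same grid.
import torch

def rpe2d_indices(h, w, max_size=256, training=True):
    if training:
        rows = torch.sort(torch.randperm(max_size)[:h]).values
        cols = torch.sort(torch.randperm(max_size)[:w]).values
    else:
        rows = torch.linspace(0, max_size - 1, h).round().long()
        cols = torch.linspace(0, max_size - 1, w).round().long()
    return torch.cartesian_prod(rows, cols)   # (h*w, 2) positions fed to the PE table

train_pos = rpe2d_indices(16, 16, training=True)
infer_pos = rpe2d_indices(32, 32, training=False)
```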
Paper 07 – Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment
Link: https://arxiv.org/abs/2512.23413 Contributions: Releases the RAD dataset (70k samples) with structured aesthetic descriptions generated via an iterative pipeline. Proposes ArtQuant, a framework that leverages a large‑language‑model decoder to jointly model visual and textual modalities, reducing prediction entropy by 67% and achieving state‑of‑the‑art performance with only 33% of traditional training epochs.
Paper 08 – FilmWeaver: Weaving Consistent Multi‑Shot Videos with Cache‑Guided Autoregressive Diffusion
Link: https://arxiv.org/abs/2512.11274 Method: Uses an autoregressive diffusion backbone and a dual‑cache system: a shot‑memory cache preserves long‑term character and background identity, while a temporal‑memory cache stores recent frames for smooth motion. Supports multi‑concept injection, video extension, and interactive editing, outperforming baselines on consistency and aesthetic metrics.
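A minimal sketch of the dual‑cache bookkeeping described above: a long‑term shot‑memory cache for character and background identities plus a short‑term temporal‑memory cache for recent frame latents, concatenated into conditioning for the next shot. Cache sizes and the conditioning format are assumptions.

```python
# Sketch of a dual-cache conditioning scheme in the spirit of FilmWeaver's
# description. Feature shapes and the concatenation strategy are hypothetical.
from collections import deque
import torch

class DualCache:
    def __init__(self, temporal_len=8):
        self.shot_memory = {}                               # concept name -> identity feature
        self.temporal_memory = deque(maxlen=temporal_len)   # recent frame latents

    def register_concept(self, name, feature):
        self.shot_memory[name] = feature

    def push_frames(self, frame_latents):                   # (t, d) latents of generated frames
        for f in frame_latents:
            self.temporal_memory.append(f)

    def conditioning(self):
        parts = list(self.shot_memory.values()) + list(self.temporal_memory)
        return torch.stack(parts) if parts else None        # (n_cond, d) tokens for next shot

cache = DualCache()
cache.register_concept("protagonist", torch.randn(512))
cache.push_frames(torch.randn(4, 512))
cond = cache.conditioning()                                  # (5, 512)
```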
Paper 09 – LLM‑Aligned Geographic Item Tokenization for Local‑Life Recommendation
Link: https://arxiv.org/abs/2511.14221 Framework: LGSID consists of (1) an RL‑based geographic LLM alignment mechanism that samples location‑dense prompts and trains a list‑wise reward model, (2) Geographic Direct Preference Optimization (G‑DPO), which constructs hybrid preference data to align the LLM with geographic signals, and (3) hierarchical geographic tokenization that first quantizes spatial attributes and then applies residual quantization to geographic‑aware semantic vectors. Experiments on Kuaishou’s industrial dataset show significant relevance and conversion gains; the system is fully deployed.
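The hierarchical tokenization step can be illustrated with a geography‑first scheme: a coarse spatial cell derived from the item's coordinates, followed by residual quantization of a geo‑aware semantic vector. Grid granularity and the fixed random codebooks below are illustrative assumptions, not LGSID's learned tokenizer.

```python
# Toy two-stage geographic item tokenization: coarse lat/lon cell, then residual
# quantization of a semantic vector. Grid and codebook sizes are assumptions.
import numpy as np

def spatial_cell(lat, lon, cells=64):
    row = int((lat + 90) / 180 * cells)
    col = int((lon + 180) / 360 * cells)
    return min(row, cells - 1) * cells + min(col, cells - 1)

def residual_quantize(vec, codebooks):
    """codebooks: list of (K, d) arrays; returns one code index per level."""
    codes, residual = [], vec.copy()
    for cb in codebooks:
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))
        codes.append(int(idx))
        residual = residual - cb[idx]
    return codes

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]
item_tokens = [spatial_cell(39.99, 116.32)] + residual_quantize(rng.normal(size=64), codebooks)
```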
Paper 10 – OneSug: Unified End‑to‑End Generative Framework for E‑commerce Query Suggestion
Link: https://arxiv.org/pdf/2506.06913 Design: Replaces the traditional recall→ranking→re‑ranking pipeline with a single encoder‑decoder generation model. Core components: (i) prefix2query representation enhancement that fuses semantic understanding and interaction signals, (ii) end‑to‑end generation of suggestion queries, and (iii) behavior‑segmented reward weighting to capture fine‑grained user preferences. Deployed in Kuaishou e‑commerce search, OneSug improves user experience and conversion while handling full traffic for over six months.
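Behavior‑segmented reward weighting can be pictured as crediting suggested queries by the strongest downstream behavior they led to; the segment names and weights below are illustrative assumptions, not OneSug's reward definition.

```python
# Toy reward weighting by behavior segment (exposure < click < add-to-cart < order).
# Segments and weights are hypothetical.
BEHAVIOR_WEIGHTS = {"exposure": 0.1, "click": 0.4, "cart": 0.7, "order": 1.0}

def suggestion_rewards(candidates):
    """candidates: list of (suggested_query, strongest_behavior) pairs."""
    return [(q, BEHAVIOR_WEIGHTS.get(behavior, 0.0)) for q, behavior in candidates]

rewards = suggestion_rewards([("wireless earbuds", "order"),
                              ("wireless charger", "click"),
                              ("wireless mouse", "exposure")])
```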
Paper 11 – TEMPLE: Incentivizing Temporal Understanding of Video LLMs via Progressive Pre‑SFT Alignment
Link: https://arxiv.org/pdf/2503.16929 Pipeline: (1) Generates temporal preference pairs automatically using video perturbations (segment drop, shuffle, reverse). (2) Progressive Pre‑SFT Alignment applies curriculum learning to gradually increase perturbation difficulty, followed by direct preference optimization before instruction‑following fine‑tuning. (3) Introduces a benchmark suite covering strong, weak, and negative temporal relations. Results show lower training loss, more stable gradients, and substantial gains on temporal perception and reasoning tasks.
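The preference‑pair construction from step (1) is easy to sketch: the original frame order is the preferred sample and a perturbed copy (segment drop, shuffle, or reverse) is the dispreferred one. Segment count and sampling details below are assumptions.

```python
# Sketch of building a temporal preference pair from one video via segment-level
# perturbations, mirroring the drop/shuffle/reverse operations described above.
import random

def perturb_frames(frames, mode, n_segments=4):
    seg = max(1, len(frames) // n_segments)
    segments = [frames[i:i + seg] for i in range(0, len(frames), seg)]
    if mode == "drop":
        segments.pop(random.randrange(len(segments)))
    elif mode == "shuffle":
        random.shuffle(segments)
    elif mode == "reverse":
        segments = segments[::-1]
    return [f for s in segments for f in s]

frames = list(range(16))                        # stand-in for frame indices
preference_pair = {"chosen": frames,
                   "rejected": perturb_frames(frames, mode="shuffle")}
```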
Paper 12 – TIME: Temporal‑Sensitive Multi‑Dimensional Instruction Tuning and Robust Benchmarking for Video‑LLMs
Link: https://arxiv.org/pdf/2503.09994 Contributions: Builds a multi‑dimensional temporal instruction dataset covering duration, order, position, and dynamic reasoning, leveraging VidOR, Ego4D, and automated Q&A generation with bias‑mitigation filters. Training incorporates frame‑index prediction and video‑QA auxiliary tasks without extra annotation cost. Introduces a temporal‑focused benchmark with strict single‑frame shortcut filtering. Experiments demonstrate significant improvements across four major video‑LLMs while preserving general video task performance.
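The frame‑index prediction auxiliary task can be sketched as a templated question over sampled frame timestamps; the prompt template below is an illustrative assumption rather than the paper's format.

```python
# Sketch of constructing a frame-index prediction sample: given uniformly
# sampled frame timestamps, ask which sampled frame an event time falls on.
def frame_index_sample(video_duration_s, num_frames, event_time_s, question):
    timestamps = [i * video_duration_s / (num_frames - 1) for i in range(num_frames)]
    target = min(range(num_frames), key=lambda i: abs(timestamps[i] - event_time_s))
    prompt = (f"The video is sampled into {num_frames} frames at "
              f"{[round(t, 1) for t in timestamps]} seconds. {question} "
              f"Answer with the frame index.")
    return {"prompt": prompt, "answer": target}

sample = frame_index_sample(60.0, 8, event_time_s=23.0,
                            question="At which frame does the goal happen?")
```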
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.