ICLR 2026: Kuaishou Tech Team’s Cutting‑Edge AI Research Highlights
This article reviews ten Kuaishou‑authored papers accepted at ICLR 2026, summarizing their problem statements, novel methods such as front‑door causal attribution, visual table retrieval, denoising rerankers, difficulty‑adaptive reasoning, diffusion code infilling, group‑relative ranking optimization, generative ordinal regression, multimodal video retrieval, e‑commerce dialogue benchmarks, and a new LLM creativity evaluator, together with reported experimental gains.
ALM-MTA: Front‑Door Causal Multi‑Touch Attribution for Creator‑Ecosystem Optimization
Large‑scale recommendation systems lack precise labels and contain unobserved confounders, making back‑door adjustment ineffective for multi‑touch attribution. ALM‑MTA introduces an adversarially learned mediator that serves as a proxy for the outcome, enabling front‑door identification. A contrastive learning module constrains the marginalized front‑door probability on tightly matched “consumption‑post” sample pairs, addressing positivity violations in massive intervention spaces. Evaluation uses a non‑RCT bucket protocol that estimates uplift and computes AUUC at the intervention‑cluster level. In a production system with 400 million daily active users and 300 billion samples, ALM‑MTA yields a 0.04 % increase in DAU, a 0.6 % rise in daily active creators, and a 670 % boost in exposure efficiency. AUUC improves up to 0.070 over the previous state‑of‑the‑art across all propensity buckets, and post‑prediction AUC rises by 40 %.
Paper: https://openreview.net/pdf?id=3r68a6GOpg
Project: https://github.com/logwhistle/ALM-MTA
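For intuition, the front‑door identification that ALM‑MTA's learned mediator makes possible can be worked through on a toy discrete example. Everything below is hypothetical (the paper learns the mediator adversarially inside a recommender rather than tabulating conditional probabilities):

```python
# Toy illustration of the front-door adjustment:
#   P(Y=1 | do(X=x)) = sum_m P(m | x) * sum_x' P(Y=1 | m, x') * P(x')
# X: exposure to a touchpoint, M: mediator (proxy outcome), Y: conversion.
# All probability tables here are made up for illustration.

p_x = {0: 0.6, 1: 0.4}                      # P(X)
p_m_given_x = {0: {0: 0.8, 1: 0.2},         # P(M | X)
               1: {0: 0.3, 1: 0.7}}
p_y_given_mx = {(0, 0): 0.1, (0, 1): 0.2,   # P(Y=1 | M, X)
                (1, 0): 0.5, (1, 1): 0.6}

def front_door(x):
    """P(Y=1 | do(X=x)) via the front-door formula."""
    total = 0.0
    for m, p_m in p_m_given_x[x].items():
        # Marginalize the mediator's effect over the exposure distribution.
        inner = sum(p_y_given_mx[(m, xp)] * p_x[xp] for xp in p_x)
        total += p_m * inner
    return total
```

With these numbers the interventional effect P(Y=1 | do(X=1)) − P(Y=1 | do(X=0)) = 0.42 − 0.22 = 0.20, computed without ever observing the hidden confounder, which is precisely what back‑door adjustment cannot offer in this setting.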
TaR‑ViR: Multimodal Table Retrieval in the Open World
Traditional table retrieval flattens tables into linear text, discarding structural cues such as merged cells, irregular alignments, and embedded images, which degrades performance. TaR‑ViR defines a new benchmark that treats tables as images and reformulates retrieval as a multimodal task. Experiments show that removing the fragile text‑conversion step improves retrieval accuracy, demonstrating the advantage of visual representations for preserving table structure.
Paper: https://openreview.net/forum?id=4QPgqdQmYn
DNR: Denoising Neural Reranker for Recommender Systems
In two‑stage industrial recommender pipelines, recall scores from the first stage contain rich information that is under‑utilized by existing rerankers. The authors analyze scoring behaviors across stages and model the rerank problem as noise reduction on recall scores. DNR couples a denoising reranker with a noise‑generation module, decomposing the loss into three sub‑objectives: (1) denoising recall scores via sample augmentation, (2) adversarial sample exploration, and (3) aligning the generated recall‑score distribution. Extensive experiments on three public datasets and an industrial system confirm DNR’s superiority over naive baselines and existing SOTA rerankers.
Paper: https://openreview.net/pdf?id=JlwYkFm91F
DIVA‑GRPO: Difficulty‑Adaptive Variant Advantage for Multimodal Reasoning
Group‑Relative Policy Optimization (GRPO) improves multimodal large‑language‑model reasoning but suffers from sparse rewards and advantage vanishing when tasks are too easy or too hard. DIVA‑GRPO dynamically assesses problem difficulty, samples variants at appropriate difficulty levels, and computes advantages with difficulty‑weighted normalization across local (per‑problem) and global (problem‑plus‑variant) groups. Experiments on six reasoning benchmarks demonstrate faster training convergence and higher inference performance than prior methods.
Paper: https://openreview.net/pdf?id=qKXYEg00eH
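The advantage‑vanishing problem, and the flavor of DIVA‑GRPO's fix, can be sketched in a few lines. The blending scheme and the weight below are illustrative assumptions, not the paper's exact normalization:

```python
import statistics

def grpo_advantages(rewards):
    """Standard GRPO: when every rollout succeeds or fails, the group
    standard deviation is zero and the learning signal vanishes."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)   # too easy / too hard: no gradient
    return [(r - mu) / sigma for r in rewards]

def diva_advantages(problem_rewards, variant_rewards, w_local=0.5):
    """Hypothetical sketch of difficulty-weighted advantages: blend a local
    (per-problem) normalization with a global one over the problem plus its
    sampled variants, so a degenerate local group still yields a gradient."""
    local = grpo_advantages(problem_rewards)
    pooled = problem_rewards + variant_rewards
    mu, sigma = statistics.mean(pooled), statistics.pstdev(pooled)
    glob = ([0.0] * len(problem_rewards) if sigma == 0
            else [(r - mu) / sigma for r in problem_rewards])
    return [w_local * l + (1 - w_local) * g for l, g in zip(local, glob)]
```

On a group where all four rollouts are correct, `grpo_advantages` returns all zeros, while `diva_advantages` recovers a nonzero signal once harder variants (with some failures) join the pool.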
DreamOn: Diffusion Language Models for Code Infilling Beyond Fixed‑Size Canvas
Diffusion Language Models (DLMs) enable flexible, non‑autoregressive generation but require a fixed‑length mask, limiting code‑infilling when the desired length differs. DreamOn introduces two length‑control states that let the model autonomously expand or shrink its output length during diffusion, requiring only a minimal modification to the training objective and no architectural changes. Built on Dream‑Coder‑7B, DreamOn matches SOTA autoregressive models on HumanEval‑Infilling and SantaCoder‑FIM benchmarks and reaches oracle‑length performance, removing a major deployment obstacle for DLMs.
Paper: https://arxiv.org/pdf/2602.01326
Project: https://github.com/DreamLM/DreamOn
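A toy decoding step conveys the variable‑length idea. The control‑token names <expand> and <delete> and the example canvas below are illustrative, not DreamOn's actual vocabulary:

```python
# Toy simulation of variable-length infilling: during denoising, a masked
# position may emit a length-control state that grows or shrinks the canvas.
EXPAND, DELETE, MASK = "<expand>", "<delete>", "<mask>"

def apply_length_controls(canvas):
    """One decoding step: resolve control tokens, resizing the canvas."""
    out = []
    for tok in canvas:
        if tok == EXPAND:
            out.extend([MASK, MASK])   # one mask grows into two
        elif tok == DELETE:
            pass                       # a surplus mask removes itself
        else:
            out.append(tok)            # ordinary tokens are untouched
    return out

canvas = ["def", "f(", MASK, EXPAND, "):", DELETE]
print(apply_length_controls(canvas))
```

Because the canvas can resize itself between steps, the model no longer needs to be told the infill length in advance, the fixed‑size‑mask limitation the paper targets.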
GoalRank: Group‑Relative Optimization for a Large Ranking Model
The Generator‑Evaluator (G‑E) two‑stage ranking paradigm, even when extended to multiple generators (MG‑E), shows diminishing returns as candidate list size grows. The authors prove that a sufficiently large pure generator can approximate the optimal ranking strategy more closely than any finite G‑E/MG‑E system. GoalRank trains a single powerful generator using Group‑Relative Optimization (GRO): a reward model trained on real user feedback defines a reference strategy, and the generator minimizes KL divergence to this reference. Experiments on public benchmarks and a short‑video platform with over 400 million daily active users demonstrate significant offline and online gains, including higher user dwell time and watch duration.
Paper: https://openreview.net/pdf?id=gTMzRm8fb0
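The GRO objective described above, a reward‑induced reference strategy plus a KL term pulling the generator toward it, can be sketched on toy numbers. Constructing the reference as a softmax over reward‑model scores is an assumption for illustration:

```python
import math

def softmax(xs, temp=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temp) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over candidates."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical reward-model scores for three candidate lists.
rewards = [2.0, 0.5, -1.0]
p_ref = softmax(rewards)              # reference strategy induced by rewards
p_gen = softmax([1.0, 1.0, 1.0])      # generator's current (uniform) policy
loss = kl(p_ref, p_gen)               # GRO-style objective: pull p_gen to p_ref
```

Minimizing this loss drives the generator's distribution toward the reward‑preferred ordering; the loss is zero exactly when the two distributions match.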
GoR: A Unified Generative Framework for Ordinal Regression
Ordinal regression traditionally relies on discretization, which suffers from ambiguous boundaries and fixed bucket rigidity. GoR reframes numeric prediction as an autoregressive token‑generation task, emitting a sequence of “additive‑semantic” tokens terminated by a dynamic <EOS>. This yields interpretable, step‑wise refinement and eliminates fixed‑bucket constraints. The authors derive a bias‑variance bound for MSE and propose the Coverage–Distinctiveness Index (CoDi) to balance bias and variance when constructing token vocabularies. Evaluated on 15 benchmarks across five domains, GoR sets new SOTA, confirming the theoretical and practical advantages of the generative paradigm.
Paper: https://openreview.net/pdf?id=ys80cc2N5M
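A minimal sketch of additive‑semantic decoding: emitted tokens accumulate toward the target value and a dynamic <EOS> stops the refinement. The greedy decomposition and the token vocabulary below are illustrative only; the paper designs vocabularies via the CoDi index rather than fixing them by hand:

```python
def encode_additive(value, vocab=(100, 50, 20, 10, 5, 2, 1)):
    """Greedily emit additive tokens (vocab sorted descending) until the
    running sum reaches the integer target, then a dynamic <EOS>."""
    tokens, remaining = [], value
    for tok in vocab:
        while remaining >= tok:
            tokens.append(tok)
            remaining -= tok
    tokens.append("<EOS>")
    return tokens

def decode_additive(tokens):
    """The predicted value is the sum of the emitted numeric tokens."""
    return sum(t for t in tokens if t != "<EOS>")

print(encode_additive(37))   # e.g. [20, 10, 5, 2, '<EOS>']
```

Step‑wise refinement falls out naturally: each emitted token tightens the estimate, and stopping earlier simply yields a coarser prediction, with no fixed bucket boundaries.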
OmniCVR: A Benchmark for Omni‑Composed Video Retrieval with Vision, Audio, and Text
Existing video‑retrieval benchmarks focus on visual‑text alignment and ignore audio cues such as speech, music, and environmental sounds, which are essential for comprehensive video understanding. OmniCVR introduces a large‑scale, fully‑multimodal benchmark that fuses vision, audio, and text queries. The dataset is built via an automated pipeline that performs content‑aware segmentation, multimodal annotation, and dual verification by large‑language models and human experts. It defines three query types—visual‑only, audio‑only, and fused multimodal—with fused queries dominating. The authors also present AudioVLM2Vec, an audio‑aware model that achieves SOTA performance on OmniCVR, highlighting current limitations of multimodal retrieval systems in audio reasoning.
Paper: https://openreview.net/pdf?id=KxxR7emO5K
Mix‑Ecom: Mixed‑Type E‑Commerce Dialogues with Complex Domain Rules
Mix‑Ecom is a corpus of 4,799 real‑world customer‑service dialogues covering four dialogue types (QA, recommendation, task‑oriented, chit‑chat), three e‑commerce task categories (pre‑sale, logistics, post‑sale), and 82 domain rules. Baselines reveal that current agents struggle with mixed‑type dialogues and rule‑heavy scenarios, often hallucinating. The dataset and a proposed dynamic framework aim to benchmark and improve agent capabilities.
Paper: https://arxiv.org/pdf/2509.23836
CreataSet and CrEval: Evaluating Text Creativity Across Diverse Domains
Assessing creativity in large language models traditionally relies on costly human judgments. Existing automatic metrics lack generalization or alignment with human perception. The authors introduce a pairwise‑comparison framework that incorporates shared context instructions to improve consistency. CreataSet contains over 100 k human‑annotated samples and more than 1 M synthetic instruction‑response pairs spanning multiple creative tasks. Training an LLM‑based evaluator, CrEval, on CreataSet yields significantly higher agreement with human judgments than prior methods. Experiments also show that combining human and synthetic data is essential for robust evaluator training.
Paper: https://arxiv.org/pdf/2505.19236
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.