
7 Kuaishou Papers Accepted at ACL 2025 Reveal Cutting‑Edge AI Advances

Kuaishou's foundational large‑model team secured seven papers at the prestigious ACL 2025 conference, covering alignment bias during model training, safety in inference, decoding strategies, fine‑grained video‑temporal understanding, and new evaluation benchmarks that push the frontier of multimodal large language models.


The Annual Meeting of the Association for Computational Linguistics (ACL) is a premier venue for natural language processing research. The 63rd Annual Meeting (ACL 2025) will be held in Vienna from July 27 to August 1, 2025. Kuaishou's foundational large‑model team announced that seven of its papers have been accepted, spanning cutting‑edge topics such as alignment bias in training, safety during inference, decoding strategies, video‑temporal understanding, and novel evaluation benchmarks.

Paper 01: TUNA – Comprehensive Fine‑grained Temporal Understanding Evaluation on Dense Dynamic Videos

Type: ACL25 Main

Paper link: https://friedrichor.github.io/projects/TUNA/

Abstract: Video uniquely integrates temporal elements—including shots, scenes, actions, and attributes—and their dynamic relationships over time. Existing video‑understanding benchmarks often treat these aspects separately or focus on limited facets, ignoring overall video coherence. To address this, we propose TUNA, a temporally‑focused benchmark for dense dynamic videos that includes two complementary tasks: video captioning and question answering. TUNA features diverse video scenes and dynamic attributes, accompanied by interpretable and robust evaluation metrics. Evaluations of several state‑of‑the‑art models on TUNA reveal key challenges in temporal video understanding, such as limited action description capability, insufficient multi‑entity comprehension, and insensitivity to camera motion, offering valuable insights for improving video‑understanding models.

Paper 02: Root Defense Strategies – Ensuring Safety of LLM at the Decoding Level

Type: ACL25 Main

Paper link: https://arxiv.org/pdf/2410.06809

Abstract: As large language models (LLMs) evolve, the risk of harmful outputs caused by erroneous or malicious prompts increases. Existing methods effectively mitigate jailbreak risks but suffer two major limitations: (1) they assess harmfulness only at the pre‑fill stage, neglecting valuable information from the decoding process, leading to reduced effectiveness and robustness; (2) single‑stage rejection of potentially harmful outputs severely harms model usefulness. We investigate LLMs' ability to recognize harmful outputs, quantifying their capacity to evaluate the danger of preceding tokens. Inspired by these findings, we design a robust defense mechanism at the decoding level. Our novel decoding‑oriented, step‑wise defense architecture directly corrects harmful query outputs instead of simply rejecting them, employing speculative decoding to improve usability and deployment speed. Extensive experiments show that our approach enhances model safety without affecting inference speed, leveraging LLMs' inherent harmful‑information detection while preserving high utility.
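The step‑wise idea in the abstract can be sketched in a few lines: generate in small chunks, score each partial output for harm, and correct a harmful continuation rather than rejecting the whole query. This is a minimal toy sketch; the function names, the blocklist scorer, and the threshold are illustrative assumptions, not the paper's actual implementation (which uses the LLM's own harm judgment and speculative decoding).

```python
def harm_score(partial_text: str) -> float:
    """Stub harm classifier. In the paper's setting this would be the
    LLM's own assessment of the tokens generated so far; here a trivial
    blocklist stands in for it (assumption for illustration only)."""
    blocklist = ("bomb", "poison")
    return 1.0 if any(word in partial_text.lower() for word in blocklist) else 0.0

def stepwise_safe_decode(prompt, generate_step, correct_step,
                         max_steps=8, threshold=0.5):
    """Generate in small steps; when a step looks harmful, replace it
    with a corrected continuation instead of rejecting the whole query."""
    text = prompt
    for _ in range(max_steps):
        candidate = generate_step(text)            # next chunk of tokens
        if harm_score(text + candidate) > threshold:
            candidate = correct_step(text)         # steer to a safe continuation
        text += candidate
    return text
```

The key design choice mirrored here is correction over rejection: the decoder keeps producing useful text even when an intermediate step trips the safety check.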

Paper 03: Towards Reward Fairness in RLHF – From a Resource Allocation Perspective

Type: ACL25 Main

Paper link: https://arxiv.org/pdf/2505.23349

Abstract: Reward functions serve as proxies for human preferences in Reinforcement Learning from Human Feedback (RLHF). Imperfect reward proxies often exhibit biases such as length preference, which can impair the alignment of large language models (LLMs). We term these biases “Reward Unfairness.” By framing preference learning as a resource‑allocation problem, we treat rewards as resources that must be distributed while balancing utility and fairness. We propose two fairness‑enhancing mechanisms: a fairness regularization term and a fairness coefficient. Applied respectively during the validation phase and the RL phase, they yield a fair‑reward model and a fair policy model. Experiments in both settings demonstrate that our methods achieve more equitable alignment of LLMs with human preferences.
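One way to picture the "rewards as resources" framing is to debias a length‑preferring reward with a fairness coefficient. The sketch below regresses reward on response length and subtracts the length‑explained component, scaled by a coefficient `lam`; this is only an illustration of the intuition, and the paper's actual fairness regularizer and coefficient are formulated differently.

```python
def fair_rewards(raw_rewards, lengths, lam=0.5):
    """Remove the component of reward explained by response length,
    scaled by a fairness coefficient lam (illustrative formulation)."""
    n = len(raw_rewards)
    mean_r = sum(raw_rewards) / n
    mean_l = sum(lengths) / n
    # Slope of reward on length (simple least squares).
    cov = sum((r - mean_r) * (l - mean_l) for r, l in zip(raw_rewards, lengths))
    var = sum((l - mean_l) ** 2 for l in lengths) or 1.0
    slope = cov / var
    # Subtract the length-explained part of each reward.
    return [r - lam * slope * (l - mean_l) for r, l in zip(raw_rewards, lengths)]
```

With `lam=1.0` and rewards that grow purely with length, the adjusted rewards become equal: the "resource" is redistributed so that length alone no longer wins.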

Paper 04: HAIC – Improving Human Action Understanding and Generation with Better Captions for Multimodal Large Language Models

Type: ACL25 Main

Paper link: https://arxiv.org/abs/2502.20811

Abstract: Multimodal large language models have made significant progress in video understanding, yet their performance on human‑action videos remains limited due to a lack of high‑quality data. Moreover, the formulation of training data strongly influences model comprehension. We introduce a two‑stage annotation pipeline: first, we collect videos with clear human actions from the web; second, we annotate them using a standardized description format that distinguishes individuals via attributes and details their actions and interactions chronologically. This process yields two datasets, HAICTrain (126K video‑caption pairs generated and validated by our pipeline) and HAICBench (500 manually annotated video‑caption pairs plus 1,400 QA pairs for comprehensive human‑action evaluation). Experiments show that training on HAICTrain markedly improves human‑action understanding across multiple public benchmarks and also enhances video‑to‑text generation quality.
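The standardized format described above, identifying individuals by attributes and ordering their actions chronologically, can be pictured as a small structured record that renders to a caption. The field names and rendering below are hypothetical illustrations, not HAIC's actual schema.

```python
# Hypothetical clip annotation in the spirit of HAIC's format:
# subjects are keyed by attribute descriptions, events are chronological.
clip = {
    "subjects": {
        "person_1": "a man in a red jacket",
        "person_2": "a woman with a guitar",
    },
    "events": [  # chronological order
        ("person_1", "waves at the camera"),
        ("person_2", "starts playing the guitar"),
    ],
}

def render_caption(clip):
    """Expand subject references into attribute descriptions and join
    the events in temporal order."""
    subjects = clip["subjects"]
    steps = [f"{subjects[who]} {action}" for who, action in clip["events"]]
    return ". Then ".join(steps) + "."
```

Keying subjects by attributes rather than by name is what lets a caption disambiguate multiple people in the same frame, which the abstract identifies as a weak point of existing training data.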

Paper 05: GODBench – A Benchmark for Multimodal Large Language Models in Video Comment Art

Type: ACL25 Main

Paper link: https://stan-lei.github.io/KwaiMM-Dialogue/paper3-godbench.html

Abstract: Video‑comment art enriches user engagement by delivering humor, satire, or emotional resonance, demanding deep cultural and contextual understanding. While multimodal large language models (MLLMs) and chain‑of‑thought (CoT) reasoning excel in STEM tasks, they struggle to generate creative, resonant jokes or satire. Existing benchmarks suffer from single‑modality focus and limited category coverage, hindering comprehensive assessment of creative video‑comment generation. To fill this gap, we introduce GODBench, a novel benchmark that fuses video and text modalities to systematically evaluate MLLMs' ability to produce video‑comment art. Inspired by ripple patterns in physics, we also propose Ripple of Thought (RoT), a multi‑step reasoning framework designed to boost MLLM creativity. Extensive experiments reveal that current MLLMs and CoT methods face substantial challenges in creative video‑comment generation, whereas RoT offers an effective pathway toward significant creative breakthroughs.

Paper 06: Mixture of Decoding – An Attention‑Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision‑Language Models

Type: ACL25 Findings

Paper link: https://arxiv.org/pdf/25.05.170

Abstract: Large vision‑language models (LVLMs) achieve impressive performance across visual tasks but remain plagued by hallucinations. We propose Mixture of Decoding (MoD), a novel strategy that dynamically adjusts decoding based on the correctness of the model's attention to image tokens. MoD measures consistency between outputs generated from the original image tokens and those generated from the tokens the model actually attends to. Consistent outputs indicate correct attention, prompting MoD to amplify key information via a complementary strategy; inconsistent outputs signal erroneous attention, leading MoD to apply a contrasting strategy that suppresses misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods on multiple mainstream benchmarks, effectively reducing hallucinations in LVLMs.
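The amplify‑or‑contrast rule can be sketched at the logit level: if decoding from the full image tokens and from the attended‑to tokens agree on the next token, add the two views (complementary); if they disagree, subtract the attended view (contrastive). This is a simplified sketch, assuming raw logit lists and a single mixing weight `alpha`; the paper's consistency measure and mixing are more elaborate.

```python
def argmax(xs):
    """Index of the largest logit."""
    return max(range(len(xs)), key=xs.__getitem__)

def mixture_of_decoding(logits_full, logits_attended, alpha=1.0):
    """If both views agree on the next token, amplify (complementary
    strategy); otherwise contrast to suppress the misleading signal."""
    if argmax(logits_full) == argmax(logits_attended):
        return [f + alpha * a for f, a in zip(logits_full, logits_attended)]
    return [f - alpha * a for f, a in zip(logits_full, logits_attended)]
```

Agreement between the two views is treated as evidence that the model attended to the right image regions, so their signal is reinforced; disagreement flips the attended view into a penalty.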

Paper 07: VidCapBench – A Comprehensive Benchmark of Video Captioning for Controllable Text‑to‑Video Generation

Type: ACL25 Findings

Paper link: https://arxiv.org/pdf/2502.12782

Abstract: Controllable text‑to‑video (T2V) models rely heavily on the alignment between videos and their textual descriptions, yet existing research rarely connects video‑caption evaluation with T2V generation assessment. We introduce VidCapBench, a benchmark designed for T2V that evaluates video captions independent of any specific description format. VidCapBench employs a data‑annotation pipeline combining expert model labeling and human refinement, linking each video to key information covering aesthetics, content, motion, and physical laws. These attributes are split into automatically assessable and manually assessable subsets to satisfy both rapid agile development and thorough verification needs. Evaluations of numerous state‑of‑the‑art caption models show that VidCapBench offers superior stability and comprehensiveness compared to existing video‑caption metrics, ensuring that the benchmark measures caption quality rather than the evaluator model’s ability. Correlation analysis with T2V quality metrics confirms that higher VidCapBench scores predict better T2V generation, making the benchmark valuable for guiding T2V model training.

These seven papers collectively showcase Kuaishou's deep research in large‑model alignment, safety, decoding, multimodal understanding, and benchmark creation, underscoring the company’s commitment to advancing AI technology.

Tags: multimodal AI, large language models, benchmark, video understanding, ACL 2025
Written by

Kuaishou Large Model

Official Kuaishou Account
