What Kuaishou’s Four ACL Papers Reveal About the Future of Large Language Models
At the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), Kuaishou had four papers accepted, covering multi‑turn instruction following, self‑agreement reasoning, fine‑grained reinforcement learning, and dynamic routing in Mixture‑of‑Experts models. This post summarizes each paper's core method and main results, with links to the public arXiv preprints.
Parrot – Enhancing Multi‑Turn Instruction Following for Large Language Models
Link: https://arxiv.org/pdf/2310.07301
Core contribution: Introduces an efficient pipeline for constructing large‑scale multi‑turn instruction data and a novel fine‑tuning objective.
Data collection: An "ask" model, trained on publicly available ShareGPT logs, simulates the user and interacts with a target LLM to generate dialogues of more than ten turns. The process captures phenomena such as coreference and ellipsis that are common in real multi‑turn interactions.
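To make the interaction loop concrete, here is a minimal sketch of how such a user‑simulator rollout could work. The ask_model and answer_model callables are hypothetical stand‑ins for the trained "ask" model and the target LLM, not the paper's actual code.

```python
def simulate_dialogue(ask_model, answer_model, seed_query, num_turns=10):
    # Hypothetical callables: ask_model / answer_model each map a list of
    # chat turns to the next utterance.
    turns = [{"role": "user", "content": seed_query}]
    for _ in range(num_turns):
        turns.append({"role": "assistant", "content": answer_model(turns)})
        # The trained "ask" model plays the user; its follow-up questions can
        # contain coreference and ellipsis that point back to earlier turns.
        turns.append({"role": "user", "content": ask_model(turns)})
    return turns
```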
Context‑Aware Preference Optimization (CaPO): For each dialogue, the method creates context‑aware positive‑negative pairs by editing the dialogue context (e.g., deleting, replacing, or adding noise). These pairs are used to fine‑tune the LLM, encouraging it to prefer responses that are consistent with the full conversational context.
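A rough illustration of how one context‑aware preference pair could be assembled is shown below. The perturbation (dropping the earlier half of the turns) and the llm_generate helper are hypothetical simplifications for the sketch, not the paper's exact construction.

```python
def make_capo_pair(history, last_query, llm_generate):
    # history: list of earlier dialogue turns; llm_generate is a hypothetical
    # callable that produces a response given a conversational context.
    full_context = history + [last_query]
    # Perturb the context, e.g. by dropping the first half of the turns;
    # the paper also considers replacements and added noise.
    perturbed_context = history[len(history) // 2:] + [last_query]
    chosen = llm_generate(full_context)         # response consistent with the full context
    rejected = llm_generate(perturbed_context)  # response that ignores earlier turns
    return {"prompt": full_context, "chosen": chosen, "rejected": rejected}
```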
Evaluation benchmark: Extends MT‑Bench to MT‑Bench++ with eight dialogue rounds, providing a more rigorous test of multi‑turn instruction adherence.
Result: CaPO‑fine‑tuned models achieve noticeable gains on MT‑Bench++ compared with standard supervised fine‑tuning, demonstrating better handling of complex, multi‑turn instructions.
Just Ask One More Time! Self‑Agreement Improves Reasoning of Language Models in (Almost) All Scenarios
Link: https://arxiv.org/abs/2311.08154
Core contribution: A generic ensemble‑style inference method that does not require task‑specific annotations or external selector models.
Sampling phase: The language model generates multiple distinct reasoning paths for a given problem by sampling from its decoder.
Self‑Agreement phase: The same model is prompted with the set of sampled paths and asked to choose the answer that is most consistent across them. This leverages the model’s own knowledge to resolve ambiguities.
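In code, the two phases might look like the following sketch, where llm_generate(prompt, temperature) is a hypothetical wrapper around the model's sampling interface rather than the authors' implementation.

```python
def self_agreement(llm_generate, question, n_samples=5):
    # Phase 1: sample several independent reasoning paths at a non-zero temperature.
    paths = [
        llm_generate(f"Q: {question}\nLet's think step by step.", temperature=0.8)
        for _ in range(n_samples)
    ]
    # Phase 2: show the same model all sampled paths and ask it to pick the
    # answer that the paths agree on most.
    selection_prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(f"Reasoning path {i + 1}:\n{p}" for i, p in enumerate(paths))
        + "\n\nWhich final answer do most of the paths above agree on? Reply with that answer only."
    )
    return llm_generate(selection_prompt, temperature=0.0)
```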
Advantages: Avoids the pitfalls of greedy decoding (repetition, locally optimal answers) and removes the need for the rule‑based answer extraction and majority voting used by Self‑Consistency, as well as for a separately trained selector model.
Empirical results: Outperforms existing baselines on six public reasoning benchmarks, showing both higher accuracy and stronger robustness to variations in prompt format.
Improving Large Language Models via Fine‑grained Reinforcement Learning with Minimum Editing Constraint
Link: https://arxiv.org/abs/2401.06081
Core contribution: Proposes RLMEC, a reinforcement‑learning framework that provides token‑level supervision by using a generative reward model.
Generative reward model: Trained to label each token of a model's answer as correct or incorrect and to rewrite the answer into a correct reference using the smallest possible set of edits.
Minimum editing constraint: The reward model learns to produce the minimal edit sequence, ensuring that supervision focuses on the exact error locations rather than the whole answer.
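As a rough stand‑in for the generative reward model, the sketch below derives per‑token labels from a minimum‑edit alignment between the model's answer and a corrected reference; difflib is used purely for illustration and is not what the paper uses.

```python
import difflib

def token_rewards_from_edit(answer_tokens, reference_tokens):
    # Tokens that survive the minimum-edit alignment against the corrected
    # reference are labeled 1.0 (correct); tokens that would need to be edited
    # are labeled 0.0, so supervision concentrates on the error spans.
    rewards = [0.0] * len(answer_tokens)
    matcher = difflib.SequenceMatcher(a=answer_tokens, b=reference_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            rewards[i] = 1.0
    return rewards

# Example: only the wrong final token receives zero reward.
print(token_rewards_from_edit(["2", "+", "2", "=", "5"], ["2", "+", "2", "=", "4"]))
# [1.0, 1.0, 1.0, 1.0, 0.0]
```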
Training objectives:
Token‑level RL loss that maximizes the probability of correct tokens according to the reward model.
Imitation‑learning regularizer that encourages the policy model to follow the minimally edited reference, stabilizing training.
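Put together, a toy version of the objective could be a reward‑weighted token‑level negative log‑likelihood plus an imitation term on the minimally edited reference. This is an illustrative simplification under assumed tensor shapes, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def rlmec_style_loss(ans_logits, ans_ids, token_rewards, ref_logits, ref_ids, alpha=1.0):
    # ans_logits:    (T, V) policy logits over the model's own answer tokens
    # token_rewards: (T,)   per-token correctness weights from the reward model
    # ref_logits:    (S, V) policy logits over the minimally edited reference
    logp = F.log_softmax(ans_logits, dim=-1).gather(-1, ans_ids.unsqueeze(-1)).squeeze(-1)
    rl_term = -(token_rewards * logp).mean()          # up-weight tokens judged correct
    imit_term = F.cross_entropy(ref_logits, ref_ids)  # follow the minimally edited reference
    return rl_term + alpha * imit_term
```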
Results: Experiments on eight complex reasoning tasks (including mathematical problem solving) show significant accuracy improvements over standard RLHF and over instance‑level reward baselines.
Harder Tasks Need More Experts – Dynamic Routing in Mixture‑of‑Experts Models
Link: https://arxiv.org/abs/2403.07652
Core contribution: Replaces the static Top‑K expert selection in MoE architectures with a dynamic Top‑P routing strategy that adapts the number of activated experts to input difficulty.
Dynamic Top‑P routing: Experts are ranked by their router confidence scores and activated one by one until the cumulative confidence exceeds a predefined threshold p. Harder tokens therefore engage more experts, while easier tokens use fewer.
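A minimal sketch of such a routing rule, written independently of any particular MoE implementation:

```python
import torch

def top_p_expert_mask(router_logits, p=0.4):
    # router_logits: (num_tokens, num_experts). Returns a boolean mask of the
    # experts activated for each token under cumulative-confidence routing.
    probs = torch.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep adding experts while the probability mass gathered so far is below p;
    # the expert that pushes the total past p is the last one activated.
    keep_sorted = (cumulative - sorted_probs) < p
    return torch.zeros_like(probs, dtype=torch.bool).scatter(-1, sorted_idx, keep_sorted)

# Flatter (less confident) router distributions activate more experts.
logits = torch.tensor([[4.0, 0.1, 0.1, 0.1], [1.0, 0.9, 0.8, 0.7]])
print(top_p_expert_mask(logits, p=0.6).sum(dim=-1))  # tensor([1, 3])
```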
Parameter efficiency: Compared with fixed Top‑2 routing, the Top‑P method activates less than 90 % of the parameters that Top‑2 activates on average, while still yielding a 0.7 % absolute accuracy gain on several evaluation tasks.
Layer‑wise behavior: Analysis shows that lower layers tend to activate more experts, whereas higher layers activate fewer, suggesting a depth‑dependent allocation of capacity.
Implications: The approach provides a simple, tunable mechanism to balance computational cost and performance, and can be integrated into existing MoE frameworks without architectural changes.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
