How CollabLLM Redefines LLM Collaboration with Multi‑Turn Training

CollabLLM tackles the limitations of large language models in everyday multi‑turn dialogues by introducing a user‑centric, multi‑turn training framework that leverages simulated interactions, multi‑round reward modeling, and veRL toolchain support, achieving superior performance over single‑turn baselines.


LLM Collaboration Bottleneck and CollabLLM Innovation

Large language models (LLMs) can solve complex tasks, yet they often stumble in ordinary multi‑turn conversations: they make unfounded assumptions, overlook key details, and fail to ask clarifying questions, all of which erodes user trust.

The root cause lies in training and evaluation methods that focus on single‑turn, instruction‑following prompts and reward only immediate responses, neglecting the collaborative nature of real‑world dialogue.

Traditional reinforcement learning rewards single‑turn optimality; optimizing for whole‑conversation (global) optimality requires a different approach. CollabLLM provides one, and it is now available as a veRL recipe with reproducible scripts for supervised fine‑tuning (SFT) and reinforcement learning (RL).

CollabLLM: User‑Centric Training Method

CollabLLM places the model in a simulated user interaction environment, using reinforcement learning to learn when to ask questions and how to adapt tone and communication style, bridging the gap between LLM training and actual usage.

This approach earned the ICML Outstanding Paper Award. The training framework simulates multi‑turn interactions, emphasizing that a response’s value includes facilitating the overall dialogue, not just immediate usefulness.
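The core idea can be sketched as follows: a response is scored by sampling plausible continuations of the conversation and judging those futures, rather than judging the single turn in isolation. All names below (policy, user_sim, judge) are illustrative placeholders, not the paper's or the veRL recipe's actual API:

# Minimal sketch of a multiturn-aware reward in the CollabLLM spirit.
# policy, user_sim, and judge are any callables; this is an illustration,
# not the veRL recipe's implementation.

def multiturn_aware_reward(policy, user_sim, judge, history, response,
                           num_samples=3, horizon=2):
    """Average judge score over sampled continuations of the conversation."""
    total = 0.0
    for _ in range(num_samples):
        convo = history + [{"role": "assistant", "content": response}]
        for _ in range(horizon):
            # Roll the dialogue forward: simulated user turn, then model turn.
            convo.append({"role": "user", "content": user_sim(convo)})
            convo.append({"role": "assistant", "content": policy(convo)})
        total += judge(convo)   # e.g. LLM-as-a-judge scoring the whole dialogue
    return total / num_samples

Because the reward looks at sampled futures, a response that asks a good clarifying question can outscore one that answers immediately but sends the conversation down the wrong path.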

veRL Toolchain: Technical Support for CollabLLM Training

The veRL toolchain provides three key features for implementing CollabLLM:

Interaction: enables dynamic, multi‑turn conversational feedback during RL training, with a user simulator that can be extended via a base interaction class (see the sketch after this list).

Asynchronous reward computation: distributes reward calculation across dialogue completions to avoid API rate‑limit spikes and reduce training time.

Custom agent loop: a generic interface for multi‑turn agents that uses a token‑in‑token‑out mechanism to preserve training precision.
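As a rough illustration of that extension point, a custom user simulator could look like the following; the class shape and method names are assumptions for illustration rather than veRL's exact base‑interaction interface:

# Hedged sketch of a user simulator that plays the user's side of the
# dialogue via an OpenAI-compatible endpoint. The class and method names
# are assumptions; consult the veRL interaction docs for the real interface.

from openai import OpenAI

class SimulatedUserInteraction:
    """Generates the simulated user's next turn with an external chat model."""

    def __init__(self, config: dict):
        self.client = OpenAI(base_url=config["base_url"],
                             api_key=config["api_key"])
        self.model = config["user_model"]

    def generate_user_turn(self, messages: list[dict]) -> str:
        """Given the dialogue so far (OpenAI-style messages), return the
        simulated user's next message."""
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system",
                       "content": "You are simulating the user in this chat."}]
                     + messages,
            max_tokens=2048,
            temperature=1.0,
        )
        return resp.choices[0].message.content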

An interaction entry registers the CollabLLM user simulator (the endpoint ID and API key below are placeholders):

interaction:
- name: "collabllm"
  class_name: "verl.interactions.collabllm_interaction.CollabLLMInteraction"
  config:
    user_model: "volcengine/ep-20250901171555-xxxx"
    base_url: "https://ark.cn-beijing.volces.com/api/v3"
    api_key: "<api key>"
    num_retries: 6
    max_tokens: 2048
    temperature: 1.0

On the reward side, command‑line overrides select the CollabLLM reward manager, weight the individual metrics, and configure the LLM judge:

reward_model.reward_manager=collabllm \
+reward_model.reward_kwargs.metric_weights.accuracy=1 \
+reward_model.reward_kwargs.metric_weights.interactivity=1 \
+reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \
+reward_model.reward_kwargs.llm_judge_kwargs.model=volcengine/ep-20250901171555-xxxx \
+reward_model.reward_kwargs.llm_judge_kwargs.base_url=https://ark.cn-beijing.volces.com/api/v3 \
+reward_model.reward_kwargs.llm_judge_kwargs.api_key=<api key> \
+reward_model.reward_kwargs.llm_judge_kwargs.max_tokens=2048 \
+reward_model.reward_kwargs.llm_judge_kwargs.temperature=0
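Given those weights, the final scalar reward is, in effect, a weighted sum of the per‑metric scores. A minimal sketch of that combination, assuming each metric has already been computed per conversation (this mirrors the metric_weights above, not the reward manager's actual code):

# Minimal sketch: combine per-metric scores with the configured weights.
# Mirrors the metric_weights overrides above; not the reward manager's code.

metric_weights = {"accuracy": 1.0, "interactivity": 1.0, "token_amount": -0.0001}

def combine_reward(metrics: dict[str, float]) -> float:
    """Weighted sum: rewards accuracy and interactivity, penalizes length."""
    return sum(metric_weights[name] * value for name, value in metrics.items())

# Example: a correct, fairly interactive but verbose completion.
print(combine_reward({"accuracy": 1.0, "interactivity": 0.8, "token_amount": 1500}))
# 1.0*1.0 + 1.0*0.8 - 0.0001*1500 = 1.65

The small negative weight on token count acts as a gentle length penalty, discouraging the model from padding responses to game the positive metrics.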

Training Practice: Key Issues and Solutions

Language Inconsistency : When using Qwen as the base model, the model may mix languages during RL training. Performing SFT before RL eliminates this issue; the problem does not appear with Llama.

Reward Hacking : Models sometimes learn to produce nonsensical or repetitive outputs due to coarse reward signals from the LLM‑as‑a‑judge. Introducing three explicit metrics—task completion (positive), interactivity (positive), and token count (negative)—and filtering outputs that do not meet strict formatting rules mitigates this problem.
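As a rough illustration of the filtering step, reward can be gated on simple structural checks before any judge scoring; the specific rules below are illustrative assumptions, not the recipe's exact criteria:

import re

# Illustrative format filter: completions failing these checks receive zero
# reward before judge scoring. The rules are examples, not the recipe's own.

def passes_format_rules(text: str) -> bool:
    text = text.strip()
    if not text:
        return False
    # Reject degenerate repetition, e.g. the same 20+ char chunk looping 4+ times.
    if re.search(r"(.{20,}?)\1{3,}", text, re.DOTALL):
        return False
    # Reject very low character diversity, another symptom of collapsed output.
    if len(text) > 100 and len(set(text)) < 10:
        return False
    return True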

Fine‑Grained LLM Judge Scoring : Detailed scoring criteria for interactivity, user understanding, and suggestion helpfulness improve the judge’s ability to differentiate model responses.

Scoring criteria (each sub‑score in [0, 1]):
- U: user understanding and response clarity
- Q: clarification
- S: suggestion helpfulness
- score = (U + Q + S) / 3
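In practice, fine‑grained judging can be implemented by asking the judge for per‑criterion scores and averaging them. A hedged sketch, assuming an OpenAI‑compatible endpoint like the one configured above (the rubric wording and JSON parsing are illustrative, not the recipe's actual judge):

import json
from openai import OpenAI

# Illustrative fine-grained judge: request per-criterion scores as JSON and
# average them. Prompt and parsing are sketches, not the recipe's code.

JUDGE_RUBRIC = (
    "Score the assistant's last response on three criteria, each in [0, 1]:\n"
    "U: how well it understands the user and responds clearly,\n"
    "Q: whether it asks for clarification when information is missing,\n"
    "S: how helpful its suggestions are.\n"
    'Reply with JSON only, e.g. {"U": 0.8, "Q": 0.5, "S": 0.9}.'
)

def judge_score(client: OpenAI, model: str, dialogue: str) -> float:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,   # deterministic judging, matching the config above
        messages=[{"role": "system", "content": JUDGE_RUBRIC},
                  {"role": "user", "content": dialogue}],
    )
    scores = json.loads(resp.choices[0].message.content)
    return (scores["U"] + scores["Q"] + scores["S"]) / 3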

A100 Training Precision Issue: Gradient‑norm explosion was observed on A100 GPUs, caused by mismatched inference and training precision in vLLM's flash‑attention implementation. Switching to a different GPU model or disabling cascade attention resolved the issue; the problem is tracked in the vLLM flash‑attention repository.

Tags: LLM, reinforcement learning, multi‑turn dialogue, veRL, collaborative training