Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling

This article presents an ASR-driven pipeline that builds high-quality <AI-reply, human-like reply> training pairs, uses them to train a rewrite model and a reward model, and applies GRPO reinforcement learning so the model generates natural, helpful, less AI-sounding responses for digital-human live streaming. The final system reaches 92% accuracy and 97% helpfulness while improving the viewing experience.

DaTaobao Tech

The rapid rise of large language models (LLMs) has enabled many text‑generation scenarios, but in digital‑human live streaming the generated replies often feel overly formal and exhibit a strong "AI" impression, which degrades the realism of voice‑over after text‑to‑speech conversion.

Problem Statement

Although LLMs provide correct and helpful answers, their responses are typically written in a formal style, lacking the conversational tone required for live digital avatars. Existing solutions—prompt engineering, persona prompts, or fine‑tuning on generic data—only partially reduce the AI feel and still fall short of true human‑like interaction.

Proposed Solution

The authors introduce a two‑stage approach based on real‑time ASR data from live streams:

Human-like Training Data Generation: Build a high-quality <AI-reply, human-like reply> pair dataset by automatically cleaning noisy ASR transcripts, aligning them with the corresponding product details, and constructing parallel rewrite pairs. The pipeline includes ASR quality filtering, logical repair, and content-consistency alignment.

Reward-Based Reinforcement Learning: Train a rewrite model via supervised fine-tuning (SFT) on the cleaned pairs, then train a binary reward model that judges how human-like a reply is. Integrate this reward into a GRPO (Group Relative Policy Optimization) RL loop, allowing the base model to generate human-like responses directly, without a separate rewrite step.
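The article does not show the exact prompt template used for SFT. As a rough, hedged sketch, the cleaned pairs could be packed into chat-style training examples for the rewrite model; the field names (product_context, ai_reply, human_reply), file names, and instruction wording below are assumptions rather than the authors' actual format.

```python
# Hypothetical sketch: turn cleaned <AI-reply, human-like reply> pairs into
# chat-style SFT examples for the rewrite model. Field names, file names,
# and the instruction wording are assumptions, not the original template.
import json

REWRITE_INSTRUCTION = (
    "Rewrite the assistant reply so it sounds like a live-stream host: "
    "conversational, concise, and faithful to the product facts."
)

def to_sft_example(pair: dict) -> dict:
    """pair = {"product_context": ..., "ai_reply": ..., "human_reply": ...}"""
    return {
        "messages": [
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": f"Product info:\n{pair['product_context']}\n\n"
                                        f"Original reply:\n{pair['ai_reply']}"},
            {"role": "assistant", "content": pair["human_reply"]},
        ]
    }

if __name__ == "__main__":
    with open("cleaned_pairs.jsonl") as fin, open("sft_data.jsonl", "w") as fout:
        for line in fin:
            example = to_sft_example(json.loads(line))
            fout.write(json.dumps(example, ensure_ascii=False) + "\n")
```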

Data Cleaning Pipeline

The cleaning process consists of two parts:

Part 1 – ASR Cleaning: Filter out short or ambiguous ASR transcripts, generate the matching product details and questions, and repair broken sentence boundaries.

Part 2 – Content Alignment: Ensure the ASR and the generated online replies convey the same meaning, then filter out mismatched or overly short pairs. The resulting dataset contains over 3,000 high-quality pairs derived from 30,000 raw ASR entries.
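The exact filtering rules and thresholds are not disclosed. The sketch below illustrates, under assumed helper names and an assumed length threshold, how the two parts could be wired together; the semantic-consistency check is stubbed out where a judge model would normally be called.

```python
# Hypothetical sketch of the two-part cleaning pipeline. The length threshold,
# the judge logic, and the helper names are assumptions for illustration only.
from typing import Iterable

MIN_ASR_CHARS = 10  # assumed threshold for "too short / ambiguous" ASR

def asr_is_usable(asr_text: str) -> bool:
    """Part 1: drop short or obviously broken ASR transcripts."""
    return len(asr_text.strip()) >= MIN_ASR_CHARS

def same_meaning(asr_text: str, online_reply: str) -> bool:
    """Part 2: content-consistency check, e.g. via an LLM judge (stubbed here)."""
    # In practice this would call a judge model; always-True keeps the sketch runnable.
    return True

def clean_pairs(raw: Iterable[dict]) -> list[dict]:
    kept = []
    for item in raw:  # item = {"asr": ..., "online_reply": ..., "product_context": ...}
        if not asr_is_usable(item["asr"]):
            continue
        if not same_meaning(item["asr"], item["online_reply"]):
            continue
        kept.append(item)
    return kept
```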

Model Training

Several Qwen2.5 models (1.5B, 7B, and 72B) were evaluated; the 7B-Instruct variant offered the best trade-off between performance and latency. Fine-tuning hyper-parameters were:

```
learning_rate=2e-5
epoch=4
lr_scheduler_type='cosine'
```
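For orientation, these settings map directly onto a standard Hugging Face training setup. In the sketch below, the batch size, output directory, and precision flag are assumptions, and the dataset and collator wiring is omitted; it is not the authors' training script.

```python
# Sketch only: mapping the reported hyper-parameters onto Hugging Face
# TrainingArguments. Batch size, output_dir, and bf16 are assumptions;
# dataset loading and the data collator are omitted.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "Qwen/Qwen2.5-7B-Instruct"  # 7B-Instruct chosen in the article
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="rewrite-sft",          # assumed
    learning_rate=2e-5,                # as reported
    num_train_epochs=4,                # as reported
    lr_scheduler_type="cosine",        # as reported
    per_device_train_batch_size=4,     # assumed
    bf16=True,                         # assumed
)
# A Trainer (or TRL's SFTTrainer) would then be constructed with `args`,
# the tokenized pair dataset, and a causal-LM data collator.
```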

Training curves show that the cleaned data lowers the initial loss and converges to a lower final loss, indicating that the style transfer is easier to learn from the cleaned pairs.

Reward Model and GRPO Training

The reward model is a BERT-based binary classifier trained on <AI-reply, human-like reply> pairs, with frequent style tokens masked during training so the classifier cannot rely on shortcut cues. During GRPO, multiple rewards (accuracy, helpfulness, style, length) are combined; the length reward is shaped to encourage replies 1.2–1.5× longer than the reference.
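The article does not give the reward weights or the exact length-shaping formula. The sketch below shows one plausible way to combine a style probability from the BERT-style classifier with a length term that peaks in the 1.2–1.5× range; the checkpoint path, the weights, the class-index convention, and the external accuracy/helpfulness scorers are all assumptions.

```python
# Hypothetical combined GRPO reward. Checkpoint path, weights, class index,
# and the accuracy/helpfulness scorers are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

STYLE_CKPT = "path/to/human-likeness-bert"  # assumed checkpoint path
style_tok = AutoTokenizer.from_pretrained(STYLE_CKPT)
style_clf = AutoModelForSequenceClassification.from_pretrained(STYLE_CKPT)

def style_reward(reply: str) -> float:
    """Probability that the reply is human-like (class 1 assumed)."""
    inputs = style_tok(reply, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = style_clf(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def length_reward(reply: str, reference: str) -> float:
    """Peaks when the reply is 1.2-1.5x the reference length, as described."""
    ratio = len(reply) / max(len(reference), 1)
    if 1.2 <= ratio <= 1.5:
        return 1.0
    return max(0.0, 1.0 - abs(ratio - 1.35))

def combined_reward(reply: str, reference: str,
                    acc_score: float, help_score: float) -> float:
    # acc_score / help_score come from separate judges (not shown); weights assumed.
    return (0.4 * acc_score + 0.3 * help_score
            + 0.2 * style_reward(reply) + 0.1 * length_reward(reply, reference))
```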

Experimental Results

Evaluation on a held‑out test set shows the final system achieves 92.0% accuracy and 97.0% helpfulness, with human judges rating 85.5% of the generated audio as more natural than the baseline. Sample comparisons illustrate improvements in word order, emphasis of key points, and removal of formal phrasing.

Conclusion and Future Work

The ASR-driven data pipeline and reward-augmented RL effectively reduce the AI feel of live-stream replies while preserving correctness. Future work includes developing a graded (rather than binary) human-likeness score, incorporating reasoning steps that are missing from ASR, extending the method to longer texts, and exploring factor-based style-transfer techniques.

