Human‑like LLM Replies for Live Digital Hosts: ASR‑Based Style Transfer and Reward Modeling
This article proposes an ASR‑driven pipeline that creates high‑quality <AI‑reply, human‑like reply> pairs, trains a rewrite model and a reward model, and uses GRPO reinforcement learning to generate natural, helpful, and less AI‑sounding responses for digital‑human live streaming. The resulting system achieves 92% accuracy and 97% helpfulness while improving user experience.
The rapid rise of large language models (LLMs) has enabled many text‑generation scenarios, but in digital‑human live streaming the generated replies often feel overly formal and exhibit a strong "AI" impression, which degrades the realism of voice‑over after text‑to‑speech conversion.
Problem Statement
Although LLMs provide correct and helpful answers, their responses are typically written in a formal style, lacking the conversational tone required for live digital avatars. Existing solutions—prompt engineering, persona prompts, or fine‑tuning on generic data—only partially reduce the AI feel and still fall short of true human‑like interaction.
Proposed Solution
The authors introduce a two‑stage approach based on real‑time ASR data from live streams:
Human‑like Training Data Generation : Build a high‑quality <AI‑reply, human‑like reply> pair dataset by automatically cleaning noisy ASR transcripts, aligning them with corresponding product details, and constructing parallel rewrite pairs. This pipeline includes ASR quality filtering, logical repair, and content‑consistency alignment.
Reward‑Based Reinforcement Learning : Train a rewrite model via supervised fine‑tuning (SFT) on the cleaned pairs, then train a binary reward model that judges the human‑likeness of a reply. Integrate this reward into a GRPO (Group Relative Policy Optimization) RL loop, allowing the base model to generate human‑like responses directly, without a separate rewrite step.
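The core idea of GRPO can be sketched as follows: for each prompt, the policy samples a group of candidate replies, the reward model scores them, and each sample's advantage is its reward normalized against the group's mean and standard deviation. A minimal illustration in pure Python, with hypothetical reward values standing in for the reward model's scores:

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against the group mean and std,
    as in Group Relative Policy Optimization (GRPO)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward-model scores for 4 sampled replies to one prompt.
rewards = [0.9, 0.4, 0.7, 0.2]
advantages = group_relative_advantages(rewards)
# Replies scored above the group mean receive positive advantages,
# pushing the policy toward more human-like phrasings.
```

Because the advantage is computed relative to the group, GRPO needs no separate value network, which keeps the RL loop lightweight.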
Data Cleaning Pipeline
The cleaning process consists of two parts:
Part 1 – ASR Cleaning : Filter out short or ambiguous ASR transcripts, generate the corresponding product details and questions, and repair broken sentence boundaries.
Part 2 – Content Alignment : Ensure the ASR and generated online replies convey the same meaning, then filter out mismatched or overly short pairs. The resulting dataset contains over 3,000 high‑quality pairs derived from 30,000 raw ASR entries.
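The alignment and filtering step can be sketched as below. The token‑level Jaccard overlap is only a stand‑in assumption for whatever content‑consistency measure the authors actually use, and the length and overlap thresholds are illustrative:

```python
def token_jaccard(a: str, b: str) -> float:
    """Crude content-consistency proxy: token-level Jaccard overlap."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def filter_pairs(pairs, min_len=10, min_overlap=0.2):
    """Keep <AI-reply, human-like reply> pairs that are long enough
    and convey roughly the same content."""
    kept = []
    for ai_reply, human_reply in pairs:
        if len(human_reply.split()) < min_len:
            continue  # drop overly short rewrites
        if token_jaccard(ai_reply, human_reply) < min_overlap:
            continue  # drop pairs whose content diverged
        kept.append((ai_reply, human_reply))
    return kept
```

In practice an embedding‑based similarity would be more robust than token overlap, but the filtering logic is the same: discard pairs that are too short or semantically mismatched.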
Model Training
Various Qwen‑2.5 models (1.5B, 7B, 72B) were evaluated; the 7B‑instruct variant offered the best trade‑off between performance and latency. Fine‑tuning hyper‑parameters were:
learning_rate=2e-5
epoch=4
lr_scheduler_type='cosine'

Training curves show that the cleaned data reduces the initial loss and converges to a lower final loss, indicating that the style transfer is easier to learn.
Reward Model and GRPO Training
The reward model is a BERT‑based binary classifier trained on <AI‑reply, human‑like reply> pairs, with careful masking of frequent style tokens to avoid shortcut learning. During GRPO, multiple rewards (accuracy, helpfulness, style, length) are combined; length rewards are adjusted to encourage replies 1.2–1.5× longer than the reference.
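The combined reward can be sketched as a weighted sum with a length term that peaks when the reply is 1.2–1.5× the reference length. The weights and the linear shaping outside the target band are illustrative assumptions, not the paper's exact values:

```python
def length_reward(reply_len: int, ref_len: int,
                  lo: float = 1.2, hi: float = 1.5) -> float:
    """1.0 when reply_len/ref_len falls in the target band [lo, hi],
    decaying linearly outside it (shaping is an illustrative choice)."""
    ratio = reply_len / max(ref_len, 1)
    if lo <= ratio <= hi:
        return 1.0
    gap = lo - ratio if ratio < lo else ratio - hi
    return max(0.0, 1.0 - gap)

def combined_reward(accuracy: float, helpfulness: float,
                    style: float, reply_len: int, ref_len: int,
                    weights=(0.3, 0.3, 0.3, 0.1)) -> float:
    """Weighted mix of the individual reward signals (weights illustrative)."""
    wa, wh, ws, wl = weights
    return (wa * accuracy + wh * helpfulness + ws * style
            + wl * length_reward(reply_len, ref_len))
```

A reply 1.3× the reference length gets the full length reward, while one matching the reference exactly is slightly penalized, nudging the policy toward the more expansive, conversational register the authors target.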
Experimental Results
Evaluation on a held‑out test set shows the final system achieves 92.0% accuracy and 97.0% helpfulness, with human judges rating 85.5% of the generated audio as more natural than the baseline. Sample comparisons illustrate improvements in word order, emphasis of key points, and removal of formal phrasing.
Conclusion and Future Work
The ASR‑driven data pipeline and reward‑augmented RL effectively reduce AI‑sense in live‑stream replies while preserving correctness. Future directions include developing a graded human‑likeness scoring system, incorporating reasoning steps missing from ASR, extending the method to longer texts, and exploring factor‑based style transfer techniques.