How Reinforcement Learning Boosts Stability and Speed in LLM QA Systems

This article examines how reinforcement‑learning techniques such as PPO, DPO, and GRPO are integrated into the Baixiaosheng QA system to improve answer stability, deepen domain knowledge understanding, and accelerate response generation, and it evaluates the impact of Reinforcement Fine‑Tuning (RFT) on real‑world performance.

Zhuanzhuan Tech

Part 1: Introduction

With the rapid development of artificial‑intelligence technology, intelligent QA systems have become a crucial bridge between information and users, reshaping human‑computer interaction and enhancing service efficiency. "Baixiaosheng" is a RAG‑based large‑language‑model product that provides precise quality‑inspection knowledge answers for field engineers. After a year of iteration, the system has grown from a 10% pilot to nationwide coverage, serving over 3,000 engineers daily with a 90%+ answer‑accuracy rate. Its architecture evolved from a simple RAG pipeline to a complex agent that supports multi‑turn dialogue, proactive clarification, and mixed‑media answer generation.

As accuracy improves, optimization challenges shift from factual errors to answer stability, deep understanding of complex quality‑inspection knowledge, and faster response (streamlined reasoning). Traditional methods struggle with these continuous‑learning scenarios, prompting the exploration of reinforcement‑learning (RL) techniques—especially RLHF and the newer RLVR—to move models from generating seemingly correct answers to truly correct ones.

Integrating RL deeply into Baixiaosheng promises higher retrieval‑generation quality and continual improvement through human‑in‑the‑loop feedback, ultimately enhancing both precision and user satisfaction.

Part 2: RL Technique Options – PPO, DPO, and GRPO Overview

2.1 Proximal Policy Optimization (PPO)

PPO (Proximal Policy Optimization) limits the magnitude of each policy update to ensure training stability.

2.1.1 Core Formula

θ ← argmax_θ E[ min( r(θ)·Â, clip(r(θ), 1−ε, 1+ε)·Â ) ]

2.1.2 Components

Importance‑sampling ratio r(θ) = π_θ(a|s) / π_θ_old(a|s) measures how much the new policy's probability of an action deviates from the old policy's.

Advantage function evaluates how much better an action is compared to the average action in a given state.

Advantage calculation involves a reward model (RM) that scores generated answers and a critic network that estimates the expected return.

ε (typically 0.1–0.2) clips the probability ratio to the interval [1‑ε, 1+ε].
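To make the clipping mechanism concrete, here is a minimal PyTorch sketch of the clipped surrogate loss; the tensor shapes, variable names, and ε value are illustrative rather than drawn from any production system.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    logp_new:   log pi_theta(a|s) under the current policy
    logp_old:   log pi_theta_old(a|s) under the sampling policy (detached)
    advantages: advantage estimates A-hat for each sampled action
    eps:        clip range, typically 0.1-0.2
    """
    ratio = torch.exp(logp_new - logp_old)          # importance-sampling ratio r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the pessimistic (minimum) objective, then negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random values:
logp_old = torch.randn(8)
logp_new = logp_old + 0.1 * torch.randn(8)
adv = torch.randn(8)
loss = ppo_clip_loss(logp_new, logp_old, adv)
```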

2.2 Direct Preference Optimization (DPO)

DPO is an offline, reward‑model‑free preference‑learning algorithm that derives the optimal policy directly from pairwise human preferences, bypassing explicit reward modeling.

2.2.1 Core Formula

max_θ Σ_{(x,a⁺,a⁻)∈D} log σ( β·log( π_θ(a⁺|x) / π_ref(a⁺|x) ) − β·log( π_θ(a⁻|x) / π_ref(a⁻|x) ) )

2.2.2 Components

Samples (x, a⁺, a⁻) from a preference dataset D, where a⁺ is the chosen answer and a⁻ the rejected one.

A reference policy π_ref (typically the SFT model before preference optimization) prevents the new policy from drifting too far from its original capabilities.

The hyper‑parameter β controls sensitivity to the implicit reward difference and how far the policy may deviate from the reference policy.
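The objective can be expressed compactly in code. The sketch below assumes per‑sequence log‑probabilities have already been gathered for the policy and a frozen reference model; the function name and the β value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    logp_pos / logp_neg:         policy log-probs of chosen / rejected answers
    ref_logp_pos / ref_logp_neg: reference-model log-probs (no gradient)
    beta:                        strength of the implicit KL constraint
    """
    # Implicit rewards are beta-scaled log-ratios against the reference policy.
    chosen_reward = beta * (logp_pos - ref_logp_pos)
    rejected_reward = beta * (logp_neg - ref_logp_neg)
    # Maximize the log-sigmoid of the reward margin; negate for minimization.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

The σ term turns the reward margin into the probability of preferring a⁺ over a⁻, so minimizing this loss is equivalent to maximizing the likelihood of the observed preferences.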

2.3 Group‑Relative Policy Optimization (GRPO)

GRPO improves upon PPO by reducing reliance on an external critic and estimating advantages through relative comparisons within a batch of candidate outputs, which is especially suitable for discrete rewards and large‑scale LLM fine‑tuning.

2.3.1 Core Formula

Advantage_i = (r_i – μ_G) / (σ_G + ε)

where r_i is the reward of candidate i, μ_G and σ_G are the mean and standard deviation of rewards in the group, and ε prevents division by zero.

2.3.2 Components

q: the prompt (question); G: group size (number of candidates generated per prompt).

Old‑policy samples: a set of candidate outputs generated by the previous policy.

Group‑relative advantage computed as above; a PPO‑style clipping term (with its own clip ratio, distinct from the numerical ε above) bounds each policy update, just as in PPO.
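A minimal sketch of the group‑relative advantage computation follows, assuming one scalar reward per candidate answer; the group size and ε value are illustrative.

```python
import torch

def group_relative_advantages(rewards, eps=1e-4):
    """Compute GRPO advantages for one prompt.

    rewards: tensor of shape (G,) with one scalar reward per candidate answer.
    Returns a tensor of the same shape: (r_i - mean) / (std + eps).
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: 4 candidate answers generated for one question.
rewards = torch.tensor([0.9, 0.4, 0.7, 0.2])
adv = group_relative_advantages(rewards)
# Candidates above the group mean get positive advantages and are reinforced.
```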

2.4 RL Technique Summary

In large‑model alignment, PPO, DPO, and GRPO are three mainstream optimization algorithms that aim to make model outputs align with human preferences. PPO uses a clipping mechanism within an actor‑critic framework to ensure stable updates. DPO eliminates the need for a separate reward model by directly optimizing the policy with preference data, offering simplicity and strong stability. GRPO leverages intra‑group comparisons to provide advantage signals without a critic, making training lighter and well‑suited for resource‑constrained scenarios.

Part 3: Baixiaosheng System Reinforcement‑Fine‑Tuning (RFT) Practice

3.1 Reinforcement Fine‑Tuning (RFT)

Supervised fine‑tuning (SFT) is common for domain adaptation, but poor data quality or excessive epochs can cause over‑fitting and catastrophic forgetting. RFT combines SFT with RL (e.g., GRPO) in multiple stages, allowing SFT to establish solid base behavior while RL refines the model with reward signals for complex preferences and reasoning.

3.2 Baixiaosheng RFT Implementation

The Baixiaosheng QA model is trained with a two‑stage SFT + RL pipeline, where the RL stage employs the GRPO algorithm.

3.2.1 System Overview

In Zhuanzhuan’s on‑site recycling service, engineers first inspect the device and then select status options in the system, which automatically generates a recycling price. Accurate option selection is critical for fair valuation.

Baixiaosheng provides real‑time, standardized guidance. For example, when an engineer encounters “third‑party marking on the motherboard,” the system prompts the selection of “Motherboard‑Repair → Motherboard‑Has‑Third‑Party‑Marking” while noting exceptions such as “non‑repair markings.”

3.2.2 GRPO Reward Design and Training

During GRPO training, N candidate answers are generated per question; candidates whose rewards exceed the group mean are reinforced, while those below it are penalized. Two reward functions are used:

Similarity Reward: Measures semantic similarity between the model’s answer and a reference answer, originally scored with deepseek‑V3; a BERT‑based scorer distilled from deepseek‑V3 is used for faster scoring during training.

Repetition Penalty: Uses an embedding model to assess answer redundancy, discouraging “repetitive‑machine” outputs.

(Figure: scoring example illustrating similarity and repetition scores for sample answers.)
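As a rough sketch of how the two signals could be implemented, the code below scores similarity with a cross‑encoder and penalizes intra‑answer repetition with sentence embeddings. The model names, sentence splitting, and the way the two scores are combined are assumptions for illustration, not the production configuration (which distils a BERT scorer from deepseek‑V3).

```python
import re
from itertools import combinations

from sentence_transformers import CrossEncoder, SentenceTransformer, util

# Hypothetical model choices for illustration only.
similarity_scorer = CrossEncoder("cross-encoder/stsb-roberta-base")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_reward(answer: str, reference: str) -> float:
    """Semantic similarity between the candidate answer and the reference answer."""
    return float(similarity_scorer.predict([(answer, reference)])[0])

def repetition_penalty(answer: str) -> float:
    """Penalty that grows when sentences inside the answer repeat each other."""
    sentences = [s.strip() for s in re.split(r"[。.!?\n]", answer) if s.strip()]
    if len(sentences) < 2:
        return 0.0
    embeddings = embedder.encode(sentences, convert_to_tensor=True)
    sims = [float(util.cos_sim(embeddings[i], embeddings[j]))
            for i, j in combinations(range(len(sentences)), 2)]
    return max(0.0, sum(sims) / len(sims))

def total_reward(answer: str, reference: str) -> float:
    # Reward answers that match the reference and do not repeat themselves.
    return similarity_reward(answer, reference) - repetition_penalty(answer)
```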

3.2.3 Training Process

Key signals monitored during training:

KL divergence (kl) limits deviation from the original model, preventing catastrophic forgetting.

CrossEncoderSimilarityORM provides the similarity reward.

AntiRepetitionThoughtORM supplies the repetition penalty; lower repetition yields higher scores.
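Schematically, these signals come together in a per‑token GRPO‑style loss like the sketch below, where the group‑relative advantage drives the clipped policy term and a KL estimate against the frozen original model penalizes drift; the coefficients and the KL estimator shown are illustrative assumptions, not the production settings.

```python
import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, kl_coef=0.04):
    """Per-token GRPO-style loss for one candidate answer.

    logp_new / logp_old / logp_ref: per-token log-probs under the current,
        sampling, and original (reference) models, each of shape (T,).
    advantage: scalar group-relative advantage for this candidate.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Non-negative KL estimate toward the original model; this is what guards
    # against catastrophic forgetting during RL fine-tuning.
    kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0).mean()
    return policy_loss + kl_coef * kl
```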

3.2.4 Effect Evaluation

After applying RFT to the Qwen3‑8B model, Baixiaosheng’s answer accuracy reached 94.05%, comparable to a 200B‑parameter model. Using the same data, RFT improved accuracy by 6% over pure SFT.

Repetition rate dropped to 0%, and average generation time decreased from 40 seconds (baseline) to about 10 seconds. Consistency evaluation showed a correlation coefficient of 0.85 across multiple generations, surpassing the baseline’s 0.76, indicating more reliable outputs.

Part 4: Conclusion and Outlook

RFT, as an emerging large‑model training paradigm, has demonstrated significant advantages in reasoning, mathematics, and code‑generation benchmarks. Experiments confirm that RFT can effectively boost key business metrics in the Baixiaosheng system, mitigating common fine‑tuning drawbacks such as repetitive outputs and catastrophic forgetting.

Future directions include designing finer‑grained, business‑aligned reward functions and extending RFT to multimodal tasks (e.g., image‑based quality‑inspection QA) to further expand model capabilities.
