Boost Java Agent Performance with End‑to‑End Online Training Using Trinity‑RFT
This article explains how to close the training‑deployment gap for Java‑based AI agents with a cloud‑native, low‑intrusion online training pipeline built on AgentScope Java and Trinity‑RFT. It walks through the architecture, configuration, and custom selection and reward strategies, and shows measurable accuracy gains on a SQL‑Agent benchmark.
Background and Challenges
Large‑language‑model (LLM) agents are moving from prototypes to production use cases such as customer‑service automation, operations diagnostics, data querying, and business‑process orchestration. Developers typically fine‑tune agents offline with supervised (SFT) or reinforcement (RFT) learning, which creates a mismatch between training and real‑world deployment.
Core Challenges
Training‑deployment environment separation: Offline data collection, manual cleaning, and mock toolchains cannot faithfully reproduce production latency, concurrency, or context, causing models that perform well offline to fail in production.
Lack of Java‑ecosystem support: Existing RFT frameworks (e.g., Trinity‑RFT) target Python, forcing Java teams to rewrite agents in Python or build heavyweight adapters.
Proposed Solution
We introduce an end‑to‑end online training solution for Java agents that closes the data loop from production to training. The design emphasizes three traits:
Use real production data: Agents are trained directly on live request‑tool interactions.
Low intrusion: No changes to existing business logic; integration cost is minimal.
Language‑stack friendly: Native Java support without cross‑language rewrites.
Architecture Overview
The system consists of three decoupled components:
Agent Runner – Deployed by developers, handles real user requests and communicates with Explorer via HTTP. No GPU required.
Explorer (Inference Service) – Receives requests, performs LLM inference, calls tools, records full interaction traces, and exposes an OpenAI‑compatible API.
Trainer (Training Service) – Reads new traces from shared storage (SQLite or SQL DB), runs SFT/RFT training, and writes updated checkpoints back for Explorer to hot‑load.
Online Training Process
1. The system automatically samples high‑quality requests from live traffic.
2. The agent processes each request with the current model, invoking real tools (e.g., APIs, databases).
3. The full interaction (input, output, tool calls, state changes) is recorded and, together with a reward signal, forms a training sample (sketched below).
4. When enough rewarded trajectories are collected, the Trainer incrementally updates the model, achieving a "use‑and‑learn" feedback loop.
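To make the shape of a training sample concrete, here is a minimal sketch of the kind of record the Explorer could persist per interaction. The class and field names are illustrative assumptions, not the actual Trinity‑RFT experience schema.
import java.time.Instant;
import java.util.List;

// Hypothetical shape of one recorded trajectory; illustrative only,
// not the actual Trinity-RFT experience schema.
public record InteractionTrace(
        String requestId,
        Instant timestamp,
        String userInput,         // original user request
        List<String> toolCalls,   // serialized tool invocations and their results
        String finalResponse,     // agent's final answer to the user
        Double reward) {          // [0, 1] score; null until a reward is assigned

    // A trace becomes a training sample once a reward has been attached.
    public boolean isTrainable() {
        return reward != null;
    }
}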
Implementation Details
Maven Dependency
<dependency>
<groupId>io.agentscope</groupId>
<artifactId>agentscope-extensions-training</artifactId>
<version>${agentscope.version}</version>
</dependency>
Request Selection Strategies
Two built‑in strategies are provided; developers can also implement custom logic (a sketch follows the examples below).
SamplingRateStrategy – Randomly selects a percentage of online requests (e.g., 10%).
TrainingSelectionStrategy strategy = SamplingRateStrategy.of(0.1); // 10%
ExplicitMarkingStrategy – Developers explicitly mark high‑value requests in code.
TrainingSelectionStrategy strategy = ExplicitMarkingStrategy.create();
TrainingContext.mark("high-quality", "user-feedback");
agent.call(msg).block(); // this request will be used for training
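For custom logic, a strategy that selects only requests touching a given topic can implement the same interface. A minimal sketch follows; the single shouldSelect(String) method is an assumed interface shape, not the library's documented API.
// Hypothetical custom strategy: select only requests that mention SQL.
// The shouldSelect(String) signature is an assumption, not the documented API.
public class SqlOnlyStrategy implements TrainingSelectionStrategy {
    @Override
    public boolean shouldSelect(String userInput) {
        return userInput != null && userInput.toLowerCase().contains("sql");
    }
}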
Custom Reward Function
Implement the RewardCalculator interface and return a double in [0, 1] based on factors such as tool success, response relevance, or external feedback.
public class CustomReward implements RewardCalculator {
    @Override
    public double calculate(Agent agent) {
        // Placeholder signals; in practice derive these from the agent's
        // tool-call results and any user feedback attached to the interaction.
        double executionScore = 1.0; // 1.0 if all tool calls succeeded, else 0.0
        double userRating = 0.8;     // normalized user rating in [0, 1]
        // Example: combine execution success (0.6) and user rating (0.4)
        return 0.6 * executionScore + 0.4 * userRating;
    }
}
Installation of Trinity‑RFT
Prerequisites: Python 3.10–3.12, CUDA ≥ 12.8, and at least two GPUs.
git clone https://github.com/agentscope-ai/Trinity-RFT
cd Trinity-RFT
pip install -e ".[dev]"
pip install flash-attn==2.8.1
Configuration Files
Explorer (serve) config (explorer.yaml)
mode: serve
project: test
name: test
checkpoint_root_dir: /shared/checkpoints
model:
  model_path: /path/to/model
  max_model_len: 8192
  max_response_tokens: 2048
  temperature: 0.7
algorithm:
  algorithm_type: "ppo"
cluster:
  node_num: 1
  gpu_per_node: 4
explorer:
  rollout_model:
    engine_num: 2
    tensor_parallel_size: 2
    enable_openai_api: true
    enable_history: true
    enable_auto_tool_choice: true
    tool_call_parser: hermes
    dtype: bfloat16
    seed: 42
  service_status_check_interval: 10
  proxy_port: 8010
buffer:
  train_batch_size: 16
  trainer_input:
    experience_buffer:
      name: exp_buffer
      storage_type: sql
synchronizer:
  sync_method: checkpoint
  sync_interval: 1
monitor:
  monitor_type: tensorboard
Trainer (train) config (trainer.yaml)
mode: train
project: test
name: test
checkpoint_root_dir: /shared/checkpoints
model:
  model_path: /path/to/model
  max_model_len: 8192
  max_response_tokens: 2048
  temperature: 0.7
algorithm:
  algorithm_type: "ppo"
cluster:
  node_num: 1
  gpu_per_node: 4
buffer:
  train_batch_size: 32
  trainer_input:
    experience_buffer:
      name: exp_buffer
      storage_type: sql
trainer:
  save_interval: 16
  ulysses_sequence_parallel_size: 1
  save_hf_checkpoint: always
  max_checkpoints_to_keep: 5
  trainer_config:
    trainer:
      balance_batch: false
      max_actor_ckpt_to_keep: 5
      max_critic_ckpt_to_keep: 5
synchronizer:
  sync_method: checkpoint
  sync_interval: 1
monitor:
  monitor_type: tensorboard
Running the Services
# Start Ray cluster
ray start --head
# Launch Explorer and Trainer
trinity run --config explorer.yaml
trinity run --config trainer.yaml
Java Runner Example
TrainingRunner trainingRunner = TrainingRunner.builder()
        .trinityEndpoint("http://trinity-backend:8010")
        .modelName("/path/to/model")
        .selectionStrategy(SamplingRateStrategy.of(0.1)) // 10% sampling
        .rewardCalculator(new CustomReward())
        .commitIntervalSeconds(300) // commit every 5 minutes
        .build();
trainingRunner.start();

ReActAgent agent = ReActAgent.builder()
        .name("Assistant")
        .sysPrompt("You are a helpful AI assistant. Be friendly and concise.")
        .model(DashScopeChatModel.builder()
                .apiKey(apiKey)
                .modelName("qwen-plus")
                .stream(true)
                .formatter(new DashScopeChatFormatter())
                .build())
        .memory(new InMemoryMemory())
        .toolkit(new Toolkit())
        .build();

Msg response = agent.call(Msg.userMsg("Search Python tutorials")).block();
trainingRunner.stop();
SQL Agent Demo
The online‑training pipeline is applied to a SQL‑generation agent (Qwen2.5‑Coder‑1.5B‑Instruct). The agent receives natural‑language questions, generates SQL, executes it against a real database, and iteratively refines the query if execution fails. Reward logic combines execution success and LLM‑based relevance scoring.
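To illustrate that reward logic, a minimal sketch could weight the two signals as follows. The helper methods and the 0.7/0.3 weights are assumptions for illustration, not the demo's actual implementation.
// Hypothetical combined reward for the SQL agent; weights and helpers
// are illustrative, not the demo's actual implementation.
public class SqlAgentReward implements RewardCalculator {
    @Override
    public double calculate(Agent agent) {
        double execScore = sqlExecutedSuccessfully(agent) ? 1.0 : 0.0;
        double relevance = llmRelevanceScore(agent); // judge-model score in [0, 1]
        return 0.7 * execScore + 0.3 * relevance;
    }

    private boolean sqlExecutedSuccessfully(Agent agent) {
        // Placeholder: inspect the agent's tool-call results for execution errors.
        return true;
    }

    private double llmRelevanceScore(Agent agent) {
        // Placeholder: ask a judge LLM how well the answer matches the question.
        return 0.9;
    }
}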
Evaluation Setup
We evaluate on the Spider test set (1000 queries) before and after training, measuring execution accuracy across difficulty levels.
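Execution accuracy counts a predicted query as correct when executing it returns the same result as the gold query. A minimal JDBC sketch of that check (illustrative; not the benchmark's actual evaluation harness):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Illustrative execution-accuracy check: a prediction is correct when it
// returns the same rows as the gold SQL. (Order-insensitive comparison is
// common; this sketch compares rows as returned.)
public final class ExecAccuracy {
    public static boolean matches(Connection db, String predictedSql, String goldSql)
            throws SQLException {
        return runQuery(db, predictedSql).equals(runQuery(db, goldSql));
    }

    private static List<List<Object>> runQuery(Connection db, String sql)
            throws SQLException {
        List<List<Object>> rows = new ArrayList<>();
        try (Statement st = db.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<Object> row = new ArrayList<>();
                for (int c = 1; c <= cols; c++) row.add(rs.getObject(c));
                rows.add(row);
            }
        }
        return rows;
    }
}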
Results Before Training
## Summary
- **Total Samples:** 1000
- **Execution Accuracy:** 47.60%
## Scores by Difficulty
| Difficulty | Count | Exec Accuracy | Percentage |
|------------|-------|---------------|------------|
| easy | 327 | 0.612 | 61.16% |
| medium | 445 | 0.449 | 44.94% |
| hard | 140 | 0.357 | 35.71% |
| extra | 88 | 0.295 | 29.55% |
| all        | 1000  | 0.476         | 47.60%     |
---
Results After Training
## Summary
- **Total Samples:** 1000
- **Success Count:** 1000
- **Error Count:** 0
- **Success Rate:** 100.00%
- **Execution Accuracy:** 65.70% (based on 1000 successful evaluations)
## Scores by Difficulty
| Difficulty | Count | Exec Accuracy | Percentage |
|------------|-------|---------------|------------|
| easy | 327 | 0.844 | 84.40% |
| medium | 445 | 0.616 | 61.57% |
| hard | 140 | 0.529 | 52.86% |
| extra | 88 | 0.375 | 37.50% |
| all        | 1000  | 0.657         | 65.70%     |
---
Training improved execution accuracy by 18.1 percentage points overall (47.60% → 65.70%), with per‑difficulty gains of 23.24 points (easy), 16.63 (medium), 17.15 (hard), and 7.95 (extra).
Community and Resources
Project repository: https://github.com/agentscope-ai/agentscope-java
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.