Boost Java Agent Performance with End‑to‑End Online Training Using Trinity‑RFT

This article explains how to overcome the training‑deployment gap for Java‑based AI agents by introducing a cloud‑native, low‑intrusion online training pipeline built on AgentScope Java and Trinity‑RFT, detailing architecture, configuration, custom selection and reward strategies, and showing measurable accuracy gains on a SQL‑Agent benchmark.


Background and Challenges

Large‑language‑model (LLM) agents are moving from prototypes to production use cases such as customer‑service automation, operations diagnostics, data querying, and business‑process orchestration. Developers typically fine‑tune agents offline with supervised (SFT) or reinforcement (RFT) learning, which creates a mismatch between training and real‑world deployment.

Core Challenges

Training‑deployment environment separation: Offline data collection, manual cleaning, and mock toolchains cannot faithfully reproduce production latency, concurrency, or context, causing models that perform well offline to fail in production.

Lack of Java‑ecosystem support: Existing RFT frameworks (e.g., Trinity‑RFT) target Python, forcing Java teams to rewrite agents in Python or build heavyweight adapters.

Proposed Solution

We introduce an end‑to‑end online training solution for Java agents that closes the data loop from production to training. The design emphasizes three traits:

Use real production data: Agents are trained directly on live request‑tool interactions.

Low intrusion: No changes to existing business logic; integration cost is minimal.

Language‑stack friendly: Native Java support without cross‑language rewrites.

Architecture Overview

The system consists of three decoupled components:

Agent Runner – Deployed by developers, handles real user requests and communicates with Explorer via HTTP. No GPU required.

Explorer (Inference Service) – Receives requests, performs LLM inference, calls tools, records full interaction traces, and exposes an OpenAI‑compatible API (exercised in the sketch after this list).

Trainer (Training Service) – Reads new traces from shared storage (SQLite or SQL DB), runs SFT/RFT training, and writes updated checkpoints back for Explorer to hot‑load.
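Because the Explorer speaks an OpenAI‑compatible protocol, the Agent Runner (or any HTTP client) can exercise it without a special SDK. Below is a minimal smoke‑test sketch; the host, port, model path, and the standard /v1/chat/completions route are placeholders taken from the examples later in this article, not fixed values.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ExplorerSmokeTest {
    public static void main(String[] args) throws Exception {
        // Minimal OpenAI-style chat-completions request against the Explorer's
        // proxy. Host, port, and model path mirror the config shown below.
        String body = """
                {"model": "/path/to/model",
                 "messages": [{"role": "user", "content": "ping"}]}""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://trinity-backend:8010/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // full completion payload
    }
}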

(Architecture diagram)

Online Training Process

1. The system automatically samples high‑quality requests from live traffic.

2. The agent processes each request with the current model, invoking real tools (e.g., APIs, databases).

3. The full interaction (input, output, tool calls, state changes) is recorded and, together with a reward signal, forms a training sample (sketched below).

4. When enough rewarded trajectories are collected, the Trainer incrementally updates the model, closing a "use‑and‑learn" feedback loop.
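For intuition, one recorded sample can be pictured as the following structure. This is a hedged sketch built only from the fields named above; the actual trace schema used by the Explorer and Trainer is internal and may differ.

import java.util.List;

// Hedged sketch of the information one training sample carries, based on the
// fields named in step 3; the real trace schema may differ.
public record TrainingSample(
        String requestInput,          // the original user request
        String modelOutput,           // the agent's final response
        List<ToolCall> toolCalls,     // every tool invocation with its result
        double reward) {              // reward signal in [0, 1]

    public record ToolCall(String name, String arguments, String result) {}
}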

Implementation Details

Maven Dependency

<dependency>
  <groupId>io.agentscope</groupId>
  <artifactId>agentscope-extensions-training</artifactId>
  <version>${agentscope.version}</version>
</dependency>

Request Selection Strategies

Two built‑in strategies are provided; developers can also implement custom logic (a sketch follows the examples below).

SamplingRateStrategy – Randomly selects a percentage of online requests (e.g., 10%).

TrainingSelectionStrategy strategy = SamplingRateStrategy.of(0.1); // 10%

ExplicitMarkingStrategy – Developers explicitly mark high‑value requests in code.

TrainingSelectionStrategy strategy = ExplicitMarkingStrategy.create();
TrainingContext.mark("high-quality", "user-feedback");
agent.call(msg).block(); // this request will be used for training
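A custom strategy is just another implementation of TrainingSelectionStrategy. The article does not show the interface's method signature, so the sketch below assumes a single boolean selection hook; the method name shouldSelect and the RequestContext argument are illustrative assumptions, not the real API.

// Hedged sketch: keep a request only if it arrives off-peak and its last
// tool call succeeded. `shouldSelect` and `RequestContext` are assumptions.
public class OffPeakSuccessStrategy implements TrainingSelectionStrategy {
    @Override
    public boolean shouldSelect(RequestContext ctx) {
        int hour = java.time.LocalTime.now().getHour();
        boolean offPeak = hour < 8 || hour > 20;       // avoid peak business hours
        return offPeak && ctx.lastToolCallSucceeded(); // keep useful trajectories only
    }
}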

Custom Reward Function

Implement the RewardCalculator interface and return a score in [0, 1] based on factors such as tool success, response relevance, or external feedback.

public class CustomReward implements RewardCalculator {
    @Override
    public double calculate(Agent agent) {
        // Example: blend execution success (weight 0.6) and user rating (weight 0.4).
        // How the sub-scores are obtained is application-specific; both are
        // assumed to already be normalized to [0, 1].
        double executionScore = 1.0; // e.g., 1.0 if all tool calls succeeded, else 0.0
        double userRating = 0.8;     // e.g., from explicit user feedback
        double reward = 0.6 * executionScore + 0.4 * userRating;
        return Math.max(0.0, Math.min(1.0, reward)); // clamp to [0, 1]
    }
}

Installation of Trinity‑RFT

Prerequisites: Python 3.10‑3.12, CUDA ≥12.8, at least two GPUs.

git clone https://github.com/agentscope-ai/Trinity-RFT
cd Trinity-RFT
pip install -e ".[dev]"
pip install flash-attn==2.8.1

Configuration Files

Explorer (serve) config (explorer.yaml)

mode: serve
project: test
name: test
checkpoint_root_dir: /shared/checkpoints
model:
  model_path: /path/to/model
  max_model_len: 8192
  max_response_tokens: 2048
  temperature: 0.7
algorithm:
  algorithm_type: "ppo"
cluster:
  node_num: 1
  gpu_per_node: 4
explorer:
  rollout_model:
    engine_num: 2           # number of rollout (inference) engines
    tensor_parallel_size: 2 # GPUs per engine; 2 engines × 2 GPUs = the 4 GPUs above
    enable_openai_api: true
    enable_history: true
    enable_auto_tool_choice: true
    tool_call_parser: hermes
    dtype: bfloat16
    seed: 42
service_status_check_interval: 10
proxy_port: 8010 # OpenAI-compatible endpoint the Java runner connects to
buffer:
  train_batch_size: 16
  trainer_input:
    experience_buffer:
      name: exp_buffer
      storage_type: sql
synchronizer:
  sync_method: checkpoint # Explorer hot-loads new checkpoints written by the Trainer
  sync_interval: 1
monitor:
  monitor_type: tensorboard

Trainer (train) config (trainer.yaml)

mode: train
project: test
name: test
checkpoint_root_dir: /shared/checkpoints
model:
  model_path: /path/to/model
  max_model_len: 8192
  max_response_tokens: 2048
  temperature: 0.7
algorithm:
  algorithm_type: "ppo"
cluster:
  node_num: 1
  gpu_per_node: 4
buffer:
  train_batch_size: 32
  trainer_input:
    experience_buffer:
      name: exp_buffer
      storage_type: sql
trainer:
  save_interval: 16
  ulysses_sequence_parallel_size: 1
  save_hf_checkpoint: always
  max_checkpoints_to_keep: 5
  trainer_config:
    trainer:
      balance_batch: false
      max_actor_ckpt_to_keep: 5
      max_critic_ckpt_to_keep: 5
synchronizer:
  sync_method: checkpoint
  sync_interval: 1
monitor:
  monitor_type: tensorboard

Running the Services

# Start the Ray cluster that both services run on
ray start --head
# Launch Explorer and Trainer as two separate long-running processes
# (e.g., in two terminals)
trinity run --config explorer.yaml
trinity run --config trainer.yaml

Java Runner Example

TrainingRunner trainingRunner = TrainingRunner.builder()
    .trinityEndpoint("http://trinity-backend:8010")  // Explorer's proxy_port from explorer.yaml
    .modelName("/path/to/model")                     // matches model_path in the configs above
    .selectionStrategy(SamplingRateStrategy.of(0.1)) // 10% sampling
    .rewardCalculator(new CustomReward())
    .commitIntervalSeconds(300) // commit collected samples every 5 minutes
    .build();
trainingRunner.start();

ReActAgent agent = ReActAgent.builder()
    .name("Assistant")
    .sysPrompt("You are a helpful AI assistant. Be friendly and concise.")
    .model(DashScopeChatModel.builder()
        .apiKey(apiKey)
        .modelName("qwen-plus")
        .stream(true)
        .formatter(new DashScopeChatFormatter())
        .build())
    .memory(new InMemoryMemory())
    .toolkit(new Toolkit())
    .build();

Msg response = agent.call(Msg.userMsg("Search Python tutorials")).block();
trainingRunner.stop();

SQL Agent Demo

The online‑training pipeline is applied to a SQL‑generation agent (Qwen2.5‑Coder‑1.5B‑Instruct). The agent receives natural‑language questions, generates SQL, executes it against a real database, and iteratively refines the query if execution fails. Reward logic combines execution success and LLM‑based relevance scoring.
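The generate‑execute‑refine loop can be sketched as follows. The Llm and Db interfaces and the retry budget are illustrative stand‑ins; in the demo the ReActAgent drives this loop through tool calls rather than explicit Java control flow.

import java.sql.SQLException;
import java.util.List;

public class SqlRefineLoop {
    private static final int MAX_RETRIES = 3; // illustrative retry budget

    // Hypothetical stand-ins for the LLM and the live database.
    interface Llm { String generateSql(String prompt); }
    interface Db  { List<String> execute(String sql) throws SQLException; }

    static List<String> answer(Llm llm, Db db, String question) throws SQLException {
        String sql = llm.generateSql(question);
        SQLException last = null;
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                return db.execute(sql);   // run against the real database
            } catch (SQLException e) {
                last = e;                 // feed the error back into the next prompt
                sql = llm.generateSql(question
                        + "\nPrevious SQL: " + sql
                        + "\nError: " + e.getMessage());
            }
        }
        throw last; // all attempts failed
    }
}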

Evaluation Setup

We evaluate on the Spider test set (1000 queries) before and after training, measuring execution accuracy across difficulty levels.
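Execution accuracy on Spider counts a prediction as correct when running the generated SQL returns the same result set as running the gold SQL. A minimal scoring sketch, assuming unordered row comparison (harnesses differ on ordering and duplicate handling):

import java.util.List;

public class ExecAccuracy {
    // predResults.get(i) / goldResults.get(i): rows returned for query i.
    static double score(List<List<String>> predResults, List<List<String>> goldResults) {
        int correct = 0;
        for (int i = 0; i < goldResults.size(); i++) {
            if (sameRows(predResults.get(i), goldResults.get(i))) correct++;
        }
        return (double) correct / goldResults.size();
    }

    // Order-insensitive row comparison.
    static boolean sameRows(List<String> a, List<String> b) {
        return a.size() == b.size()
                && a.stream().sorted().toList().equals(b.stream().sorted().toList());
    }
}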

Results Before Training

## Summary
- **Total Samples:** 1000
- **Execution Accuracy:** 47.60%
## Scores by Difficulty
| Difficulty | Count | Exec Accuracy | Percentage |
|------------|-------|---------------|------------|
| easy   | 327 | 0.612 | 61.16% |
| medium | 445 | 0.449 | 44.94% |
| hard   | 140 | 0.357 | 35.71% |
| extra  | 88  | 0.295 | 29.55% |
| all    | 1000| 0.476 | 47.60% |
---

Results After Training

## Summary
- **Total Samples:** 1000
- **Success Count:** 1000
- **Error Count:** 0
- **Success Rate:** 100.00%
- **Execution Accuracy:** 65.70% (based on 1000 successful evaluations)
## Scores by Difficulty
| Difficulty | Count | Exec Accuracy | Percentage |
|------------|-------|---------------|------------|
| easy   | 327 | 0.844 | 84.40% |
| medium | 445 | 0.616 | 61.57% |
| hard   | 140 | 0.529 | 52.86% |
| extra  | 88  | 0.375 | 37.50% |
| all    | 1000| 0.657 | 65.70% |
---

Training improved execution accuracy by 18.1 percentage points overall (47.60% → 65.70%), with gains of 23.24 points (easy), 16.63 (medium), 17.15 (hard), and 7.95 (extra).

Community and Resources

Project repository: https://github.com/agentscope-ai/agentscope-java
