How We Built a High‑Performance AI Rental Advisor with One‑Model Tool‑Use and Reinforcement Learning

This article details the design, challenges, and performance gains of an AI‑driven rental recommendation system that replaces a multi‑agent architecture with a single LLM using dynamic tool‑use, introduces a two‑stage reinforcement‑learning pipeline, and achieves sub‑second latency and higher accuracy for complex rental scenarios.

Alibaba Cloud Developer

Background

AI‑driven recommendation is increasingly used in e‑commerce and service platforms. In the rental business of 芝麻租赁 (Zhima Rental), where decision cycles are long and each decision carries significant weight, traditional recommendation suffers from three core problems: mismatched demand, low decision efficiency, and passive service.

Challenges

Demand mismatch: Photographers and camping beginners have very different concerns, and the platform cannot accurately identify and serve these divergent needs.

Decision efficiency: Critical contract information is scattered across product pages, forcing users to sift through an "information ocean" and raising the decision barrier.

Passive service: Scenario-based requests (e.g., "need a projector for the annual meeting") return only a static product list, with no proactive, consultative suggestions.

Architecture Evolution

The initial multi-agent pipeline (Query → Rewrite → Planning → Retrieval → Agent) suffered from high latency (an average first-token response of 5.1 s) and functional overlap among agents, leading to high cost and poor scalability.

We replaced the multi‑agent split with a single large language model (LLM) that dynamically decides which atomic tool to invoke, following the ReAct paradigm (Reason → Act → Observe → Decide). This reduces serial latency and simplifies the decision flow.

Tool Design

Instead of a monolithic "all‑purpose" tool, we defined more than 15 atomic tools, each with a clear purpose and JSON schema. Example tool definition:

{
  "name": "knowledge_search",
  "description": "Search a knowledge base. Choose a domain (internal_kb, web, xiaohongshu).",
  "parameters": {
    "query": {"type": "string", "description": "search keywords"},
    "domain": {"type": "string", "enum": ["internal_kb", "web", "xiaohongshu"]}
  }
}

Typical recommendation flow for a camera rental request:

1. knowledge_search("西藏旅游 相机 轻便 拍星空") – understand user intent (query: "Tibet travel, camera, lightweight, astrophotography").

2. search_db(product_type="相机", brand="", models=[...]) – initial product retrieval ("相机" = camera).

3. search_db(product_type="相机", brand="") – broaden the search by dropping the model filter if needed.

4. Return the product list and generate the final recommendation.
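
The same loop can be written down as a minimal ReAct controller. Everything in this sketch is illustrative: the call_llm interface, the message format, and the stubbed tool registry are assumptions; only the tool names mirror the flow above.

import json

# Hypothetical tool registry; real implementations would call search services.
TOOLS = {
    "knowledge_search": lambda args: {"intent": "lightweight astrophotography camera"},
    "search_db": lambda args: {"products": []},
}

def react_loop(call_llm, messages, max_steps=6):
    """Reason -> Act -> Observe -> Decide, driven by a single LLM.

    call_llm is assumed to return either a final answer string or a
    dict like {"tool": name, "arguments": {...}} parsed from <tool_call>.
    """
    for _ in range(max_steps):
        decision = call_llm(messages)               # Reason: pick the next action
        if isinstance(decision, str):
            return decision                         # Decide: final recommendation
        observation = TOOLS[decision["tool"]](decision["arguments"])  # Act
        messages.append({                           # Observe: feed the result back
            "role": "tool",
            "content": json.dumps(observation, ensure_ascii=False),
        })
    return "Unable to complete the request within the step budget."

Because one model both reasons and picks the next tool, there is no serial hand-off between specialized agents, which is where the latency savings come from.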

Tool parameters can be highly expressive; a product search request may contain 20+ fields such as product_type, brand, models, key_features, price_range, rental_duration, service_guarantees, etc.
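
For illustration, a fully specified retrieval call might carry a payload like the one below. The field names follow the list above; the value formats and the example values themselves are assumptions.

# Illustrative search_db payload; all values are invented for the example.
search_db_args = {
    "product_type": "相机",                    # camera
    "brand": "",                               # empty string = any brand
    "models": ["model-a", "model-b"],          # hypothetical model identifiers
    "key_features": ["lightweight", "astrophotography"],
    "price_range": {"min": 0, "max": 3000},    # assumed unit: CNY
    "rental_duration": "7d",                   # assumed duration format
    "service_guarantees": ["free_insurance"],  # assumed guarantee code
}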

Why Supervised Fine‑Tuning (SFT) Was Insufficient

Sparse learning signal: Tool-call tokens (the <tool_call> tags and their parameters) occupy only a tiny fraction of each dialogue, so SFT receives very little gradient signal for the calling strategy itself.

Learning format, not strategy: SFT teaches the model the syntax of tool calls, but not when to call, which tool to choose, or how to use the result in subsequent decisions.
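
A toy calculation makes the sparsity concrete; the token counts below are invented purely for illustration.

# In a long consultative dialogue, only the tool-call span carries
# strategy signal under plain SFT cross-entropy.
dialogue_tokens = 600      # invented: total tokens in one training dialogue
tool_call_tokens = 30      # invented: <tool_call> tags plus JSON arguments
signal_fraction = tool_call_tokens / dialogue_tokens
print(f"{signal_fraction:.1%} of the loss terms touch the calling strategy")  # 5.0%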

Two‑Stage Reinforcement Learning Solution

Stage 1 – Format Reinforcement (Rule-Based Reward): Strict syntax checks are applied to generated tool calls; an incorrectly formatted call receives a low reward, steering the model toward syntactically valid calls.
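
A minimal version of such a rule-based reward could look like this sketch; the tag format, reward values, and individual checks are assumptions rather than the production rules.

import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def format_reward(completion: str, known_tools: set) -> float:
    """Assumed rule-based reward: 1.0 for a syntactically valid call, low otherwise."""
    match = TOOL_CALL_RE.search(completion)
    if match is None:
        return 0.0                      # no parseable tool call at all
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return 0.1                      # tags present, but the body is not valid JSON
    if call.get("name") not in known_tools:
        return 0.2                      # valid JSON, but an unknown tool
    if not isinstance(call.get("arguments"), dict):
        return 0.2                      # arguments missing or not an object
    return 1.0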

Stage 2 – Answer Optimization (LLM-as-Judge Reward): A lightweight 4B LLM acts as a judge, scoring responses on accuracy, completeness, and fluency. The reward is a continuous score between 0 and 1, encouraging high-quality answers beyond merely correct tool usage.
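
The judge stage can be sketched similarly; the prompt wording and the judge_llm interface are hypothetical, while the continuous 0-to-1 score and the three criteria come from the description above.

JUDGE_PROMPT = (
    "You are a strict evaluator. Given a user request and a model answer, "
    "score the answer from 0.0 to 1.0 on accuracy, completeness, and fluency. "
    "Reply with the score only."
)

def answer_reward(judge_llm, request: str, answer: str) -> float:
    """Continuous reward in [0, 1] from a lightweight judge model (the 4B LLM above)."""
    reply = judge_llm(f"{JUDGE_PROMPT}\n\nRequest: {request}\nAnswer: {answer}")
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)   # clamp defensively
    except ValueError:
        return 0.0                                        # unparseable judge output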

Addressing Sparse Reward in Multi‑Step Tool‑Use

We introduced region‑specific clipping in the policy‑gradient update: a larger clipping range for tool‑call tokens (high‑impact decisions) and a smaller range for natural‑language tokens (stable generation). This allows aggressive exploration where it matters while keeping language generation stable.
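
Wiring region-specific clipping into a PPO-style surrogate loss might look like the sketch below. The epsilon values and the way the token-type mask is produced are assumptions; the asymmetry between tool-call and natural-language tokens is the idea described above.

import torch

def region_clipped_loss(logp_new, logp_old, advantages, is_tool_token,
                        eps_tool=0.4, eps_lang=0.2):
    """Clipped surrogate with a wider clip range on tool-call tokens.

    logp_new, logp_old: per-token log-probabilities, shape (T,)
    advantages:         per-token advantage estimates, shape (T,)
    is_tool_token:      bool mask, True where the token lies inside <tool_call>
    The epsilon defaults are illustrative, not the trained values.
    """
    ratio = torch.exp(logp_new - logp_old)
    eps = torch.where(is_tool_token,
                      torch.full_like(ratio, eps_tool),    # aggressive exploration
                      torch.full_like(ratio, eps_lang))    # stable generation
    unclipped = ratio * advantages
    clipped = torch.maximum(torch.minimum(ratio, 1.0 + eps), 1.0 - eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()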

Scaling the MoE Backbone

To reduce cost we optimized the Qwen3-Next-80B-A3B Mixture-of-Experts (MoE) model. Conventional DeepSpeed ZeRO-3 training incurred ~93 s per iteration due to heavy communication. By applying a multi-dimensional parallelism strategy (Tensor Parallel = 4, Pipeline Parallel = 8, Expert Parallel = 2, Data Parallel = 1), we balanced compute against communication and achieved nearly a 10× speed-up.
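
As a sanity check on such a layout, the parallel degrees should compose into the GPU world size. The multiplicative composition below is an assumption about this setup, not something stated above.

import math

# Assumption: all four parallel degrees multiply into the world size.
parallel_degrees = {
    "tensor_parallel": 4,    # splits each layer's matrix multiplies across GPUs
    "pipeline_parallel": 8,  # splits the layer stack into sequential stages
    "expert_parallel": 2,    # shards the MoE experts across GPUs
    "data_parallel": 1,      # full-model replicas
}

world_size = math.prod(parallel_degrees.values())
print(world_size)  # 64 GPUs under this assumption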

During inference we applied selective quantization, keeping self-attention output projections and the MoE experts' up/down/gate projections in FP16 while quantizing the rest; this preserved 99.5% of model accuracy while cutting memory usage by 40.6%.
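
One way to express the selective policy is a name filter over the model's modules; the module-name substrings below are guesses at the MoE model's layer naming, not verified identifiers, and replace_with_int8_linear is a hypothetical helper.

# Keep precision-sensitive layers in FP16; quantize everything else to INT8.
FP16_PATTERNS = ("self_attn.o_proj", "mlp.experts")  # attn output proj + expert up/down/gate

def should_quantize(module_name: str) -> bool:
    """True for modules that can be safely quantized under this policy."""
    return not any(pattern in module_name for pattern in FP16_PATTERNS)

# Usage sketch (names are illustrative):
# for name, module in model.named_modules():
#     if isinstance(module, torch.nn.Linear) and should_quantize(name):
#         replace_with_int8_linear(model, name, module)  # hypothetical helper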

Quantitative Results

Overall accuracy improved from 88.32% to 91.55% (+3.23 percentage points).

Parameter error rate dropped by 2.11% and the format-hallucination rate by 0.87%.

Complete recommendation success rate for non-3C categories (i.e., beyond computer, communication, and consumer electronics) increased by 14.93%.

End-to-end latency was reduced from 2850 ms to 100 ms (first-token latency fell from 5.1 s to 1.2 s after the architecture change).

Conclusion

By consolidating multiple agents into a single LLM with a rich set of atomic tools and training it with a two‑stage reinforcement‑learning pipeline, we built a reliable, low‑latency AI rental advisor. The approach demonstrates that a small, focused team can achieve rapid breakthroughs when architecture, training methodology, and tooling are tightly aligned with real‑world challenges.
