How We Built a High‑Performance AI Rental Advisor with One‑Model Tool‑Use and Reinforcement Learning
This article details the design, challenges, and performance gains of an AI‑driven rental recommendation system that replaces a multi‑agent architecture with a single LLM using dynamic tool‑use, introduces a two‑stage reinforcement‑learning pipeline, and achieves sub‑second latency and higher accuracy for complex rental scenarios.
Background
AI‑driven recommendation is increasingly used in e‑commerce and service platforms. In the long‑cycle, high‑decision‑weight rental domain of 芝麻租赁 (Zhima Rental), traditional recommendation suffers from three core problems: mismatched demand, low decision efficiency, and passive service.
Challenges
Demand mismatch: Photographers and camping beginners have very different concerns, and the platform cannot accurately identify and serve these divergent needs.
Decision efficiency: Critical contract information is scattered across product pages, forcing users to sift through an "information ocean" and raising the decision barrier.
Passive service: Scenario‑based requests (e.g., "need a projector for the annual meeting") only return a static product list without proactive, consultative suggestions.
Architecture Evolution
The initial multi‑agent pipeline (Query → Rewrite → Planning → Retrieval → Agent) caused high latency (average first‑token response 5.1 s) and functional overlap among agents, leading to high cost and poor scalability.
We replaced the multi‑agent split with a single large language model (LLM) that dynamically decides which atomic tool to invoke, following the ReAct paradigm (Reason → Act → Observe → Decide). This reduces serial latency and simplifies the decision flow.
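The ReAct loop described above can be sketched in a few lines. This is a minimal illustration, not the production system: `llm` and the tool registry are hypothetical stand‑ins for whatever inference API and tool dispatch the platform actually uses.

```python
import json

def react_loop(llm, tools: dict, user_query: str, max_steps: int = 8) -> str:
    """One model drives the whole Reason -> Act -> Observe -> Decide cycle."""
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = llm(messages)                 # Reason: model reads the transcript
        call = reply.get("tool_call")
        if call is None:                      # Decide: no tool needed, answer directly
            return reply["content"]
        result = tools[call["name"]](**call["arguments"])        # Act
        messages.append({"role": "assistant", "content": json.dumps(call)})
        messages.append({"role": "tool", "content": json.dumps(result)})  # Observe
    return "Unable to complete the request within the step budget."
```

Because the same model both plans and answers, there is no serial hand‑off between agents; latency is bounded by the number of tool round‑trips the model actually decides to take.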
Tool Design
Instead of a monolithic "all‑purpose" tool, we defined more than 15 atomic tools, each with a clear purpose and JSON schema. Example tool definition:
{
  "name": "knowledge_search",
  "description": "Search a knowledge base. Choose a domain (internal_kb, web, xiaohongshu).",
  "parameters": {
    "query": {"type": "string", "description": "search keyword"},
    "domain": {"type": "string", "enum": ["internal_kb", "web", "xiaohongshu"]}
  }
}

A typical recommendation flow for a camera rental request:
knowledge_search("西藏旅游 相机 轻便 拍星空") – understand user intent (roughly: "Tibet travel, lightweight camera, astrophotography").
search_db(product_type="相机", brand="", models=[...]) – initial product retrieval.
search_db(product_type="相机", brand="") – broaden search if needed.
Return product list and generate final recommendation.
Tool parameters can be highly expressive; a product search request may contain 20+ fields such as product_type, brand, models, key_features, price_range, rental_duration, service_guarantees, etc.
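As a concrete illustration of that expressiveness, a structured `search_db` call for the camera scenario above might look like the following. The field names come from the article; every value, and the exact shape of the nested fields, is an assumption for illustration only.

```python
# Hypothetical search_db arguments; field names follow the article,
# values and nested structure are illustrative, not the real schema.
search_db_args = {
    "product_type": "相机",                        # camera
    "brand": "",                                   # empty = any brand
    "models": ["A7M4", "ZV-E10"],                  # candidate models (assumed)
    "key_features": ["轻便", "星空拍摄"],           # lightweight, astrophotography
    "price_range": {"min_daily": 30, "max_daily": 120},
    "rental_duration": {"days": 10},
    "service_guarantees": ["免押金", "碎屏保障"],   # no deposit, screen protection
}
```

Rich, typed parameters like these let the model narrow retrieval in one call instead of issuing a chain of vague keyword searches.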
Why Supervised Fine‑Tuning (SFT) Was Insufficient
Sparse learning signal: Tool‑call tokens (the <tool_call> tags and their JSON parameters) occupy a tiny fraction of the dialogue, so the token‑averaged SFT loss provides little pressure to learn the correct calling strategy.
Learning format, not strategy: SFT teaches the model the syntax of calls but not when to call, which tool to choose, or how to use the result for subsequent decisions.
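A rough way to see the sparsity is to measure what fraction of a dialogue's tokens actually sit inside the tool‑call span; the helper below is illustrative, assuming a whitespace token stream with literal <tool_call> marker tokens.

```python
def tool_token_fraction(tokens: list[str]) -> float:
    """Fraction of tokens inside <tool_call>...</tool_call> spans (inclusive)."""
    inside, count = False, 0
    for t in tokens:
        if t == "<tool_call>":
            inside = True
        count += inside                      # bool counts as 0 or 1
        if t == "</tool_call>":
            inside = False
    return count / len(tokens)
```

On a typical long consultative dialogue, this fraction is in the low single‑digit percent, which is exactly why the averaged cross‑entropy loss barely rewards getting the call right.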
Two‑Stage Reinforcement Learning Solution
Stage 1 – Format Reinforcement (Rule‑Based Reward): Strict syntax checks are applied to generated tool calls. Incorrect format receives a low reward, guiding the model to produce syntactically valid calls.
Stage 2 – Answer Optimization (LLM‑as‑Judge Reward): A lightweight 4B LLM acts as a judge, scoring model responses on accuracy, completeness, and fluency. The reward is a continuous score between 0 and 1, encouraging high‑quality answers beyond mere correct tool usage.
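The two reward signals can be sketched as follows. The tag format and the required `name`/`arguments` keys are assumptions modeled on common tool‑call conventions; the production rule set and the 4B judge's prompt are not described in detail in the article.

```python
import json
import re

def format_reward(completion: str) -> float:
    """Stage 1: rule-based reward — every tool call must be valid JSON
    with the expected top-level keys (assumed convention)."""
    calls = re.findall(r"<tool_call>(.*?)</tool_call>", completion, re.S)
    if not calls:
        return 0.0
    for c in calls:
        try:
            payload = json.loads(c)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(payload, dict) or not {"name", "arguments"} <= payload.keys():
            return 0.0
    return 1.0

def judge_reward(judge, question: str, answer: str) -> float:
    """Stage 2: a small LLM judge scores the answer; clamp to [0, 1].
    `judge` is a hypothetical callable wrapping the 4B judge model."""
    score = judge(question=question, answer=answer)
    return min(max(score, 0.0), 1.0)
```

Running Stage 1 first means the model already emits parseable calls before the continuous judge signal starts shaping answer quality, so the judge never wastes its score range on syntax errors.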
Addressing Sparse Reward in Multi‑Step Tool‑Use
We introduced region‑specific clipping in the policy‑gradient update: a larger clipping range for tool‑call tokens (high‑impact decisions) and a smaller range for natural‑language tokens (stable generation). This allows aggressive exploration where it matters while keeping language generation stable.
Scaling the MoE Backbone
To reduce cost we optimized the Qwen3‑Next‑80B‑A3B Mixture‑of‑Experts (MoE) model. Traditional DeepSpeed ZeRO‑3 training took roughly 93 s per iteration due to heavy communication. By combining parallelism strategies across multiple dimensions (Tensor Parallel = 4, Pipeline Parallel = 8, Expert Parallel = 2, Data Parallel = 1), we balanced compute and communication and achieved nearly a 10× speed‑up.
During inference we applied selective quantization (keeping self‑attention output projections and MoE expert up/down/gate layers in FP16) which preserved 99.5 % model accuracy while cutting memory usage by 40.6 %.
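Selective quantization boils down to a name‑pattern allowlist over the model's parameters. The sketch below illustrates the idea; the name patterns are assumptions modeled on common Qwen/MoE parameter naming, not the exact production list.

```python
# Hypothetical pattern list: attention output projections and MoE expert
# FFN layers stay in FP16, everything else is quantized to INT8.
KEEP_FP16 = ("self_attn.o_proj", "mlp.experts")

def quantization_plan(param_names):
    """Map each parameter name to its target precision."""
    return {
        name: "fp16" if any(p in name for p in KEEP_FP16) else "int8"
        for name in param_names
    }
```

Keeping exactly these sensitive layers in full precision is what lets the large memory saving coexist with near‑lossless accuracy: the quantization error concentrates in layers that tolerate it.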
Quantitative Results
Overall accuracy improved from 88.32 % to 91.55 % (+3.23 percentage points).
Parameter error rate dropped by 2.11 % and format‑hallucination by 0.87 %.
Complete recommendation success rate for non‑3C categories increased by 14.93 %.
End‑to‑end latency dropped from 2850 ms to 100 ms (first‑token latency from 5.1 s to 1.2 s after the architecture change).
Conclusion
By consolidating multiple agents into a single LLM with a rich set of atomic tools and training it with a two‑stage reinforcement‑learning pipeline, we built a reliable, low‑latency AI rental advisor. The approach demonstrates that a small, focused team can achieve rapid breakthroughs when architecture, training methodology, and tooling are tightly aligned with real‑world challenges.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.