Building Xiaomi’s Vertical Domain QA Agent: From RAG to Real‑World Deployment
This article explains how Xiaomi designed and deployed a vertical‑domain question‑answering assistant for product and car queries, covering business background, a four‑module RAG‑plus‑LLM architecture, knowledge‑base construction, custom chunking strategies, dynamic signal handling, and the challenges overcome to achieve reliable real‑time voice interactions.
Introduction
Voice interaction is a key user interface for many Xiaomi products. Users bring a wide range of requests to the XiaoAi voice assistant, and these fall into two categories: general Q&A and vertical-domain Q&A. General Q&A covers content queries such as music or encyclopedia lookups, while vertical-domain Q&A depends on private domain knowledge that a large model's world knowledge cannot answer directly.
Business Background
Xiaomi’s vertical‑domain Q&A includes two major scenarios: the Product Assistant and the Car Q&A Assistant.
Product Assistant handles queries about Xiaomi’s extensive product catalog, including product specifications, after‑sale service policies, and real‑time device status (e.g., battery level, system settings).
Car Q&A Assistant addresses car‑related queries such as locating settings in the infotainment system, retrieving real‑time vehicle signals, and consulting the car manual for operation instructions.
Technical Solution
The solution follows a concise four‑module architecture to meet strict latency requirements for real‑time voice interaction.
AgentParser: Uses a lightweight large model (1-4B parameters) for semantic understanding and simple intent filtering, outputting a function code.
AgentSkill: Executes operations based on the function code, including RAG retrieval, API calls, and on-device signal queries, and assembles the results into a prompt for the large model.
LLM Generation & Post-Processing: Calls the large model to generate the final answer and performs post-processing such as Markdown link handling and image-text mixing.
Knowledge Base: Builds a generic RAG platform for internal Xiaomi data (product encyclopedia, customer service QA, manuals, etc.) and supports privacy-compliant storage and retrieval.
Before invoking the large model, the system decides whether a query truly needs model inference; simple, frequent queries are answered directly through the function code to reduce latency.
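The routing decision described above can be sketched roughly as follows. This is a minimal illustration, not Xiaomi's implementation: the function codes, the keyword rule standing in for the 1-4B parser model, and the canned battery reply are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical function codes and canned replies; the article does not list the real ones.
DIRECT_HANDLERS: Dict[str, Callable[[], str]] = {
    "battery_level": lambda: "Battery is at 82%.",
}


@dataclass
class ParseResult:
    function_code: str
    needs_llm: bool


def parse(query: str) -> ParseResult:
    """Stand-in for AgentParser: a keyword rule instead of a small language model."""
    if "battery" in query.lower():
        return ParseResult("battery_level", needs_llm=False)
    return ParseResult("rag_qa", needs_llm=True)


def answer(query: str, llm: Callable[[str], str]) -> str:
    """Take the fast path when the function code has a direct handler, else call the LLM."""
    result = parse(query)
    handler = DIRECT_HANDLERS.get(result.function_code)
    if not result.needs_llm and handler is not None:
        return handler()  # fast path: no model inference
    prompt = f"Question: {query}\nAnswer using the retrieved context."
    return llm(prompt)
```

Passing the model in as a plain callable keeps the fast path testable without any model dependency.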
Implementation Details
Knowledge base construction relies on multi-source data (product pages, manuals, QA logs). Data is first crawled, then merged and updated in real time to keep price and stock information fresh. Structured data is stored as JSON with a defined schema, while unstructured text is chunked using custom strategies (e.g., "goods + attr + value"). For the car domain, static knowledge (manuals) and dynamic signals (sensor status) are handled separately; dynamic signals are normalized into textual descriptions before vectorization.
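As a rough illustration of the "goods + attr + value" chunking and the signal normalization mentioned above, the sketch below builds one text chunk per product attribute and turns a raw signal into a sentence. The function names, the output wording, and the sample product and signal are assumptions made for this example only.

```python
from typing import Dict, List


def spec_chunks(goods: str, attrs: Dict[str, str]) -> List[str]:
    """Build one chunk per attribute in 'goods + attr + value' form."""
    return [f"{goods} {attr}: {value}" for attr, value in attrs.items()]


def signal_to_text(name: str, value: float, unit: str) -> str:
    """Render a raw signal reading as a sentence that can be embedded like any other text."""
    return f"The current {name.replace('_', ' ')} is {value:g} {unit}."


# Made-up data, not real product specifications or vehicle signals.
chunks = spec_chunks("Example Phone", {"battery": "5000 mAh", "weight": "190 g"})
sentence = signal_to_text("tire_pressure", 2.5, "bar")
```

Keeping the product name in every chunk means each chunk still identifies its product when retrieved on its own.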
Chunking strategies are tailored to each vertical: title-based chunking for manuals, field concatenation for product specs, and length-controlled semantic splitting for large documents. Vector embeddings are generated for each chunk, and a three-stage retrieval pipeline (coarse recall → re-ranking → LLM scoring) is employed. Early experiments used a bge-reranker model for the re-ranking stage; it was later replaced by direct LLM re-ranking, with the candidate set kept under 100 items.
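The three-stage flow can be sketched as below. This is a toy under stated assumptions: a bag-of-words cosine similarity stands in for the vector model, and the re-ranker and LLM scorer are passed in as plain callables, since the article does not describe the real models or their interfaces. The default cut-offs other than the 100-item candidate cap are invented for illustration.

```python
import math
from collections import Counter
from typing import Callable, List

Scorer = Callable[[str, str], float]


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a vector model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], rerank: Scorer, llm_score: Scorer,
             recall_k: int = 100, rerank_k: int = 10, final_k: int = 3) -> List[str]:
    q = embed(query)
    # Stage 1: coarse recall by vector similarity, capped at recall_k candidates.
    recalled = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:recall_k]
    # Stage 2: re-rank the recalled candidates with a stronger scorer.
    reranked = sorted(recalled, key=lambda c: rerank(query, c), reverse=True)[:rerank_k]
    # Stage 3: final scoring (an LLM in the article) picks what goes into the prompt.
    return sorted(reranked, key=lambda c: llm_score(query, c), reverse=True)[:final_k]
```

Each stage narrows the candidate set, so the more expensive scorers only see a small number of chunks.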
Model training includes supervised fine-tuning (SFT) on carefully curated data covering seven core capabilities, including knowledge summarization, specific information extraction, complex reasoning, multi-turn interaction, coreference resolution, and fallback responses. A lightweight 7B model is distilled from larger models and further refined with SFT and optional DPO for style alignment.
Challenges
Key challenges include:
Efficiently vectorizing massive, heterogeneous knowledge sources while preserving hierarchical relationships.
Ensuring real‑time freshness of dynamic signals and product data.
Designing custom chunking and indexing strategies that boost retrieval effectiveness from 80% to over 90% in domain‑specific tests.
Balancing latency and accuracy by pre‑recognizing intents and limiting LLM prompt length.
Summary
The main takeaways are:
Build a multi‑source knowledge base with tailored chunking and vectorization.
Optimize retrieval by starting with a dedicated reranker and later moving to LLM-based re-ranking.
Introduce multi‑turn query rewriting to improve recall for complex queries.
Align responses with brand guidelines using SFT + DPO pipelines.
Q&A
Q1: How does a custom chunking strategy integrate with a generic RAG platform?
A1: The platform provides an extensible interface; developers package their chunking logic according to the agreed protocol, which allows domain-specific strategies to be deployed seamlessly. Because custom chunkers adhere to the interface, they can be plugged in without modifying the core RAG service.
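One way such a plug-in contract could look is sketched here. The article does not publish the platform's actual protocol, so the `Chunker` interface, the registry functions, and the `# ` title marker are all hypothetical.

```python
from typing import Dict, List, Protocol


class Chunker(Protocol):
    """Assumed contract: any object with a chunk(document) method qualifies."""

    def chunk(self, document: str) -> List[str]: ...


_REGISTRY: Dict[str, Chunker] = {}


def register(name: str, chunker: Chunker) -> None:
    """Make a chunker available to the platform under a strategy name."""
    _REGISTRY[name] = chunker


def chunk_with(name: str, document: str) -> List[str]:
    """Core-service entry point: look up the strategy and delegate to it."""
    return _REGISTRY[name].chunk(document)


class TitleChunker:
    """Splits a manual on lines starting with '# ' (an assumed title marker)."""

    def chunk(self, document: str) -> List[str]:
        chunks: List[str] = []
        current: List[str] = []
        for line in document.splitlines():
            if line.startswith("# ") and current:
                chunks.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current))
        return chunks


register("manual_title", TitleChunker())
```

The core service only ever calls `chunk_with`, so adding a new vertical means registering another class rather than editing platform code.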
Q2: How can the quality of query rewriting be evaluated?
A2: Use both quantitative metrics (e.g., ROUGE-L) and business-oriented metrics such as recall and accuracy improvements in downstream retrieval tasks. Quantitative scores measure textual similarity, while business metrics directly reflect the impact on retrieval performance.
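The two kinds of metric can be illustrated with a small sketch. It assumes whitespace tokenization and the balanced F1 form of ROUGE-L (other weightings exist), and pairs it with a simple recall@k for the downstream retrieval check; the function names are chosen for this example.

```python
from typing import List, Set


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of relevant chunk IDs that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

Comparing recall@k before and after rewriting on the same query set shows whether a rewrite that scores well on ROUGE-L actually helps retrieval.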
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
