Designing a Multi‑Language, Multi‑Business LLM‑Powered Customer Service QA System
Didi's International Business Group built an LLM-driven quality-inspection platform for Spanish and Portuguese support across ride-hailing, food delivery, and finance. Its three pipelines (intent verification, compliance assessment, and voice-of-customer (VOC) trend analysis) raised intent accuracy to 86%, kept compliance accuracy above 90%, and cut manual reporting time from hours to minutes.
Didi International Business Group (IBG) created an intelligent customer‑service quality‑inspection system that supports Spanish and Portuguese across three business lines—ride‑hailing, food delivery, and finance. The platform runs three parallel pipelines—intent verification, compliance assessment, and VOC trend analysis—each delivering a full reasoning trace, thereby replacing opaque third‑party solutions with a transparent, self‑iterable AI architecture. The intent pipeline raised accuracy from under 40% to 86%; the compliance pipeline consistently exceeds 90% accuracy; the VOC pipeline reduces manual aggregation from several hours to a few minutes.
The business challenges were fourfold: black-box results that could not be traced, rule-maintenance costs that exploded across the matrix of languages and business lines, slow response to changes in inspection standards, and no ability to detect trends proactively.
The system addresses these issues by separating concerns into three dedicated pipelines, each of which outputs structured results together with the underlying inference process.
The intent pipeline initially used direct classification: the LLM received the full Contact Reason (CR) taxonomy together with the dialogue and returned a label. Accuracy was low (under 40%), and prompt tweaks did not improve it. Analysis showed that the bottleneck was the call architecture: what information the model could see mattered more than how the prompt was worded. A second version refactored the architecture to tightly control the context visible to the LLM at each step, which produced a noticeable jump in accuracy. A third version improved the CR tag definitions in the data layer, adding richer, less ambiguous descriptions. After these iterations the pipeline reached 86% accuracy. The authors conclude that (1) architecture-level information gating outweighs prompt engineering, and (2) the ceiling of LLM classification is set by the quality of the label definitions.
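The source does not describe how the context was gated, so the following is only a minimal sketch of one way to do it: a two-step call in which the model first sees top-level categories only, then sees only the labels under the chosen category. The taxonomy contents, the `LLM` callable type, and the function names `build_prompt`, `pick`, and `classify` are all hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, List

# Any function that takes a prompt and returns the model's text answer (assumption).
LLM = Callable[[str], str]

# Hypothetical two-level CR taxonomy; real labels and descriptions will differ.
TAXONOMY: Dict[str, Dict] = {
    "payment": {
        "description": "Charges, fees, and refunds.",
        "labels": {
            "double_charge": "Customer was charged twice for one order or trip.",
            "cancellation_fee": "Customer disputes a fee applied after cancelling.",
        },
    },
    "safety": {
        "description": "Incidents and conduct during a trip.",
        "labels": {
            "driver_behavior": "Customer reports unsafe or rude driver conduct.",
        },
    },
}


def build_prompt(dialogue: str, options: Dict[str, str]) -> str:
    """Render only the options that this step is allowed to see."""
    listed = "\n".join(f"- {name}: {desc}" for name, desc in options.items())
    return f"Pick exactly one option name.\n{listed}\n\nDialogue:\n{dialogue}"


def pick(llm: LLM, dialogue: str, options: Dict[str, str], trace: List[dict]) -> str:
    """Ask the model to choose among the visible options and record the step."""
    answer = llm(build_prompt(dialogue, options)).strip()
    if answer not in options:
        raise ValueError(f"model returned an unknown option: {answer!r}")
    trace.append({"visible_options": list(options), "answer": answer})
    return answer


def classify(llm: LLM, dialogue: str) -> dict:
    """Step 1 sees categories only; step 2 sees only the chosen category's labels."""
    trace: List[dict] = []
    categories = {name: node["description"] for name, node in TAXONOMY.items()}
    category = pick(llm, dialogue, categories, trace)
    label = pick(llm, dialogue, TAXONOMY[category]["labels"], trace)
    return {"category": category, "label": label, "trace": trace}
```

Because each step validates the answer against the options it actually showed, a label from an unrelated branch can never be returned, and the recorded `trace` gives reviewers the per-step view the article describes as a reasoning trace.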
The compliance assessment pipeline had to handle many combinations of language and business line without a separate rule set for each. The solution externalized all variable elements (quality-inspection standards and business-insight (BI) questions) into structured configuration files. At runtime the system assembles a complete prompt based on the ticket's language and business line, then issues a single LLM call that produces multiple compliance scores and BI analyses at once, with the output shape enforced by a JSON-Schema-based tool-use contract. Deterministic post-processing validates rule-based metrics (e.g., spelling error counts). An upstream filter discards irrelevant dialogues (pure transfers, no-reply tickets), which saves cost and keeps noise out of the results. This pipeline maintains an average compliance accuracy above 90%.
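As a rough illustration of the configuration-driven design, the sketch below keys a hypothetical in-memory `CONFIG` by language and business line, assembles one prompt plus a JSON Schema from it, and filters out tickets that should not be inspected. The field names, the standards, and the ticket shape are assumptions, and `check_output` is a deliberately minimal stand-in that checks required keys only, not a full JSON Schema validator.

```python
import json
from typing import Any, Dict

# Hypothetical configuration; in practice this would be loaded from config files.
CONFIG: Dict[tuple, Dict[str, Dict[str, str]]] = {
    ("es", "ride_hailing"): {
        "standards": {
            "greeting": "Agent greets the customer politely.",
            "next_step": "Agent states a concrete next step.",
        },
        "bi_questions": {"fee_dispute": "Does the customer dispute a fee?"},
    },
    ("pt", "food_delivery"): {
        "standards": {"greeting": "Agent greets the customer politely."},
        "bi_questions": {"late_order": "Is the complaint about a late order?"},
    },
}


def should_inspect(ticket: Dict[str, Any]) -> bool:
    """Upstream filter: skip pure transfers and tickets with no customer turn."""
    if ticket.get("is_transfer_only"):
        return False
    return any(turn["role"] == "customer" for turn in ticket["turns"])


def build_request(ticket: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble the prompt and output schema from the ticket's language and line."""
    cfg = CONFIG[(ticket["language"], ticket["line"])]
    schema = {
        "type": "object",
        "properties": {
            "scores": {
                "type": "object",
                "properties": {k: {"type": "integer"} for k in cfg["standards"]},
                "required": list(cfg["standards"]),
            },
            "bi": {
                "type": "object",
                "properties": {k: {"type": "string"} for k in cfg["bi_questions"]},
                "required": list(cfg["bi_questions"]),
            },
        },
        "required": ["scores", "bi"],
    }
    rules = "\n".join(f"- {k}: {v}" for k, v in cfg["standards"].items())
    questions = "\n".join(f"- {k}: {v}" for k, v in cfg["bi_questions"].items())
    dialogue = "\n".join(f"{t['role']}: {t['text']}" for t in ticket["turns"])
    prompt = f"Standards:\n{rules}\n\nQuestions:\n{questions}\n\nDialogue:\n{dialogue}"
    return {"prompt": prompt, "tool_schema": schema}


def check_output(raw: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Parse the model output and confirm every configured key is present."""
    data = json.loads(raw)
    for section in schema["required"]:
        required = set(schema["properties"][section]["required"])
        missing = required - set(data.get(section, {}))
        if missing:
            raise ValueError(f"{section} is missing {sorted(missing)}")
    return data
```

Under this layout, supporting a new language or business line means adding a `CONFIG` entry rather than writing new code, which is the maintainability benefit the article attributes to externalized configuration.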
The VOC pipeline follows a three-stage design. Stage 1 extracts structured fields (issue type, sentiment, resolution, root cause) from each dialogue via independent LLM calls, which allows parallel processing. Stage 2 clusters issue-type labels using embedding similarity, merges synonyms, and ranks clusters by frequency; this step is deterministic and reproducible. Stage 3 generates a management-ready report that includes an executive summary, pain-point analysis, and actionable recommendations. In one real-world case, the pipeline identified a surge of cancellation-fee complaints in the Latin American market; the generated report pinpointed root causes and suggested fixes within minutes, replacing hours of manual review.
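The article does not name the clustering algorithm used in Stage 2. The sketch below uses a simple greedy threshold scheme as a stand-in: labels are visited from most to least frequent, and each one joins the first cluster whose representative is similar enough, otherwise it starts a new cluster. The `embed` callable, the `threshold` value, and the function names are assumptions; any real embedding model could be plugged in.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_labels(
    labels: List[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.8,
) -> List[Tuple[str, int, List[str]]]:
    """Merge near-synonymous labels and rank the clusters by total frequency.

    Returns (representative_label, total_count, member_labels) tuples.
    """
    counts = Counter(labels)
    # Fixed visiting order (frequency, then alphabetical) keeps the result reproducible.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    clusters: List[dict] = []
    for label in ordered:
        vec = embed(label)
        for cluster in clusters:
            if cosine(vec, cluster["vec"]) >= threshold:
                cluster["members"].append(label)
                cluster["count"] += counts[label]
                break
        else:
            # No existing cluster was close enough, so this label starts a new one.
            clusters.append({"rep": label, "vec": vec, "members": [label], "count": counts[label]})
    clusters.sort(key=lambda c: (-c["count"], c["rep"]))
    return [(c["rep"], c["count"], c["members"]) for c in clusters]
```

Given the same labels and the same embedding function, this procedure always returns the same ranking, which matches the deterministic, reproducible property the article claims for this stage. The most frequent label in each cluster serves as its name, so the ranked output can feed the Stage 3 report directly.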
The key takeaways are: (1) architectural control of the LLM's context is a stronger lever for accuracy than prompt engineering; (2) label quality in the data layer defines the upper bound of classification performance; (3) externalizing configuration yields maintainable, scalable pipelines; and (4) traceable reasoning turns quality inspection from a black box into a verifiable basis for decisions. Future work will extend the system to additional languages and business lines and explore cross-pipeline data feedback, such as feeding intent insights back into VOC trend attribution.
