DuIVRS-2: End-to-End Large-Scale Interactive POI Update System

This article analyzes DuIVRS-2, Baidu's end-to-end, large-scale interactive voice-response system for collecting POI data. It covers the architecture, FSM-guided data augmentation, low-latency LLM serving, dual-model iterative learning, and engineering optimizations, together with offline and online experiments that show better accuracy, speed, and cost efficiency than prior solutions.

Baidu Maps Tech Team

Jun 11, 2026

POI (point-of-interest) data underpins digital maps, local-service platforms, and intelligent navigation. It also changes frequently: 74.5% of POIs changed in 2020, so collection has to be automated and scalable. Traditional pipelines rely on manual verification, user reports, or web scraping, all of which are costly and slow.

Interactive voice response (IVR), in which a system calls merchants and gathers attribute information through dialogue, has become the dominant approach. Baidu's earlier system, DuIVRS-1, used a modular pipeline of natural language understanding, dialogue management, and natural language generation (NLU → DM → NLG). That architecture had three major drawbacks: errors accumulated across modules, maintenance was expensive because each module had to be updated separately, and large language models (LLMs) were hard to deploy because of high inference latency, high compute cost, hallucinations, and unstable output.

DuIVRS‑2 replaces the modular pipeline with an end‑to‑end LLM‑driven architecture built around four core techniques:

FSM-guided data augmentation: A finite-state machine derived from production logs maps fixed reply templates to states and merchant utterances to transitions. Two sampling strategies balance the training set and remove long-tail imbalance without manual labeling: path sampling, which selects dialogue lengths uniformly, and transition sampling, which selects diverse replies between states uniformly.

Lightweight LLM with constrained generation and chain-of-thought (CoT) reasoning: A sub-20-billion-parameter model (ERNIE-Bot-tiny) keeps response time under 200 ms. The system restricts output to the legal replies defined by the FSM and forces the model to infer the merchant's intent and judge the scenario before it selects a response. This reduces hallucinations to 0% and improves interpretability.

Dual-model collaborative iterative learning: A domain-fine-tuned model (ERNIE-Bot-turbo) evaluates generated dialogues, while a black-box general model (ERNIE 4.0) serves as an unbiased judge. High-confidence samples are added to the training set and ambiguous samples are sent for human review, which sharply reduces manual effort.

Engineering deployment optimizations: Using the PaddlePaddle framework and FastDeploy, the system converts dynamic graphs to static graphs, applies int8 quantization, and integrates resource scheduling to handle millions of daily calls at 130 ms latency. It retains the ASR/TTS modules and a fallback mechanism for failed generations.
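The FSM-guided augmentation described in the first technique above can be illustrated with a minimal sketch. It reflects one plausible reading of the two sampling strategies, not the authors' implementation. The toy FSM, the state names, the utterances, and the helpers `enumerate_paths` and `sample_dialogue` are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy FSM. States stand for fixed system reply templates;
# each edge holds merchant utterances observed between two states.
TRANSITIONS = {
    ("GREET", "ASK_OPEN"): ["hello", "yes, speaking", "who is this?"],
    ("ASK_OPEN", "ASK_HOURS"): ["yes we are open", "still open"],
    ("ASK_OPEN", "END"): ["we closed down", "no, closed for good"],
    ("ASK_HOURS", "END"): ["nine to six", "9am to 6pm every day"],
}
START, END = "GREET", "END"


def enumerate_paths(start, end):
    """Return every acyclic state path from start to end."""
    graph = defaultdict(list)
    for src, dst in TRANSITIONS:
        graph[src].append(dst)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for nxt in graph[path[-1]]:
            if nxt not in path:  # skip cycles to keep the toy example finite
                stack.append(path + [nxt])
    return paths


def sample_dialogue(rng):
    """Sample one synthetic dialogue as a list of turns."""
    # Path sampling (assumed reading): choose a dialogue length uniformly,
    # then choose uniformly among the paths of that length.
    by_length = defaultdict(list)
    for path in enumerate_paths(START, END):
        by_length[len(path)].append(path)
    length = rng.choice(sorted(by_length))
    path = rng.choice(by_length[length])

    # Transition sampling (assumed reading): on each edge, choose uniformly
    # among the distinct utterances so frequent phrasings do not dominate.
    turns = []
    for src, dst in zip(path, path[1:]):
        utterance = rng.choice(sorted(set(TRANSITIONS[(src, dst)])))
        turns.append({"system": src, "merchant": utterance, "next": dst})
    return turns
```

Calling `sample_dialogue(random.Random(0))` returns a list of turns that starts in `GREET` and ends in `END`. Sampling lengths first, rather than sampling log dialogues directly, is what keeps rare long or short dialogues from being under-represented in this sketch.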

Experiments were conducted on three stratified test sets (D_effective, D_general, D_robust) with 5,000 initial dialogues and incremental 5,000‑sample updates per iteration. Baselines included ERNIE‑Bot‑tiny, ERNIE‑Bot‑turbo, ERNIE 4.0, GPT‑4o, DeepSeek‑V3, and Qwen2.5 series. Metrics were consistency rate (CR) and task‑success rate (TSR).
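The summary does not define the two metrics, so the following is only a sketch under stated assumptions: CR is taken to be the fraction of turns where the system's reply matches a reference reply, and TSR the fraction of calls that end with the target attribute collected. The function names and both definitions are hypothetical.

```python
def consistency_rate(predicted, reference):
    """Assumed definition: share of turns whose predicted reply equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def task_success_rate(outcomes):
    """Assumed definition: share of calls flagged as successful (True)."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)
```

For example, `consistency_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"])` gives 0.75 under this definition.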

Offline results show DuIVRS‑2 achieving 77.18% CR, a 9.1‑point gain over DuIVRS‑1 and up to 15.74% over GPT‑4o. Ablation studies reveal that removing data augmentation drops CR to 64.33%, removing CoT reduces it to 39.00%, and using direct fine‑tuning yields only 60.80%.
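The CoT ablation concerns the step where the model reasons about intent and scenario before choosing among FSM-legal replies. The sketch below shows one way such a constraint could be enforced. It uses post-hoc validation of the model's output, not token-level constrained decoding, which the source does not detail. The JSON output format, the reply identifiers, and `select_reply` are all assumptions made for illustration.

```python
import json

# Hypothetical legal replies per FSM state.
LEGAL_REPLIES = {
    "ASK_OPEN": {"ASK_HOURS", "CONFIRM_CLOSED", "REPEAT_QUESTION"},
    "ASK_HOURS": {"THANK_AND_END", "REPEAT_QUESTION"},
}
FALLBACK_REPLY = "REPEAT_QUESTION"  # hypothetical fallback for failed generations
REQUIRED_FIELDS = ("intent", "scenario", "reply")


def select_reply(state, raw_output):
    """Return (reply_id, used_fallback) for one model output string."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return FALLBACK_REPLY, True
    if not isinstance(parsed, dict):
        return FALLBACK_REPLY, True
    # Require the reasoning fields (intent, scenario) alongside the reply,
    # mirroring the reason-then-select order described in the text.
    for field in REQUIRED_FIELDS:
        value = parsed.get(field)
        if not isinstance(value, str) or not value:
            return FALLBACK_REPLY, True
    # Reject any reply that the FSM does not allow in the current state.
    if parsed["reply"] not in LEGAL_REPLIES.get(state, set()):
        return FALLBACK_REPLY, True
    return parsed["reply"], False
```

Because every returned reply comes from a fixed set, an out-of-vocabulary generation can never reach the merchant in this sketch; it is replaced by the fallback instead.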

Iterative learning experiments demonstrate performance saturation after 3–4 rounds, with decreasing human‑review ratios and hallucination rates falling from 2.08% (baseline) to 0%.
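The routing behind the human-review ratio can be sketched as follows, assuming each of the two judge models returns an accept/reject verdict with a confidence score. The `Judgement` type, the 0.9 threshold, and both function names are hypothetical; the source only states that high-confidence samples join the training set and ambiguous ones go to human review.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    accept: bool       # does the judge consider the generated dialogue correct?
    confidence: float  # assumed score in [0, 1]


def route_sample(domain, general, threshold=0.9):
    """Return 'train' when both judges confidently accept, else 'human_review'."""
    both_accept = domain.accept and general.accept
    both_confident = min(domain.confidence, general.confidence) >= threshold
    if both_accept and both_confident:
        return "train"
    # Disagreement, rejection, or low confidence is treated as ambiguous here.
    return "human_review"


def human_review_ratio(decisions):
    """Share of samples in one iteration that were routed to human review."""
    if not decisions:
        return 0.0
    return decisions.count("human_review") / len(decisions)
```

Tracking `human_review_ratio` per round is one simple way to observe the saturation described above: as the fine-tuned model improves, fewer samples should fall below the threshold.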

A two-month online A/B test compared human agents, DuIVRS-1, and DuIVRS-2. DuIVRS-2 reached an 83.9% TSR (within 4 points of human agents), a per-call cost below ¥0.2, 130 ms average latency, and a capacity of 400,000 calls per day, more than a thousand times what human agents can handle.

Overall, DuIVRS-2 delivers a balanced solution for POI attribute collection across accuracy, efficiency, cost, and stability. Because its design does not depend on task-specific modules, it can be transferred to other task-oriented dialogue domains such as customer service, outbound calling, and industrial Q&A.


