Design and Architecture of DiDi Driver-side Intelligent Voice Assistant "XiaoDi"

The document details DiDi’s driver‑side intelligent voice assistant “XiaoDi,” describing its three‑layer architecture—audio source switching controller, semantic‑parsing core, and business API—along with conflict‑resolution mechanisms, multi‑turn dialogue handling, and a four‑region UI design that together enhance driver safety, convenience, and well‑being.

Didi Tech

Apr 29, 2021

This document presents the design and technical architecture of "XiaoDi", DiDi's driver-side intelligent voice assistant. It first explains the motivation for the assistant: improving driver safety, convenience, and psychological well-being by reducing manual interactions and providing personalized assistance.

The system is divided into three major layers: the audio source switching controller, the semantic parsing core, and the business API (also referred to as the semantic parsing API). The audio source switching controller manages conflicts between the voice assistant and the trip-recording module, ensuring that the microphone is allocated to the right consumer and that audio data is not duplicated. It operates in three phases (load, listening flag marking, and polling) to handle edge cases such as delayed trip-recording activation and concurrent audio streams.
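The three phases can be pictured with a minimal sketch. The summary does not spell out the controller's actual rules, so everything below is an assumption made for illustration: the class and method names are hypothetical, and so is the policy that trip recording takes priority over the assistant when both want the microphone.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Owner(Enum):
    """Who currently receives microphone audio (hypothetical model)."""
    NONE = auto()
    ASSISTANT = auto()
    TRIP_RECORDER = auto()


@dataclass
class AudioSourceController:
    owner: Owner = Owner.NONE
    assistant_listening: bool = False  # the "listening flag"

    def load(self, recorder_active: bool) -> None:
        """Phase 1 (load): pick the initial microphone owner at startup."""
        self.owner = Owner.TRIP_RECORDER if recorder_active else Owner.NONE

    def mark_listening(self, listening: bool) -> None:
        """Phase 2 (listening flag marking): note whether the assistant wants audio."""
        self.assistant_listening = listening

    def poll(self, recorder_active: bool) -> Owner:
        """Phase 3 (polling): re-check ownership on every tick.

        Polling is what catches a trip recorder that starts late.
        Assumption: the recorder wins any conflict, so exactly one
        consumer owns the stream and no audio is delivered twice.
        """
        if recorder_active:
            self.owner = Owner.TRIP_RECORDER
        elif self.assistant_listening:
            self.owner = Owner.ASSISTANT
        else:
            self.owner = Owner.NONE
        return self.owner


controller = AudioSourceController()
controller.load(recorder_active=False)
controller.mark_listening(True)
first = controller.poll(recorder_active=False)   # assistant holds the microphone
second = controller.poll(recorder_active=True)   # recorder started late and takes over
```

Keeping a single `owner` field is one simple way to guarantee that the two modules never consume the same audio at once; the real controller may well resolve conflicts differently.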

The semantic parsing core transforms raw speech input into a structured "semantic parsing element" containing the request source, intent, scene, flow ID, and slot extensions. This element is sent to the semantic parsing API, which maps the intent to concrete actions, returned as a direct result set, such as UI updates, navigation commands, or push notifications. The core also supports multi-turn dialogues by carrying a flow identifier from one turn to the next.

The business API acts as the "brain" of the assistant, converting intent strings into actionable control fields for the driver app. It supports both driver‑initiated and platform‑initiated triggers, handling synchronous direct results and asynchronous push‑based interactions.
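The difference between the synchronous and asynchronous paths can be sketched as follows. This is an assumption-laden toy, not DiDi's API: the intent names, the control fields, and the in-process queue standing in for the push channel are all invented for illustration.

```python
import queue

# Hypothetical table turning intent strings into control fields for the app.
INTENT_TO_CONTROL = {
    "open_navigation": {"control": "navigation", "command": "start"},
    "show_income": {"control": "ui", "page": "income"},
}

# Stand-in for the push channel between the platform and the driver app.
push_channel: "queue.Queue[dict]" = queue.Queue()


def to_control_fields(intent: str) -> dict:
    """Convert an intent string into control fields, with a fallback page."""
    return dict(INTENT_TO_CONTROL.get(intent, {"control": "ui", "page": "fallback"}))


def driver_initiated(intent: str) -> dict:
    """Synchronous path: the caller gets the direct result immediately."""
    return to_control_fields(intent)


def platform_initiated(intent: str) -> None:
    """Asynchronous path: the result is queued and delivered later as a push."""
    push_channel.put({"trigger": "platform", **to_control_fields(intent)})


direct = driver_initiated("open_navigation")  # returned to the caller right away
platform_initiated("show_income")             # nothing returned; a push is queued
pushed = push_channel.get_nowait()            # the app consumes the push later
```

Sharing one conversion function between both paths keeps the mapping from intent to control fields in a single place, whichever side starts the interaction.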

"XiaoDi" is also described as a visual component with button and information display modes, featuring animated states, earphone detection, and a four-region UI layout for status, messages, actions, and tips. The design emphasizes a consistent user experience across the driver's entire usage cycle.

Overall, the document details the end‑to‑end workflow, conflict resolution strategies, and UI design of the intelligent voice assistant, demonstrating how AI, speech recognition, and mobile development techniques are integrated to create a driver‑centric, safety‑focused solution.
