
Design and Architecture of DiDi Driver-side Intelligent Voice Assistant "XiaoDi"

The document details DiDi’s driver‑side intelligent voice assistant “XiaoDi,” describing its three‑layer architecture—audio source switching controller, semantic‑parsing core, and business API—along with conflict‑resolution mechanisms, multi‑turn dialogue handling, and a four‑region UI design that together enhance driver safety, convenience, and well‑being.

Didi Tech

This document presents a comprehensive design and technical architecture of DiDi's driver-side intelligent voice assistant, named "XiaoDi". It outlines the motivation behind the assistant, focusing on improving driver safety, convenience, and psychological well‑being by reducing manual interactions and providing personalized assistance.

The system is divided into three major layers: the audio source switching controller, the semantic parsing core, and the business API (also called the semantic parsing API). The audio source switching controller resolves conflicts between the voice assistant and the trip‑recording module, ensuring that the microphone is allocated to the right consumer and that audio data is not duplicated. It operates in three phases (load, listening flag marking, and polling) to handle edge cases such as delayed trip‑recording activation and concurrent audio streams.
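
The three phases can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not DiDi's implementation: the class and method names are hypothetical, and the priority rule (trip recording owns the microphone whenever it is active, while the assistant reads from the same capture stream) is one plausible way to avoid duplicated audio.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Source(Enum):
    VOICE_ASSISTANT = "voice_assistant"
    TRIP_RECORDING = "trip_recording"


@dataclass
class AudioSourceController:
    """Hypothetical arbiter for a single microphone shared by two modules."""
    loaded: bool = False
    listening: bool = False          # set during the listening-flag-marking phase
    recording_active: bool = False   # may flip on late (delayed trip recording)
    mic_owner: Optional[Source] = None

    def load(self) -> None:
        # Phase 1 (load): reset state; nobody owns the microphone yet.
        self.loaded = True
        self.listening = False
        self.mic_owner = None

    def mark_listening(self, on: bool) -> None:
        # Phase 2 (listening flag marking): record that the assistant wants audio.
        if not self.loaded:
            raise RuntimeError("controller must be loaded first")
        self.listening = on

    def poll(self) -> Optional[Source]:
        # Phase 3 (polling): re-evaluate ownership on every tick so that a
        # late-starting trip recording is picked up.
        # Assumed rule: trip recording takes priority over the assistant.
        if self.recording_active:
            self.mic_owner = Source.TRIP_RECORDING
        elif self.listening:
            self.mic_owner = Source.VOICE_ASSISTANT
        else:
            self.mic_owner = None
        return self.mic_owner

    def consumers(self) -> Set[Source]:
        # Modules that should receive each captured frame. Frames are captured
        # once by the owner and fanned out, so no second capture stream is opened.
        active = set()
        if self.recording_active:
            active.add(Source.TRIP_RECORDING)
        if self.listening:
            active.add(Source.VOICE_ASSISTANT)
        return active


controller = AudioSourceController()
controller.load()
controller.mark_listening(True)
first_owner = controller.poll()      # assistant is the only consumer so far
controller.recording_active = True   # trip recording starts late
second_owner = controller.poll()     # ownership moves to trip recording
```

Re-evaluating ownership in `poll` rather than once at load time is what lets this sketch cope with the delayed-activation case: the assistant keeps receiving frames, but the capture stream is handed to the recorder as soon as it becomes active.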

The semantic parsing core transforms raw speech input into a structured "semantic parsing element" containing the request source, intent, scene, flow ID, and slot extensions. This element is sent to the semantic parsing API, which maps each intent to a set of direct results: concrete actions such as UI updates, navigation commands, or push notifications. The core also supports multi‑turn dialogues by carrying the same flow identifier across turns.
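
As a rough illustration of the semantic parsing element, the sketch below models the five fields named above as a Python dataclass. The field names, the JSON serialization, and the `new_element` helper are assumptions for illustration; the source does not specify the wire format.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class SemanticParsingElement:
    """Hypothetical shape of the element sent to the semantic parsing API."""
    request_source: str              # who triggered the request, e.g. "driver"
    intent: str                      # parsed intent string
    scene: str                       # app scene in which the request occurred
    flow_id: str                     # ties the turns of one dialogue together
    slots: Dict[str, str] = field(default_factory=dict)  # slot extensions

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


def new_element(
    request_source: str,
    intent: str,
    scene: str,
    slots: Optional[Dict[str, str]] = None,
    previous: Optional[SemanticParsingElement] = None,
) -> SemanticParsingElement:
    # A follow-up turn reuses the flow ID of the previous element;
    # a fresh dialogue gets a newly generated one.
    flow_id = previous.flow_id if previous is not None else uuid.uuid4().hex
    return SemanticParsingElement(request_source, intent, scene, flow_id, dict(slots or {}))


# Two turns of one dialogue: the second turn inherits the first turn's flow ID.
first = new_element("driver", "navigate_to", "in_trip", {"destination": "airport"})
follow_up = new_element("driver", "confirm", "in_trip", previous=first)
```

Reusing `flow_id` is the only state the sketch needs for multi‑turn handling: the receiving side can group elements by that identifier to reconstruct the dialogue.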

The business API acts as the "brain" of the assistant, converting intent strings into actionable control fields for the driver app. It supports both driver‑initiated and platform‑initiated triggers, handling synchronous direct results and asynchronous push‑based interactions.
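
The intent-to-control-field mapping might look like the following sketch. The intent names, the `ControlResult` fields, and the rule that platform‑initiated triggers are delivered by push are all hypothetical; they are chosen only to show a direct (synchronous) result next to a push‑based (asynchronous) one.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class ControlResult:
    """Hypothetical control fields returned to the driver app."""
    action: str                  # e.g. "navigation" or "ui_update"
    payload: Dict[str, str]
    delivery: str = "direct"     # "direct" (synchronous) or "push" (asynchronous)


Handler = Callable[[Dict[str, str]], ControlResult]
HANDLERS: Dict[str, Handler] = {}


def handles(intent: str) -> Callable[[Handler], Handler]:
    # Decorator that registers a handler for one intent string.
    def register(fn: Handler) -> Handler:
        HANDLERS[intent] = fn
        return fn
    return register


@handles("navigate_to")
def navigate(slots: Dict[str, str]) -> ControlResult:
    return ControlResult("navigation", {"destination": slots.get("destination", "")})


@handles("show_message")
def show_message(slots: Dict[str, str]) -> ControlResult:
    return ControlResult("ui_update", {"text": slots.get("text", "")})


def resolve(intent: str, slots: Dict[str, str], platform_initiated: bool = False) -> ControlResult:
    handler = HANDLERS.get(intent)
    if handler is None:
        # Unknown intents fall back to a harmless UI tip.
        result = ControlResult("ui_update", {"tip": "unsupported intent: " + intent})
    else:
        result = handler(slots)
    # Assumption: platform-initiated triggers reach the app through push.
    return replace(result, delivery="push") if platform_initiated else result
```

For example, `resolve("navigate_to", {"destination": "airport"})` yields a direct navigation result, while `resolve("show_message", {"text": "hi"}, platform_initiated=True)` yields the same kind of control fields marked for push delivery.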

"XiaoDi" is also described as a visual component with button and information display modes, featuring animated states, ear‑phone detection, and a four‑region UI layout for status, messages, actions, and tips. The design emphasizes a consistent user experience across the driver’s entire usage cycle.

Overall, the document details the end‑to‑end workflow, conflict resolution strategies, and UI design of the intelligent voice assistant, demonstrating how AI, speech recognition, and mobile development techniques are integrated to create a driver‑centric, safety‑focused solution.
