Building an In-Car Voice Assistant: From Wake‑Word to NLP

This article details the end‑to‑end development of an in‑vehicle voice assistant, covering motivation, functional design, technology stack selection, dialogue flow, privacy, third‑party integration, wake‑word detection, on‑device speech recognition, noise filtering, NLP processing, and deployment considerations.

Huolala Tech

Jul 9, 2024

On March 28, 2024, at the launch of the Xiaomi SU7, Xiaomi demonstrated a large AI model combined with the XiaoAi voice assistant, offering a new intelligent driving experience with fuzzy command recognition, five-zone voice interaction, and seamless connection between phone, tablet, and vehicle.

Why Build a Voice Assistant

A voice assistant uses automatic speech recognition (ASR) to convert spoken commands into actionable instructions and returns feedback through text-to-speech (TTS). This provides several advantages:

Improved driving safety: Drivers can operate functions hands‑free, reducing distraction and risk.

Convenient operation: Natural, intuitive interaction makes controlling vehicle features easier.

Supports multitasking: Voice interaction allows drivers to stay focused on driving while issuing commands.

Personalized experience: The assistant can adapt to driver habits and preferences.

How to Build a Useful Voice Assistant

Creating a functional voice assistant involves multiple technical domains, including ASR, natural language processing (NLP), TTS, and business logic.

1. Define Features and Target Users

Feature scope: Specify tasks such as app navigation, heat‑map queries, order status, nearby recommendations, billing reports, and member suggestions.

Target users: Understand user needs to design appropriate interactions.

2. Choose the Technology Stack

ASR: Convert speech to text.

NLP: Parse user intent.

Business logic: Execute actions based on intent, possibly invoking third‑party APIs.

TTS: Convert text responses back to speech.
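
The four stages above can be read as a simple pipeline. The following is a minimal sketch, assuming each stage is a pluggable function. The names `VoicePipeline`, `asr`, `nlp`, `act`, and `tts` are hypothetical and do not come from the original project, and the stub stages stand in for real engines.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains ASR -> NLP -> business logic -> TTS. All stages are injected."""
    asr: Callable[[bytes], str]   # audio -> recognized text
    nlp: Callable[[str], str]     # text -> intent name
    act: Callable[[str], str]     # intent -> text response
    tts: Callable[[str], bytes]   # text response -> audio

    def handle(self, audio: bytes) -> bytes:
        text = self.asr(audio)
        intent = self.nlp(text)
        reply = self.act(intent)
        return self.tts(reply)


# Stub stages for illustration only; real engines would replace them.
pipeline = VoicePipeline(
    asr=lambda audio: audio.decode("utf-8"),
    nlp=lambda text: "open_heat_map" if "heat map" in text else "unknown",
    act=lambda intent: (
        "Opening the heat map." if intent == "open_heat_map"
        else "Sorry, I did not catch that."
    ),
    tts=lambda reply: reply.encode("utf-8"),
)
```

Keeping each stage behind a plain function signature makes it possible to swap one engine for another without touching the rest of the flow.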

3. Design Dialogue Flow

Single/multi‑turn capability: Support complex requests with multi‑turn conversations.

Intent and entity recognition: Clearly define intents and extract required entities.

Error handling and feedback: Provide robust mechanisms for misunderstandings.
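
One way to picture these three points together is a small dialogue state that tracks the current intent, fills in entities over several turns, and falls back to a clarifying prompt when it cannot interpret a turn. This is a sketch under assumptions: the intent names, the `destination` entity, and the reply strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical intent schema: each intent lists the entities it requires.
REQUIRED_SLOTS: Dict[str, List[str]] = {
    "navigate": ["destination"],
    "order_status": [],
}
FALLBACK = "Sorry, I did not understand. Please try again."


@dataclass
class DialogueState:
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)

    def update(self, intent: Optional[str], entities: Dict[str, str]) -> str:
        """Merge one user turn into the state and return the reply."""
        if intent in REQUIRED_SLOTS:
            if intent != self.intent:
                self.slots = {}          # a new intent discards old entities
            self.intent = intent
        elif intent is not None or self.intent is None:
            return FALLBACK              # unknown intent, or nothing in progress
        self.slots.update(entities)
        missing = [s for s in REQUIRED_SLOTS[self.intent] if s not in self.slots]
        if missing:
            return f"Please tell me the {missing[0]}."   # ask a follow-up
        args = ", ".join(f"{k}={v}" for k, v in sorted(self.slots.items()))
        reply = f"OK, running {self.intent}" + (f" with {args}" if args else "") + "."
        self.intent, self.slots = None, {}               # request completed
        return reply
```

A turn that carries only entities (intent `None`) is treated as an answer to the previous follow-up question, which is what makes the exchange multi-turn.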

4. Implement and Test

Rapid prototyping: Build an MVP to validate ideas.

User testing: Gather feedback to refine features.

Iterate and optimize: Continuously improve based on data.

5. Ensure Privacy and Security

User data protection: Comply with regulations and secure data.

Transparency: Clearly explain data usage to users.

6. Integrate Third‑Party Services

API integration: Add weather, news, calendar, etc., to extend functionality.

7. Continuous Learning and Adaptation

Machine learning: Analyze user behavior to improve performance.

Adaptability: Update the assistant as technology and user needs evolve.

Technical Framework for the Voice Assistant

1. Voice Wake‑Word

1.1 Solution Selection

Wake-word solutions such as FunASR, Baidu Voice Wake-Up, sherpa-onnx, and Teachable Machine were evaluated on cost, package size, offline capability, accuracy, CPU usage, and power consumption. Preference went to solutions that support custom wake-words, ship a small SDK, and can be deployed on mobile devices.

1.2 Wake‑Word‑Free Commands

Some high-frequency commands can be invoked without an explicit wake-word. For example, saying “go back” exits the current screen directly, which reduces manual interaction.
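
A minimal sketch of how such routing might look, assuming a small whitelist of direct commands and exact matching after normalization. The action names and the `route` function are hypothetical; only the “go back” phrase comes from the article.

```python
import re
from typing import Optional

# Hypothetical whitelist of high-frequency commands that bypass the wake-word.
DIRECT_COMMANDS = {
    "go back": "NAVIGATE_BACK",
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text)
    return " ".join(no_punct.lower().split())


def route(text: str, awake: bool) -> Optional[str]:
    """Return an action name, or None if the utterance should be ignored."""
    phrase = normalize(text)
    if phrase in DIRECT_COMMANDS:      # exact match only, to limit false triggers
        return DIRECT_COMMANDS[phrase]
    if awake:                          # wake-word already heard: normal path
        return "FORWARD_TO_NLP"
    return None
```

Exact matching is a deliberate choice in this sketch: because these commands are always listening, a phrase that merely contains “go back” in ordinary conversation should not trigger an action.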

1.3 Improving Wake‑Word Accuracy

Drivers can record a personalized wake-word. The system runs recognition on the recording and registers the recognized text as a custom keyword, which adapts detection to regional accents and improves accuracy.
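
The idea is that the stored keyword is whatever the recognizer actually hears from this particular driver. A rough sketch, assuming the driver records the phrase several times and the most frequent transcript wins; the multiple-recording step, the function name, and the sample phrases are assumptions, not details from the article.

```python
from collections import Counter
from typing import List


def pick_custom_keyword(transcripts: List[str]) -> str:
    """Return the most frequent normalized transcript among the recordings."""
    cleaned = [" ".join(t.lower().split()) for t in transcripts if t.strip()]
    if not cleaned:
        raise ValueError("no usable recording transcripts")
    keyword, _count = Counter(cleaned).most_common(1)[0]
    return keyword


# Hypothetical transcripts of three recordings of the same phrase.
keyword = pick_custom_keyword(["Hello Truck", "hello  truck", "hello track"])
```

The chosen text would then be passed to the wake-word engine as its custom keyword, using whatever registration mechanism that engine provides.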

2. Speech Recognition

2.1 Model Selection

On-device models were required to avoid network latency and cloud service costs. After evaluating frameworks and vendor SDKs including WeNet, FunASR, sherpa-ncnn, sherpa-onnx, Huawei, iFlytek, Baidu, Whisper, and DeepSpeech, the team selected Huawei and sherpa-ncnn for A/B testing on Android.
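
The article does not describe how devices were split between the two engines. One common approach, shown here as an assumption, is to hash a stable device identifier so that each device always lands in the same group. The `assign_engine` function and the salt value are hypothetical.

```python
import hashlib

ENGINES = ("huawei", "sherpa-ncnn")  # the two candidates named above


def assign_engine(device_id: str, salt: str = "asr-ab-v1") -> str:
    """Deterministically map a device to one engine (roughly a 50/50 split)."""
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    return ENGINES[digest[0] % len(ENGINES)]
```

Because the assignment depends only on the identifier and the salt, a device keeps its engine across restarts, and changing the salt reshuffles the groups for a new experiment.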

2.2 Pre‑Processing Optimization

Non-human audio (navigation prompts, music, engine noise) is filtered out before ASR to improve accuracy. Apple's ecosystem provides built-in sound classification; for Android, equivalent solutions were consulted as references.

2.2.1 Feasibility

Human speech differs from machine-generated audio in characteristics such as pitch, speaking rate, and accompanying background noise.

Deep learning models (CNN, RNN, LSTM, Transformer) can reliably classify audio.

Modern hardware enables near‑real‑time processing.

2.2.2 Implementation Steps

Collect and label large datasets of human speech, navigation prompts, and background sounds.

Extract features (MFCC, spectral, prosody).

Train classification models (CNN/LSTM/Transformer) to distinguish human voice.

Integrate real‑time classification before ASR, filtering out non‑human audio.

Continuously test and optimize models for various environments.

Deploy the model within the speech‑recognition pipeline.
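
The integration step can be sketched as a gate that sits in front of the recognizer and forwards only the frames the classifier considers human speech. This is a minimal illustration under assumptions: `speech_prob` is a placeholder for the trained model, and the threshold and smoothing window are hypothetical parameters, not values from the article.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Sequence

Frame = Sequence[float]  # one short chunk of audio samples


def speech_gate(
    frames: Iterable[Frame],
    speech_prob: Callable[[Frame], float],
    threshold: float = 0.5,
    window: int = 3,
) -> Iterator[Frame]:
    """Yield only frames whose smoothed speech probability passes the threshold.

    `speech_prob` stands in for the trained classifier (CNN/LSTM/Transformer).
    Averaging over the last `window` frames avoids flickering on single frames.
    """
    recent = deque(maxlen=window)
    for frame in frames:
        recent.append(speech_prob(frame))
        if sum(recent) / len(recent) >= threshold:
            yield frame  # forwarded to ASR


# Stub classifier for illustration: treats the first sample as the probability.
frames = [[1.0], [1.0], [0.0], [0.0], [1.0]]
kept = list(speech_gate(frames, speech_prob=lambda f: f[0], window=1))
```

A larger window makes the gate slower to open and close, trading a little latency for fewer clipped words.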

2.3 Post‑Recognition NLP Filtering

The initial implementation used regular-expression matching on the recognized text to map utterances to commands. Future plans include fine-tuning BERT for intent understanding; privacy concerns limit the use of third-party large models.
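
A minimal sketch of regex-based command mapping. The patterns and intent names are hypothetical, since the article does not list its actual rules; the command phrases are the samples mentioned in section 2.3.1.

```python
import re
from typing import Optional

# Hypothetical patterns, checked in order; the first match wins.
INTENT_PATTERNS = [
    ("open_heat_map", re.compile(r"\b(open|show)\b.*\bheat\s?map\b")),
    ("refresh_order", re.compile(r"\brefresh\b.*\borders?\b")),
    ("call_passenger", re.compile(r"\b(call|phone)\b.*\bpassenger\b")),
]


def match_intent(text: str) -> Optional[str]:
    """Return the first intent whose pattern matches, or None."""
    lowered = text.lower()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(lowered):
            return intent
    return None
```

Rules like these are easy to audit and run entirely on the device, but each new phrasing needs a new pattern, which is the limitation a fine-tuned intent model is meant to address.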

2.3.1 Data Collection and Tagging

Sample commands (e.g., “open heat map”, “refresh order”, “call passenger”) were collected and manually labeled; the dataset includes dozens of variations per intent.
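
As an illustration of what the labeled data might look like, the sketch below stores each sample as an (utterance, intent) pair and holds out part of every intent for evaluation. The first phrase of each intent comes from the article; the variations, the intent names, and the split rule are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[str, str]  # (utterance, intent)

LABELED: List[Sample] = [
    ("open heat map", "open_heat_map"),
    ("show me the heat map", "open_heat_map"),     # hypothetical variation
    ("refresh order", "refresh_order"),
    ("reload my orders", "refresh_order"),         # hypothetical variation
    ("call passenger", "call_passenger"),
    ("phone the passenger", "call_passenger"),     # hypothetical variation
]


def split_by_intent(data: List[Sample], every: int = 2):
    """Hold out every `every`-th sample of each intent for evaluation."""
    seen = Counter()
    train, test = [], []
    for text, intent in data:
        seen[intent] += 1
        (test if seen[intent] % every == 0 else train).append((text, intent))
    return train, test


train_set, test_set = split_by_intent(LABELED)
```

Splitting per intent keeps every intent represented in both sets, which matters when each intent has only dozens of samples.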

2.3.2 Model Training and Evaluation

The trained model achieved high accuracy and recall, as shown in the figures in the original article.
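
For reference, the two metrics can be computed as follows. Accuracy is the share of all predictions that are correct, and recall for an intent is the share of that intent's true samples that were predicted correctly. The labels below are toy values for illustration, not the project's results.

```python
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_per_intent(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Recall for intent c = correctly predicted c / all samples truly c."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (1 if t == p else 0)
    return {c: hits[c] / totals[c] for c in totals}


# Toy labels, not real evaluation data.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
```

Reporting recall per intent alongside overall accuracy shows whether a rarely used command is being missed even when the aggregate number looks good.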

Conclusion

Building a voice assistant requires careful consideration of safety, usability, privacy, and technical feasibility. The project leveraged open‑source speech technologies, on‑device models, and continuous data‑driven improvement, resulting in a functional in‑car voice interface that enhances driver experience.

For more details, see the voice‑technology research blog linked in the original article.
