
Client‑Side Voice Recognition with TensorFlow Lite and MFCC Optimization

The paper presents a client‑side speech recognizer that uses a compact TensorFlow Lite Inception‑v3 CNN model combined with an optimized MFCC feature pipeline and ARM‑NEON‑accelerated, multi‑threaded processing, achieving low‑latency, high‑accuracy voice recognition on mobile and embedded devices.

Xianyu Technology

This paper addresses two drawbacks of server‑side speech recognition: high latency on poor networks and heavy server resource consumption under large traffic.

To overcome these issues, the authors run voice recognition on the client with TensorFlow Lite, a lightweight inference framework (about 300 KB) whose compressed models are roughly a quarter the size of the equivalent standard TensorFlow model.

Audio features are extracted with a Mel‑frequency cepstral coefficient (MFCC) pipeline based on human auditory perception. The pipeline includes:

1) Parsing raw audio into time‑domain signals;
2) Converting to the frequency domain via a short‑time Fourier transform (STFT) with windowing;
3) Mapping frequencies to the Mel scale;
4) Applying a discrete cosine transform (DCT) to obtain Mel‑cepstral coefficients;
5) Rendering the coefficient vectors as images for model input.
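Steps 1 to 4 above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the 25 ms frame, 10 ms hop, 512-point FFT, 26 Mel filters, 13 coefficients, and the Hamming window are all assumed defaults, and the function names are hypothetical. The Mel mapping uses the common formula m = 2595 · log10(1 + f / 700).

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale mapping (assumed; the article does not name a variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_signal(signal, frame_len, hop):
    # Step 1: slice the time-domain signal into overlapping frames.
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max):
    # Step 3: triangular filters spaced evenly on the Mel scale.
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct2(x, n_coeffs):
    # Step 4: unnormalized DCT-II along the filter axis, first n_coeffs terms.
    n = x.shape[1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
         n_fft=512, n_filters=26, n_coeffs=13, f_min=20.0, f_max=None):
    f_max = f_max or sample_rate / 2
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = frame_signal(np.asarray(signal, dtype=np.float64), frame_len, hop)
    # Step 2: window each frame, then take the power spectrum (STFT).
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max).T
    log_mel = np.log(mel_energy + 1e-10)  # small offset avoids log(0)
    return dct2(log_mel, n_coeffs)
```

Calling `mfcc` on a mono signal returns one row of coefficients per frame, which is the matrix that step 5 would then render as an image.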

Several speed‑up techniques are applied for real‑time client execution:

• Instruction‑set acceleration using ARM NEON extensions;
• Multi‑threaded processing of audio fragments;
• Model acceleration with NEON‑optimized networks;
• Algorithmic acceleration by limiting the frequency band (20 Hz–20 kHz), lowering the sampling rate, intelligent windowing, and silence detection.
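Two of these ideas, silence detection and multi‑threaded processing of audio fragments, can be illustrated with a short Python sketch. It is only a model of the approach: the RMS energy threshold, the fragment length, and the function names are assumptions, and a production client would do this in native code with NEON rather than in Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def split_fragments(signal, fragment_len):
    # Cut the signal into consecutive fragments; the last one may be shorter.
    return [signal[i:i + fragment_len] for i in range(0, len(signal), fragment_len)]

def is_silent(fragment, threshold=1e-3):
    # Treat a fragment as silence when its RMS energy is below the threshold.
    # The threshold value is an assumed placeholder, not taken from the article.
    return float(np.sqrt(np.mean(np.square(fragment)))) < threshold

def process_fragments(signal, fragment_len, worker, max_workers=4):
    # Drop silent fragments first so no feature extraction is spent on them,
    # then hand the remaining fragments to a pool of worker threads.
    fragments = [f for f in split_fragments(signal, fragment_len) if not is_silent(f)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, fragments))
```

Here `worker` stands for whatever per‑fragment step follows, such as feature extraction; `pool.map` returns results in the original fragment order.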

The chosen AI model is a Convolutional Neural Network (CNN), specifically the Inception‑v3 architecture, which offers high accuracy while remaining efficient after conversion to a TensorFlow Lite model via the TOCO tool.
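Because a CNN consumes images, the coefficient matrix from the MFCC pipeline has to be turned into a fixed‑size picture before inference. The sketch below shows one way this might be done, assuming the standard 299×299×3 Inception‑v3 input size, min‑max scaling to 8‑bit grayscale, and nearest‑neighbor resizing; the article does not state which rendering the authors actually used, and `features_to_image` is a hypothetical name.

```python
import numpy as np

def features_to_image(features, size=299):
    # Scale the coefficient matrix to the 0..255 range (min-max normalization).
    features = np.asarray(features, dtype=np.float64)
    lo, hi = features.min(), features.max()
    scale = (hi - lo) if hi > lo else 1.0  # guard against a constant matrix
    gray = ((features - lo) / scale * 255.0).astype(np.uint8)

    # Nearest-neighbor resize to size x size by index lookup.
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    resized = gray[rows][:, cols]

    # Repeat the single channel three times to match an RGB input layout.
    return np.stack([resized] * 3, axis=-1)
```

The result is a `uint8` array of shape `(size, size, 3)`; any further normalization expected by the converted model would be applied on top of this.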

Training uses a balanced dataset of human‑voice and non‑voice samples (≈5 000 each), split into training, validation, and test sets. After convergence, the model is exported as a .pb file, compiled for ARM, and deployed on the client for inference.
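A balanced split of the two classes can be sketched as below. The 80/10/10 ratio, the fixed seed, and the `balanced_split` name are assumptions for illustration; the article only says the data is divided into training, validation, and test sets.

```python
import random

def balanced_split(voice, non_voice, ratios=(0.8, 0.1, 0.1), seed=0):
    # Split each class separately so every subset keeps the same class balance.
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label, samples in (("voice", voice), ("non_voice", non_voice)):
        items = list(samples)
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        splits["train"] += [(s, label) for s in items[:n_train]]
        splits["validation"] += [(s, label) for s in items[n_train:n_train + n_val]]
        # Whatever remains after train and validation goes to the test set.
        splits["test"] += [(s, label) for s in items[n_train + n_val:]]
    return splits
```

Splitting per class, rather than shuffling everything together, keeps the voice to non‑voice ratio identical across all three subsets.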

Experimental results show that the client‑side solution achieves low‑latency, high‑accuracy voice recognition suitable for mobile and embedded devices.

Written by

Xianyu Technology

Official account of the Xianyu technology team
