
Client‑Side Voice Recognition with TensorFlow Lite and MFCC Optimization

The paper presents a client‑side speech recognizer that uses a compact TensorFlow Lite Inception‑v3 CNN model combined with an optimized MFCC feature pipeline and ARM‑NEON‑accelerated, multi‑threaded processing, achieving low‑latency, high‑accuracy voice recognition on mobile and embedded devices.

Xianyu Technology

Apr 20, 2018


This paper addresses two drawbacks of server‑side speech recognition: high latency on poor networks and heavy server resource consumption under large traffic.

To overcome these issues, the authors implement voice recognition on the client using TensorFlow Lite, a lightweight inference framework (about 300 KB) whose compressed models are roughly a quarter the size of their standard TensorFlow counterparts.

Audio features are extracted with a Mel‑frequency cepstral coefficient (MFCC) pipeline based on human auditory perception. The pipeline includes:

1) Parsing raw audio into time‑domain signals;
2) Converting to the frequency domain via short‑time Fourier transform (STFT) with windowing;
3) Mapping frequencies to the Mel scale;
4) Applying the discrete cosine transform (DCT) to obtain Mel‑cepstral coefficients;
5) Rendering the coefficient vectors as images for model input.
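The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the frame size, hop, filter count, coefficient count, and all function names are hypothetical, the Hz-to-Mel conversion uses the common 2595·log10(1 + f/700) formula, and a naive DFT stands in for a real FFT.

```python
import cmath
import math

def hz_to_mel(hz):
    # Common Hz-to-Mel formula (assumed; the source does not state a variant).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def frames(signal, size, hop):
    # Step 2a: split the time-domain signal into overlapping frames.
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def power_spectrum(frame):
    # Step 2b: Hamming window, then a naive O(n^2) DFT (illustration only).
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate, f_min=20.0, f_max=None):
    # Step 3: triangular filters spaced evenly on the Mel scale.
    f_max = f_max or sample_rate / 2
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    mel_points = [lo + (hi - lo) * i / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank

def dct2(values, n_coeffs):
    # Step 4: unnormalized DCT-II, keeping the first n_coeffs coefficients.
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]

def mfcc(signal, sample_rate, frame_size=256, hop=128, n_filters=20, n_coeffs=13):
    bank = mel_filterbank(n_filters, frame_size, sample_rate)
    out = []
    for frame in frames(signal, frame_size, hop):
        spec = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        log_e = [math.log(e + 1e-10) for e in energies]  # epsilon avoids log(0)
        out.append(dct2(log_e, n_coeffs))
    return out

def to_grayscale(matrix):
    # Step 5: rescale the coefficient matrix to 0..255 pixel intensities.
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[int(round((v - lo) * scale)) for v in row] for row in matrix]
```

Calling `to_grayscale(mfcc(samples, 8000))` on a list of float samples yields one row of pixel values per frame, which is one plausible way to form the image the model consumes; the article does not specify the exact image layout.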

Several speed‑up techniques are applied for real‑time client execution:

• Instruction‑set acceleration using ARM NEON extensions;
• Multi‑threaded processing of audio fragments;
• Model acceleration with NEON‑optimized networks;
• Algorithmic acceleration by limiting the frequency band (20 Hz–20 kHz), lowering the sampling rate, intelligent windowing, and silence detection.
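Two of these techniques, silence detection and multi‑threaded fragment processing, can be illustrated together. The sketch below is an assumption-laden example rather than the authors' code: the RMS energy threshold, fragment length, worker count, and every function name are hypothetical, and `extract_features` is a stand-in for the real per-fragment MFCC and inference work. On a device the threads would be native; in CPython, pure-Python work in threads is not truly parallel, so this only shows the structure.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def split_fragments(samples, fragment_len):
    # Cut the sample stream into fixed-length fragments (last one may be short).
    return [samples[i:i + fragment_len]
            for i in range(0, len(samples), fragment_len)]

def rms(fragment):
    # Root-mean-square energy of a fragment; 0.0 for an empty fragment.
    if not fragment:
        return 0.0
    return math.sqrt(sum(x * x for x in fragment) / len(fragment))

def is_silent(fragment, threshold=0.01):
    # Assumed criterion: a fragment is silent if its RMS is below the threshold.
    return rms(fragment) < threshold

def extract_features(fragment):
    # Hypothetical placeholder for per-fragment MFCC extraction and inference.
    return rms(fragment)

def process_audio(samples, fragment_len=1600, workers=4, threshold=0.01):
    fragments = split_fragments(samples, fragment_len)
    # Skip silent fragments so no feature extraction is spent on them.
    voiced = [(i, f) for i, f in enumerate(fragments)
              if not is_silent(f, threshold)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: extract_features(item[1]), voiced)
        # Pair each result with the index of the fragment it came from.
        return [(i, r) for (i, _), r in zip(voiced, results)]
```

Filtering before dispatch keeps the worker pool busy only with fragments that can contain speech, which is where the saving from silence detection comes from.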

The chosen AI model is a Convolutional Neural Network (CNN), specifically the Inception‑v3 architecture, which offers high accuracy while remaining efficient after conversion to a TensorFlow Lite model with TOCO, the TensorFlow Lite Optimizing Converter.

Training uses a balanced dataset of human‑voice and non‑voice samples (≈5 000 each), split into training, validation, and test sets. After convergence, the model is exported as a .pb file, compiled for ARM, and deployed on the client for inference.
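A balanced split of this kind can be sketched as follows. The article does not give the split proportions, so the 80/10/10 percentages, the seed, and the function name are assumptions chosen purely for illustration.

```python
import random

def stratified_split(voice, non_voice, percents=(80, 10, 10), seed=42):
    # Split each class separately so every subset stays balanced.
    # percents are integer percentages for (train, validation, test).
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label, items in (("voice", voice), ("non_voice", non_voice)):
        items = list(items)
        rng.shuffle(items)
        n_train = len(items) * percents[0] // 100
        n_val = len(items) * percents[1] // 100
        splits["train"] += [(x, label) for x in items[:n_train]]
        splits["validation"] += [(x, label)
                                 for x in items[n_train:n_train + n_val]]
        # Whatever remains after train and validation goes to test.
        splits["test"] += [(x, label) for x in items[n_train + n_val:]]
    for part in splits.values():
        rng.shuffle(part)  # interleave the two classes within each subset
    return splits
```

Splitting per class before merging guarantees that each subset keeps the same voice to non‑voice ratio as the full dataset.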

Experimental results show that the client‑side solution achieves low‑latency, high‑accuracy voice recognition suitable for mobile and embedded devices.


