Optimizing Deep Neural Network Inference for Offline Speech Evaluation on Mobile Devices
This article describes how the Liulishuo English fluency app runs deep neural network (DNN) models for real‑time speech scoring directly on smartphones. It covers the challenges of offline inference, BLAS‑based matrix‑vector optimizations, sparsity exploitation, cache‑friendly memory layouts, fixed‑point arithmetic with ARM NEON acceleration, and model compression techniques that cut latency and model size while preserving accuracy.
Deep Learning
Deep learning, a branch of machine learning, has revived artificial intelligence research and includes popular model families such as Feed‑Forward Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).
DNN
The English fluency app uses a DNN‑based speech evaluation algorithm that provides real‑time scoring and feedback, dramatically improving benchmark accuracy compared with traditional models.
BLAS
BLAS (Basic Linear Algebra Subprograms) defines standard APIs for vector and matrix operations; common implementations include OpenBLAS and ATLAS. Level‑2 BLAS provides the gemv routine for matrix‑vector multiplication, which is the core operation in DNN inference.
void cblas_sgemv(const enum CBLAS_ORDER Order,
const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
const float alpha, const float *A, const int lda,
const float *X, const int incX, const float beta,
float *Y, const int incY);
void cblas_dgemv(const enum CBLAS_ORDER Order,
const enum CBLAS_TRANSPOSE TransA, const int M, const int N,
const double alpha, const double *A, const int lda,
const double *X, const int incX, const double beta,
double *Y, const int incY);
DNN Computation Optimization
1. Merge and Remove Unnecessary Computations
Feature Normalization Merging
By folding mean‑subtraction and variance‑scaling into the first affine layer, the extra normalization cost is eliminated while preserving mathematical equivalence.
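The folding itself is a one-time transformation of the weights. A minimal sketch (function names are illustrative, not from the app's codebase): for a normalized input x' = (x − μ)/σ and first layer y = W x' + b, we can absorb the normalization by setting W'ᵢⱼ = Wᵢⱼ/σⱼ and b'ᵢ = bᵢ − Σⱼ Wᵢⱼ μⱼ/σⱼ, so that y = W' x + b' on the raw features.

```c
#include <assert.h>
#include <math.h>

/* Fold per-feature normalization (x - mu) / sigma into the first
 * affine layer y = W x + b, so inference can skip the extra pass.
 * W is M x N, row-major. After folding:
 *   W'[i][j] = W[i][j] / sigma[j]
 *   b'[i]    = b[i] - sum_j (W[i][j] / sigma[j]) * mu[j]
 */
void fold_normalization(float *W, float *b, const float *mu,
                        const float *sigma, int M, int N) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float w = W[i * N + j] / sigma[j];
            b[i] -= w * mu[j];
            W[i * N + j] = w;
        }
    }
}

/* Plain affine layer on the raw (un-normalized) input: y = W x + b. */
void affine(const float *W, const float *b, const float *x,
            float *y, int M, int N) {
    for (int i = 0; i < M; i++) {
        y[i] = b[i];
        for (int j = 0; j < N; j++)
            y[i] += W[i * N + j] * x[j];
    }
}
```

After folding, the affine pass on raw features produces exactly the output the original network would compute on normalized features.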
Removing Softmax
Since the downstream task uses log‑softmax or directly the affine output, the softmax layer can be omitted, reducing computation without affecting ranking.
2. gemv Computation Optimizations
2.0 Native Implementation
y = β * y
for(int i = 0; i < M; i++) {
for(int j = 0; j < N; j++) {
y(i) += α * A(i, j) * x(j); // A(i, j) = A[i*LDA + j] LDA = N
}
}
Exploiting x Sparsity
y = β * y
for(int j = 0; j < N; j++) {
if (x(j) != 0.0) {
tmp = α * x(j);
for(int i = 0; i < M; i++) {
y(i) += tmp * A(i, j);
}
}
}
When the input vector x is sparse, this loop order skips whole columns for zero entries and so avoids many multiplications; however, the inner loop now strides down a column of a row-major matrix, so cache misses can still dominate.
Column‑Major Storage
y = β * y
for(int j = 0; j < N; j++) {
if (x(j) != 0.0) {
tmp = α * x(j);
for(int i = 0; i < M; i++) {
y(i) += tmp * A(j, i); // A stored column-major: A(j, i) = A[j*LDA + i], LDA = M, so the inner loop walks memory sequentially
}
}
}
Storing the matrix in column-major order makes the inner loop's memory accesses contiguous, eliminating most of those cache misses and yielding 1–3× speed-ups for large dimensions.
Increasing Cache Hit Rate
Further re‑ordering and blocking techniques raise cache hit rates, providing additional performance gains.
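One common blocking scheme (a sketch, not necessarily the app's exact implementation) tiles the column loop so each slice of x is reused across all rows while still cache-resident; the tile width BLK would be tuned to the target CPU's L1 size:

```c
#include <assert.h>

#define BLK 64 /* tile width; tune to L1 cache size */

/* y += A x with the column loop blocked: the x[j0..jend) slice stays
 * hot in cache while every row of the tile consumes it.
 * A is M x N, row-major. */
void gemv_blocked(const float *A, const float *x, float *y, int M, int N) {
    for (int j0 = 0; j0 < N; j0 += BLK) {
        int jend = j0 + BLK < N ? j0 + BLK : N;
        for (int i = 0; i < M; i++) {
            float acc = 0.0f;
            for (int j = j0; j < jend; j++)
                acc += A[i * N + j] * x[j];
            y[i] += acc;
        }
    }
}
```

The accumulator also keeps partial sums in a register instead of repeatedly touching y(i) in memory.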
Fixed‑Point Quantization & ARM NEON Acceleration
On ARM‑based mobile CPUs, the DNN parameters are quantized to fixed‑point representation and computed with NEON SIMD instructions, compensating for limited floating‑point throughput.
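A scalar sketch of the quantized arithmetic (the NEON intrinsics themselves are omitted here; scale factors and helper names are illustrative): weights and activations are mapped to int16, products accumulate in int32, and the result is rescaled once at the end. On NEON, the same multiply-accumulate runs several lanes at a time via instructions such as vmlal_s16.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Quantize a float to a 16-bit fixed-point value: value ~= q / scale. */
static int16_t quantize(float v, float scale) {
    float q = v * scale;
    if (q > 32767.0f)  q = 32767.0f;   /* saturate to int16 range */
    if (q < -32768.0f) q = -32768.0f;
    return (int16_t)(q + (q >= 0 ? 0.5f : -0.5f)); /* round to nearest */
}

/* Fixed-point dot product: int16 operands, int32 accumulator,
 * dequantized once at the end instead of per element. */
static float dot_q15(const int16_t *a, const int16_t *b, int n,
                     float sa, float sb) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return (float)acc / (sa * sb);
}
```

Accumulating in int32 avoids per-element overflow for typical layer widths, and the single final division keeps the hot loop free of floating-point work.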
Other Optimizations
Pre‑allocating aligned memory, using pointer arithmetic instead of indexing, transposing matrices, and applying aggressive compiler flags also contribute to faster inference.
3. DNN Model Design
Model size can be reduced by pruning unimportant neurons, teacher‑student distillation, and low‑rank matrix factorization (SVD) while preserving accuracy.
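For the SVD route, the arithmetic saving is easy to see: replacing an M×N weight matrix W with a rank-k factorization W ≈ U V (U is M×k, V is k×N) turns one gemv of M·N multiplies into two passes costing k·(M+N). A minimal sketch:

```c
#include <assert.h>

/* Low-rank replacement for y = W x, with W ~= U V:
 * compute t = V x, then y = U t.
 * U is M x k, V is k x N, both row-major; t is scratch of length k. */
void gemv_lowrank(const float *U, const float *V, const float *x,
                  float *t, float *y, int M, int N, int k) {
    for (int r = 0; r < k; r++) {          /* t = V x : k*N multiplies */
        t[r] = 0.0f;
        for (int j = 0; j < N; j++)
            t[r] += V[r * N + j] * x[j];
    }
    for (int i = 0; i < M; i++) {          /* y = U t : M*k multiplies */
        y[i] = 0.0f;
        for (int r = 0; r < k; r++)
            y[i] += U[i * k + r] * t[r];
    }
}
```

With k well below M·N/(M+N), this is both smaller on disk and faster at inference, typically after a brief retraining pass to recover accuracy.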
4. Joint Optimizations
Combining sparse hidden activations (≈50% zeros) with the sparsity‑aware W·a + b computation shown above skips a large fraction of the multiplications, further accelerating inference.
5. Skip‑Frame Computation
Skipping frames based on speech characteristics halves the computation with negligible accuracy loss, enabling real‑time performance on older iPhone models.
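One simple realization of frame skipping, consistent with "halving the computation" (a sketch; the scorer callback and function names are illustrative): run the DNN only on every other frame and reuse the previous frame's result for the ones skipped.

```c
#include <assert.h>

/* Toy per-frame scorer for demonstration: mean of the frame's features. */
static float mean_score(const float *frame, int dim) {
    float s = 0.0f;
    for (int i = 0; i < dim; i++) s += frame[i];
    return s / dim;
}

/* Skip-frame evaluation: invoke the expensive scorer only on even
 * frames and copy the previous score for odd ones, halving DNN calls. */
void score_with_frame_skipping(const float *frames, int n_frames,
                               int frame_dim, float *scores,
                               float (*score_frame)(const float *, int)) {
    for (int f = 0; f < n_frames; f++) {
        if (f % 2 == 0)
            scores[f] = score_frame(frames + f * frame_dim, frame_dim);
        else
            scores[f] = scores[f - 1]; /* reuse previous frame's score */
    }
}
```

Because adjacent speech frames overlap heavily, copying (or interpolating) the skipped frames' outputs costs little accuracy in practice.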
Speech Evaluation Benchmark Improvements
The DNN‑based acoustic model markedly improves pronunciation scoring accuracy and noise robustness compared with traditional approaches.
Voice Activity Detection (VAD)
Integrating a DNN‑driven VAD automatically stops recording during pauses, making voice interaction smoother and more reliable.
Conclusion
Through model compression, algebraic equivalence transformations, and matrix‑level optimizations, the DNN inference runs entirely offline on mobile CPUs, delivering low‑latency feedback without network dependence; multi‑core CPU and mobile‑GPU acceleration were not pursued further.
Liulishuo Tech Team
Help everyone become a global citizen!