Voiceprint-Based Gender Recognition Using GMM‑UBM and i‑Vector Modeling for 400‑Call Center Audio
This article presents a complete voiceprint-based gender identification pipeline for recordings from 400-number customer-service hotlines, covering acoustic feature extraction, GMM-UBM training, Joint Factor Analysis, i-vector extraction, and logistic regression classification, with a reported accuracy of 97.8%.
1. Introduction
Voiceprint recognition (VPR), also known as speaker recognition, identifies speakers by the physiological and behavioral characteristics carried in speech signals. Gender identification is a key sub-task: automatically labeling caller gender supports customer profiling and reduces manual annotation costs in 400-number call-center services.
2. Principle and Practice
The method captures real‑time 400‑call audio streams, performs endpoint detection, extracts a 2‑second segment, preprocesses the signal, derives acoustic features, and feeds them into a trained model for instant gender classification.
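The article does not specify how endpoint detection is performed. A minimal sketch, assuming a simple frame-energy threshold (real systems would use a more robust VAD), shows how a 2-second segment after speech onset might be cut from a call:

```python
import numpy as np

def extract_voiced_segment(signal, sr, seg_seconds=2.0,
                           frame_ms=25, hop_ms=10, energy_ratio=0.1):
    """Crude energy-based endpoint detection: return the ~2 s of audio
    starting at speech onset. The threshold is a fraction of the peak
    frame energy; energy_ratio=0.1 is an assumed value."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    energies = np.array([
        np.sum(signal[i * hop_len: i * hop_len + frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = energy_ratio * energies.max()
    onset = np.nonzero(energies >= threshold)[0]
    start = onset[0] * hop_len if onset.size else 0
    return signal[start: start + int(seg_seconds * sr)]

# Example: 1 s of silence followed by a 3 s tone at 8 kHz
sr = 8000
t = np.arange(3 * sr) / sr
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
segment = extract_voiced_segment(audio, sr)
print(segment.shape)  # (16000,) — a 2-second slice starting near the onset
```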
The modeling workflow includes training a speaker‑independent Universal Background Model (UBM) with diverse channel data, adapting the UBM to each call via MAP to obtain a GMM, forming a mean super‑vector, applying factor analysis to derive i‑vectors, and finally training a Logistic Regression classifier on the i‑vectors.
2.1 Acoustic Feature Extraction
Raw audio is transformed from the time domain to the frequency domain to emulate human auditory processing while reducing dimensionality and computational load. Mel-Frequency Cepstral Coefficients (MFCC) are extracted through pre-emphasis, framing (25 ms window, 10 ms shift, Hamming window), FFT, a 40-filter mel filter bank, logarithmic scaling, and a discrete cosine transform. The first 20 cepstral coefficients are retained, with the 0th coefficient serving as the energy term, yielding a 20-dimensional feature vector per frame.
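The steps above can be sketched from scratch in numpy/scipy (a didactic sketch with the parameters stated above, not a tuned production extractor; real systems typically use a speech toolkit):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(y, sr, n_mfcc=20, n_mels=40, frame_ms=25, hop_ms=10):
    """MFCC pipeline as described: pre-emphasis, 25 ms Hamming frames with
    10 ms shift, FFT, 40-filter mel bank, log, DCT, first 20 coefficients."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # pre-emphasis
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(y) - flen) // hop
    idx = np.arange(flen)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(flen)                   # framing + window
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2      # power spectrum
    # Triangular mel filter bank between 0 Hz and Nyquist
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)             # log mel energies
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

sr = 8000
y = np.sin(2 * np.pi * 200 * np.arange(2 * sr) / sr)     # 2 s test tone
feats = mfcc_features(y, sr)
print(feats.shape)  # (198, 20): 198 frames, 20 coefficients each
```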
2.2 GMM‑UBM Model Training
2.2.1 GMM
A Gaussian Mixture Model (GMM) combines multiple single‑Gaussian PDFs to approximate complex acoustic distributions.
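A toy sketch with scikit-learn (an assumed toolkit choice; the article does not name one) shows a GMM recovering a mixture p(x) = Σₖ wₖ·N(x; μₖ, Σₖ) from two synthetic feature clusters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Draw 2-D "acoustic features" from two well-separated clusters,
# then fit a 2-component GMM and inspect the recovered parameters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[+2.0, 1.0], scale=0.7, size=(500, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)
print(np.round(gmm.weights_, 2))   # ≈ [0.5, 0.5]
print(np.round(gmm.means_, 1))     # near the two cluster centers
```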
2.2.2 GMM‑UBM
The Universal Background Model (UBM) is a GMM representing the common acoustic space across speakers and channels. The UBM is trained with the EM algorithm on a large, channel-diverse corpus; MAP adaptation then shifts only the UBM mean vectors toward each call's data, producing an utterance-specific adapted GMM.
Super‑vectors are formed by concatenating the adapted means, yielding a high‑dimensional representation that captures both speaker and channel information.
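Mean-only MAP adaptation and supervector formation can be sketched as follows (a minimal sketch assuming a scikit-learn UBM and the conventional relevance factor of 16, which the article does not specify):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_supervector(ubm, X, relevance=16.0):
    """Mean-only MAP adaptation of a UBM to one utterance's frames X,
    then supervector formation by concatenating the adapted means.
    relevance=16 is an assumed, conventional default."""
    gamma = ubm.predict_proba(X)                          # (T, K) posteriors
    n_k = gamma.sum(axis=0)                               # soft frame counts
    e_k = gamma.T @ X / np.maximum(n_k, 1e-10)[:, None]   # posterior means
    alpha = (n_k / (n_k + relevance))[:, None]            # adaptation weights
    means = alpha * e_k + (1 - alpha) * ubm.means_        # shifted means only
    return means.ravel()                                  # K*D supervector

rng = np.random.default_rng(1)
ubm_data = rng.standard_normal((5000, 20))    # stand-in for a diverse corpus
ubm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0, max_iter=20).fit(ubm_data)
utterance = rng.standard_normal((200, 20))    # one call's MFCC frames
sv = map_adapt_supervector(ubm, utterance)
print(sv.shape)  # (1280,) = 64 components x 20 dims
```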
2.2.3 Joint Factor Analysis (JFA)
JFA decomposes the super‑vector into speaker‑specific and channel‑specific subspaces, isolating speaker‑relevant information while mitigating channel variability.
2.2.4 i‑Vector Extraction
The i-vector framework models speaker and channel variability jointly in a single low-dimensional total-variability space: M = m + Tw, where M is the adapted supervector, m is the UBM mean supervector, T is the total-variability matrix, and w is the i-vector (typically 400-dimensional). T is estimated with EM; an i-vector is then extracted for each utterance and fed to a Logistic Regression classifier for gender discrimination.
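The final stage can be sketched on synthetic data. Note the simplification: here T is estimated with PCA as a crude stand-in for the proper EM training of the total-variability matrix, and the supervectors, labels, and dimensions are all toy assumptions, not the article's data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Map each utterance supervector M to a low-dimensional w via M ≈ m + Tw,
# then classify gender from w. PCA substitutes for EM estimation of T;
# pca.mean_ plays the role of m, pca.components_ the rows of T.
rng = np.random.default_rng(0)
n_utt, sv_dim, w_dim = 400, 1280, 50             # toy sizes (real w: ~400-dim)
labels = rng.integers(0, 2, size=n_utt)          # 0 = male, 1 = female (toy)
# Synthetic supervectors whose mean shifts slightly with gender
supervectors = rng.standard_normal((n_utt, sv_dim)) + labels[:, None] * 0.2

pca = PCA(n_components=w_dim).fit(supervectors)
w = pca.transform(supervectors)                  # "i-vectors", (n_utt, 50)

clf = LogisticRegression(max_iter=1000).fit(w[:300], labels[:300])
acc = clf.score(w[300:], labels[300:])           # held-out gender accuracy
print(round(acc, 2))
```

In a real system the EM-trained T and per-utterance i-vector posteriors replace the PCA step, but the classifier stage is the same: a Logistic Regression trained on labeled i-vectors.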
3. Conclusion
The paper outlines a complete voiceprint‑based gender recognition system, from acoustic feature extraction to i‑vector‑based Logistic Regression, achieving a reported accuracy of 97.8% on 400‑call center audio.