Voiceprint-Based Gender Recognition Using GMM‑UBM and i‑Vector Modeling for 400‑Call Center Audio
This article presents a complete voiceprint-based gender identification pipeline for recordings from 400-number customer-service hotlines, covering acoustic feature extraction, GMM-UBM training, Joint Factor Analysis, i-vector extraction, and logistic regression classification, with a reported accuracy of 97.8%.
1. Introduction
Voiceprint recognition (VPR), also known as speaker recognition, identifies speakers by the physiological and behavioral characteristics carried in speech signals. Gender identification is a key sub-task: automatically labeling caller gender supports customer profiling and reduces manual annotation costs in 400-number call-center services.
2. Principle and Practice
The method captures real‑time 400‑call audio streams, performs endpoint detection, extracts a 2‑second segment, preprocesses the signal, derives acoustic features, and feeds them into a trained model for instant gender classification.
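The article does not specify how endpoint detection is performed. A minimal sketch, assuming a simple frame-energy threshold (real systems would use a more robust VAD), shows how a 2-second segment after speech onset might be cut from a call:

```python
import numpy as np

def extract_voiced_segment(signal, sr, seg_seconds=2.0,
                           frame_ms=25, hop_ms=10, energy_ratio=0.1):
    """Crude energy-based endpoint detection: return the ~2 s of audio
    starting at speech onset. The threshold is a fraction of the peak
    frame energy; energy_ratio=0.1 is an assumed value."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    energies = np.array([
        np.sum(signal[i * hop_len: i * hop_len + frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = energy_ratio * energies.max()
    onset = np.nonzero(energies >= threshold)[0]
    start = onset[0] * hop_len if onset.size else 0
    return signal[start: start + int(seg_seconds * sr)]

# Example: 1 s of silence followed by a 3 s tone at 8 kHz
sr = 8000
t = np.arange(3 * sr) / sr
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
segment = extract_voiced_segment(audio, sr)
print(segment.shape)  # (16000,) — a 2-second slice starting near the onset
```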
The modeling workflow includes training a speaker‑independent Universal Background Model (UBM) with diverse channel data, adapting the UBM to each call via MAP to obtain a GMM, forming a mean super‑vector, applying factor analysis to derive i‑vectors, and finally training a Logistic Regression classifier on the i‑vectors.
2.1 Acoustic Feature Extraction
Raw audio is transformed from the time domain to the frequency domain to emulate human auditory processing while reducing dimensionality and computational load. Mel-Frequency Cepstral Coefficients (MFCC) are extracted through pre-emphasis, framing (25 ms window, 10 ms shift, Hamming window), FFT, a 40-filter mel filter bank, logarithmic scaling, and a discrete cosine transform. The first 20 cepstral coefficients are retained, with the 0th coefficient serving as the energy term, yielding a 20-dimensional feature vector per frame.
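The steps above can be sketched from scratch in numpy/scipy (a didactic sketch with the parameters stated above, not a tuned production extractor; real systems typically use a speech toolkit):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(y, sr, n_mfcc=20, n_mels=40, frame_ms=25, hop_ms=10):
    """MFCC pipeline as described: pre-emphasis, 25 ms Hamming frames with
    10 ms shift, FFT, 40-filter mel bank, log, DCT, first 20 coefficients."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # pre-emphasis
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(y) - flen) // hop
    idx = np.arange(flen)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(flen)                   # framing + window
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2      # power spectrum
    # Triangular mel filter bank between 0 Hz and Nyquist
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)             # log mel energies
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

sr = 8000
y = np.sin(2 * np.pi * 200 * np.arange(2 * sr) / sr)     # 2 s test tone
feats = mfcc_features(y, sr)
print(feats.shape)  # (198, 20): 198 frames, 20 coefficients each
```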
2.2 GMM‑UBM Model Training
2.2.1 GMM
A Gaussian Mixture Model (GMM) combines multiple single‑Gaussian PDFs to approximate complex acoustic distributions.
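A toy sketch with scikit-learn (an assumed toolkit choice; the article does not name one) shows a GMM recovering a mixture p(x) = Σₖ wₖ·N(x; μₖ, Σₖ) from two synthetic feature clusters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Draw 2-D "acoustic features" from two well-separated clusters,
# then fit a 2-component GMM and inspect the recovered parameters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[+2.0, 1.0], scale=0.7, size=(500, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)
print(np.round(gmm.weights_, 2))   # ≈ [0.5, 0.5]
print(np.round(gmm.means_, 1))     # near the two cluster centers
```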
2.2.2 GMM‑UBM
The Universal Background Model (UBM) is a GMM representing the common acoustic space across speakers and channels. The UBM is trained with the EM algorithm on a large, channel-diverse corpus; MAP adaptation then shifts only the UBM mean vectors toward each call's data, producing an utterance-specific adapted GMM.
Super‑vectors are formed by concatenating the adapted means, yielding a high‑dimensional representation that captures both speaker and channel information.
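Mean-only MAP adaptation and supervector formation can be sketched as follows (a minimal sketch assuming a scikit-learn UBM and the conventional relevance factor of 16, which the article does not specify):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_supervector(ubm, X, relevance=16.0):
    """Mean-only MAP adaptation of a UBM to one utterance's frames X,
    then supervector formation by concatenating the adapted means.
    relevance=16 is an assumed, conventional default."""
    gamma = ubm.predict_proba(X)                          # (T, K) posteriors
    n_k = gamma.sum(axis=0)                               # soft frame counts
    e_k = gamma.T @ X / np.maximum(n_k, 1e-10)[:, None]   # posterior means
    alpha = (n_k / (n_k + relevance))[:, None]            # adaptation weights
    means = alpha * e_k + (1 - alpha) * ubm.means_        # shifted means only
    return means.ravel()                                  # K*D supervector

rng = np.random.default_rng(1)
ubm_data = rng.standard_normal((5000, 20))    # stand-in for a diverse corpus
ubm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0, max_iter=20).fit(ubm_data)
utterance = rng.standard_normal((200, 20))    # one call's MFCC frames
sv = map_adapt_supervector(ubm, utterance)
print(sv.shape)  # (1280,) = 64 components x 20 dims
```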
2.2.3 Joint Factor Analysis (JFA)
JFA decomposes the super‑vector into speaker‑specific and channel‑specific subspaces, isolating speaker‑relevant information while mitigating channel variability.
2.2.4 i‑Vector Extraction
The i-vector framework models speaker and channel variability jointly in a single low-dimensional total-variability space: M = m + Tw, where M is the adapted supervector, m is the UBM mean supervector, T is the total-variability matrix, and w is the i-vector (typically 400-dimensional). T is estimated with EM; an i-vector is then extracted for each utterance and fed to a Logistic Regression classifier for gender discrimination.
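The final stage can be sketched on synthetic data. Note the simplification: here T is estimated with PCA as a crude stand-in for the proper EM training of the total-variability matrix, and the supervectors, labels, and dimensions are all toy assumptions, not the article's data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Map each utterance supervector M to a low-dimensional w via M ≈ m + Tw,
# then classify gender from w. PCA substitutes for EM estimation of T;
# pca.mean_ plays the role of m, pca.components_ the rows of T.
rng = np.random.default_rng(0)
n_utt, sv_dim, w_dim = 400, 1280, 50             # toy sizes (real w: ~400-dim)
labels = rng.integers(0, 2, size=n_utt)          # 0 = male, 1 = female (toy)
# Synthetic supervectors whose mean shifts slightly with gender
supervectors = rng.standard_normal((n_utt, sv_dim)) + labels[:, None] * 0.2

pca = PCA(n_components=w_dim).fit(supervectors)
w = pca.transform(supervectors)                  # "i-vectors", (n_utt, 50)

clf = LogisticRegression(max_iter=1000).fit(w[:300], labels[:300])
acc = clf.score(w[300:], labels[300:])           # held-out gender accuracy
print(round(acc, 2))
```

In a real system the EM-trained T and per-utterance i-vector posteriors replace the PCA step, but the classifier stage is the same: a Logistic Regression trained on labeled i-vectors.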
3. Conclusion
The paper outlines a complete voiceprint‑based gender recognition system, from acoustic feature extraction to i‑vector‑based Logistic Regression, achieving a reported accuracy of 97.8% on 400‑call center audio.