How a Composite Framework Boosts Speech Emotion Recognition in Noisy Environments
This paper presents a multi‑subsystem ensemble for voice‑based emotion recognition that combines low‑level descriptors, high‑level iVector features, an attention‑based RNN, and a text‑based SVM. On the noisy MEC 2017 dataset, the ensemble outperforms both baselines and each of its own subsystems.
Research Background
Emotion recognition (identifying happiness, sadness, etc.) is increasingly important for enhancing user experience in human‑computer interaction and for applications such as mental‑health monitoring. Voice‑based emotion recognition is attractive due to low hardware requirements, but real‑world environments introduce background noise and spontaneous non‑speech sounds (crying, laughing, coughing) that degrade performance.
Challenges
Background noise interferes with traditional utterance‑level statistical feature extraction.
Spontaneous speech contains non‑speech sounds that may be emotion‑related or irrelevant, further reducing robustness.
Composite Emotion Recognition Framework
The proposed framework combines several subsystems to extract complementary information from the input speech:
Low‑level descriptor (LLD) subsystem: a deep neural network (DNN) trained with multitask learning using openSMILE Interspeech 2010 LLD features. The trunk has two hidden layers of 4096 neurons each; each task branch adds a 1024‑neuron hidden layer followed by a softmax layer. ReLU activations are used.
High‑level iVector subsystem: a similar DNN architecture but with 1024‑neuron hidden layers, trained on 200‑dimensional iVector features extracted with an ASR system trained on 4000 hours of speech.
Sequence‑based subsystem: a recurrent neural network (RNN) with an attention‑based weighted‑pooling layer that converts the input sequence into a high‑level representation for classification. This subsystem also uses multitask learning.
Text‑based subsystem: a support vector machine (SVM) that classifies emotions using transcripts obtained from an ASR system.
The outputs of all subsystems are linearly combined to produce the final emotion prediction.
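The linear combination in the last step can be sketched as a weighted sum of each subsystem's per‑class scores. This is a minimal illustration, not the authors' implementation: the subsystem names, the equal fusion weights, and the example scores are all made up here, and the summary does not say how the real weights were chosen.

```python
from typing import Dict, List, Sequence


def fuse_scores(
    subsystem_scores: Dict[str, Sequence[float]],
    fusion_weights: Dict[str, float],
) -> List[float]:
    """Return the weighted sum of per-class scores across subsystems."""
    n_classes = len(next(iter(subsystem_scores.values())))
    fused = [0.0] * n_classes
    for name, scores in subsystem_scores.items():
        if len(scores) != n_classes:
            raise ValueError(
                f"{name} has {len(scores)} scores, expected {n_classes}"
            )
        weight = fusion_weights[name]
        for k, score in enumerate(scores):
            fused[k] += weight * score
    return fused


def predict_class(fused: Sequence[float]) -> int:
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda k: fused[k])


# Hypothetical posteriors over three emotion classes (illustration only).
example_scores = {
    "lld": [0.5, 0.3, 0.2],
    "ivector": [0.2, 0.6, 0.2],
    "rnn": [0.3, 0.4, 0.3],
    "text": [0.1, 0.7, 0.2],
}
# Assumed equal weights; in practice these would be tuned on held-out data.
example_weights = {"lld": 0.25, "ivector": 0.25, "rnn": 0.25, "text": 0.25}

fused = fuse_scores(example_scores, example_weights)
predicted = predict_class(fused)
print(fused, predicted)
```

Note that in this example the LLD subsystem alone would pick class 0, while the fused scores pick class 1. A subsystem that is wrong on a given clip can be outvoted by the others, which is the motivation for combining complementary views.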
Multitask Training
Each of the three neural subsystems is trained on three tasks at once: the primary task of emotion recognition (weight = 1) and two auxiliary tasks, speaker identification (weight = 0.3) and gender identification (weight = 0.6). The auxiliary tasks provide additional cues that improve feature learning and overall robustness.
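One way to read the task weights above is as coefficients in a single training objective: the per‑task losses are scaled and summed. The sketch below assumes cross‑entropy as the per‑task loss, which fits the softmax outputs described earlier but is not stated in the summary. The function names and example probabilities are hypothetical.

```python
import math
from typing import Dict, Sequence

# Task weights as given in the text.
TASK_WEIGHTS: Dict[str, float] = {"emotion": 1.0, "speaker": 0.3, "gender": 0.6}


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-probability of the target class (assumed per-task loss)."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)


def multitask_loss(
    task_probs: Dict[str, Sequence[float]],
    task_targets: Dict[str, int],
    weights: Dict[str, float] = TASK_WEIGHTS,
) -> float:
    """Weighted sum of the per-task losses for one training example."""
    return sum(
        weights[task] * cross_entropy(task_probs[task], task_targets[task])
        for task in weights
    )


# Hypothetical softmax outputs of the three task branches for one utterance.
example_probs = {
    "emotion": [0.5, 0.5],
    "speaker": [0.25, 0.25, 0.25, 0.25],
    "gender": [0.5, 0.5],
}
example_targets = {"emotion": 0, "speaker": 2, "gender": 1}

loss = multitask_loss(example_probs, example_targets)
print(loss)
```

Because emotion recognition carries the largest weight, its gradient dominates the shared trunk, while the speaker and gender branches contribute smaller updates.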
Experiments
Evaluation was performed on the MEC 2017 dataset, which consists of movie clips containing diverse background noises (car, factory, etc.) and non‑speech sounds (crying, laughing, etc.). Metrics used were unweighted average F‑score (MAF) and accuracy, with primary focus on MAF due to class imbalance.
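The reason an unweighted average F‑score is preferred under class imbalance can be seen in a small example. The sketch below computes the per‑class F‑score, averages it with equal weight per class, and compares the result with accuracy. The labels and predictions are invented for illustration and are not drawn from MEC 2017.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of examples whose predicted label matches the reference."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_average_f_score(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> float:
    """Mean of per-class F-scores, each class counted equally."""
    f_scores: List[float] = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        f_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f_scores) / len(f_scores)


# Made-up imbalanced data: a classifier that always predicts the majority class.
y_true = ["neutral"] * 8 + ["sad"] * 2
y_pred = ["neutral"] * 10

acc = accuracy(y_true, y_pred)
maf = unweighted_average_f_score(y_true, y_pred)
print(acc, maf)
```

Here the always‑majority classifier reaches 80% accuracy but scores zero on the minority class, so its average F‑score falls to about 0.44. A metric that weights every class equally exposes this failure; accuracy hides it.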
Two baselines were compared: the MEC‑recommended random‑forest system and a DNN trained solely on Interspeech 2010 LLD features. The proposed ensemble outperformed both, improving accuracy by 11.9% and 7.8% over the random‑forest and DNN baselines, respectively. Although the individual subsystems varied in accuracy, their combined output outperformed every individual subsystem.
Conclusion
The ensemble framework was also applied to a Chinese movie‑clip database, demonstrating superior performance over existing state‑of‑the‑art deep‑learning systems. These results confirm that integrating low‑level, high‑level, sequential, and textual cues in a multitask setting substantially enhances speech emotion recognition in realistic, noisy environments.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
