How a Composite Framework Boosts Speech Emotion Recognition in Noisy Environments

This paper presents a multi‑subsystem ensemble for voice‑based emotion recognition that leverages low‑level descriptors, high‑level iVector features, attention‑based RNNs, and text SVMs, achieving superior robustness and accuracy on the noisy MEC 2017 dataset.

Alibaba Cloud Developer

Research Background

Emotion recognition (identifying happiness, sadness, etc.) is increasingly important for enhancing user experience in human‑computer interaction and for applications such as mental‑health monitoring. Voice‑based emotion recognition is attractive due to low hardware requirements, but real‑world environments introduce background noise and spontaneous non‑speech sounds (crying, laughing, coughing) that degrade performance.

Challenges

Background noise interferes with traditional utterance‑level statistical feature extraction.

Spontaneous speech contains non‑speech sounds that may be emotion‑related or irrelevant, further reducing robustness.

Composite Emotion Recognition Framework

The proposed framework combines several subsystems to extract complementary information from the input speech:

Low‑level descriptor (LLD) subsystem: a deep neural network (DNN) trained with multitask learning using openSMILE Interspeech 2010 LLD features. The trunk has two hidden layers of 4096 neurons each; each task branch adds a 1024‑neuron hidden layer followed by a softmax layer. ReLU activations are used.

High-level iVector subsystem: a DNN with the same architecture but 1024-neuron hidden layers, trained on 200-dimensional iVector features extracted with an ASR system trained on 4000 hours of speech.

Sequence‑based subsystem: a recurrent neural network (RNN) with an attention‑based weighted‑pooling layer that converts the input sequence into a high‑level representation for classification. This subsystem also uses multitask learning.

Text‑based subsystem: a support vector machine (SVM) that classifies emotions using transcripts obtained from an ASR system.

The outputs of all subsystems are linearly combined to produce the final emotion prediction.
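The linear combination can be illustrated with a minimal sketch. The source does not give the fusion weights, the label set, or the exact form of each subsystem's output, so the equal weights, the four-class `EMOTIONS` list, the subsystem names, and the score vectors below are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical label set, for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def fuse(posteriors: Dict[str, List[float]],
         weights: Dict[str, float]) -> Tuple[str, List[float]]:
    """Linearly combine per-subsystem class scores and pick the top class."""
    total = sum(weights[name] for name in posteriors)
    fused = [0.0] * len(EMOTIONS)
    for name, scores in posteriors.items():
        w = weights[name] / total  # normalise so the weights sum to 1
        for i, s in enumerate(scores):
            fused[i] += w * s
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused


# Made-up class scores for one utterance, one vector per subsystem.
posteriors = {
    "lld_dnn":     [0.40, 0.30, 0.20, 0.10],
    "ivector_dnn": [0.20, 0.50, 0.20, 0.10],
    "attn_rnn":    [0.30, 0.40, 0.20, 0.10],
    "text_svm":    [0.25, 0.25, 0.25, 0.25],
}
# Equal weights are an assumption; in practice they would be tuned on held-out data.
weights = {"lld_dnn": 1.0, "ivector_dnn": 1.0, "attn_rnn": 1.0, "text_svm": 1.0}

label, fused = fuse(posteriors, weights)
print(label, fused)
```

In this toy input the LLD subsystem alone would pick "angry", while the fused scores favour "happy", which shows how complementary subsystems can override a single subsystem's decision.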

Fig1 The proposed ensemble framework for emotion recognition
Fig2 The multitask learning DNN
Fig3 The attention based weighted pooling RNN
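The attention-based weighted pooling of the sequence-based subsystem can be sketched as follows. The source does not specify how the attention scores are computed, so this sketch assumes a dot product between each RNN hidden state and a learned vector `u`, followed by a softmax over time. The hidden states here are toy values, not RNN outputs.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: List[List[float]],
                   u: List[float]) -> Tuple[List[float], List[float]]:
    """Collapse a sequence of hidden states into one vector.

    hidden: T frames, each a D-dimensional hidden state.
    u:      D-dimensional scoring vector (assumed learned during training).
    Returns the pooled D-dimensional vector and the T attention weights.
    """
    scores = [sum(ui * hi for ui, hi in zip(u, h)) for h in hidden]
    alphas = softmax(scores)  # one weight per frame, summing to 1
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden)) for d in range(dim)]
    return pooled, alphas


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states, T=3, D=2

# With u = 0 every frame gets the same weight, so pooling reduces to the mean.
mean_like, uniform = attention_pool(hidden, [0.0, 0.0])

# A non-zero u up-weights frames whose first component is large.
focused, alphas = attention_pool(hidden, [2.0, 0.0])
print(uniform, alphas)
```

The pooled vector has a fixed size regardless of the utterance length, which is what lets a classifier sit on top of a variable-length input sequence.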

Multitask Training

Each of the three neural subsystems is trained with multitask learning on three tasks: emotion recognition as the primary task (weight = 1), plus speaker identification (weight = 0.3) and gender identification (weight = 0.6) as auxiliary tasks. The auxiliary tasks provide additional cues that improve feature learning and overall robustness.
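One plausible reading of these weights is a weighted sum of per-task losses, sketched below. The task weights come from the text; the use of cross-entropy as the per-task loss is an assumption (the text only says each branch ends in a softmax layer), and the example outputs and targets are made up.

```python
import math
from typing import Dict, List

# Task weights as stated in the text.
TASK_WEIGHTS = {"emotion": 1.0, "speaker": 0.3, "gender": 0.6}


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-probability of the correct class (assumed per-task loss)."""
    return -math.log(max(probs[target], 1e-12))


def multitask_loss(outputs: Dict[str, List[float]],
                   targets: Dict[str, int]) -> float:
    """Weighted sum of the per-task losses for one training example."""
    return sum(
        TASK_WEIGHTS[task] * cross_entropy(outputs[task], targets[task])
        for task in TASK_WEIGHTS
    )


# Made-up softmax outputs of the three task branches for one utterance.
outputs = {
    "emotion": [0.7, 0.2, 0.1],
    "speaker": [0.5, 0.5],
    "gender":  [0.9, 0.1],
}
targets = {"emotion": 0, "speaker": 1, "gender": 0}

loss = multitask_loss(outputs, targets)
print(loss)
```

Because the emotion task has the largest weight, its errors dominate the gradient, while the speaker and gender tasks act as regularising side objectives on the shared layers.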

Experiments

Evaluation was performed on the MEC 2017 dataset, which consists of movie clips with diverse background noise (cars, factories, etc.) and non-speech sounds (crying, laughing, etc.). The metrics were the unweighted average F-score (MAF) and accuracy; MAF is the primary metric because the emotion classes are imbalanced.
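A small sketch shows why an unweighted average F-score is more informative than accuracy when classes are imbalanced. The labels and predictions below are made up for illustration and are not taken from MEC 2017.

```python
from typing import List, Sequence


def macro_f_score(y_true: Sequence[str], y_pred: Sequence[str],
                  labels: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    f_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f_scores) / len(f_scores)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Imbalanced toy data: a classifier that always predicts the majority class.
y_true = ["neutral"] * 8 + ["sad"] * 2
y_pred = ["neutral"] * 10

acc = accuracy(y_true, y_pred)
maf = macro_f_score(y_true, y_pred, ["neutral", "sad"])
print(acc, maf)
```

Here accuracy is 0.8 even though the minority class is never detected, while the macro F-score drops to roughly 0.44 because every class counts equally.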

The ensemble was compared against two baselines: the random-forest system recommended by MEC and a DNN trained solely on Interspeech 2010 LLD features. It outperformed both, improving accuracy by 11.9% and 7.8% over the random-forest and DNN baselines, respectively. Although the individual subsystems varied in accuracy, their combined output outperformed every single subsystem.

Experimental results comparison

Conclusion

The ensemble framework was also applied to a Chinese movie‑clip database, demonstrating superior performance over existing state‑of‑the‑art deep‑learning systems. These results confirm that integrating low‑level, high‑level, sequential, and textual cues in a multitask setting substantially enhances speech emotion recognition in realistic, noisy environments.

Keywords: deep neural network, attention model, multitask learning, noisy environment, speech emotion recognition