How Advanced LSTM (A‑LSTM) Boosts Speech Emotion Recognition by 5.5%
This article introduces Advanced LSTM (A‑LSTM), which linearly combines hidden states from several past time steps instead of relying on the previous step alone. It then shows how A‑LSTM is applied to utterance‑level speech emotion recognition inside an attention‑based weighted‑pooling RNN trained with auxiliary speaker and gender tasks, where replacing the LSTM with A‑LSTM improves accuracy by 5.5%.
Research Background
LSTM is widely used in recurrent neural networks for sequence modeling, but its hidden state at the current time depends only on the previous time step, which may limit modeling of long‑range temporal dynamics. To address this limitation, the authors propose Advanced LSTM (A‑LSTM), which linearly combines hidden states from several past time steps, thereby breaking the traditional one‑step dependency.
Advanced Long Short‑Term Memory Network
A‑LSTM uses a linear combination of multiple previous hidden states, computed with a mechanism similar to attention. The combined representation C'(t) is fed into the next time step, allowing the network to assign different weights to distant and near past states.
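The weighting step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the scoring vector `w`, the dot-product score, and the softmax normalization are assumptions chosen to show what a mechanism "similar to attention" could look like, and in a real model `w` would be learned.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_past_states(states, w):
    """Linearly combine several past state vectors.

    states: list of equal-length state vectors from selected past steps.
    w: hypothetical scoring vector (assumed to be a learned parameter).
    Returns the combined vector and the weights that were used.
    """
    # One scalar score per past state (dot product with w).
    scores = [sum(wi * si for wi, si in zip(w, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    # Weighted sum of the states, computed dimension by dimension.
    combined = [
        sum(a * state[d] for a, state in zip(weights, states))
        for d in range(dim)
    ]
    return combined, weights

# Toy example: three 2-D past states and an all-zero scoring vector.
past = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined, weights = combine_past_states(past, w=[0.0, 0.0])
```

With an all-zero scoring vector every state receives the same weight, so the result is the plain average of the three states. A learned `w` would shift weight toward whichever past states are most useful.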
Attention‑Based Weighted‑Pooling Recurrent Neural Network
The authors employ an attention‑based weighted‑pooling RNN for emotion recognition. The network takes acoustic feature sequences as input, uses attention to automatically adjust the weight of each time step, and then performs a weighted average (pooling) to obtain a representation of the entire sequence, effectively suppressing irrelevant segments such as long silences.
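A minimal sketch of attention-based weighted pooling over a frame sequence follows. The attention vector `u` and the dot-product scoring are assumptions for illustration; the article does not give the exact scoring function, and in practice `u` would be learned together with the rest of the network.

```python
import math

def attention_weighted_pooling(frames, u):
    """Pool a sequence of frame vectors into one utterance-level vector.

    frames: list of per-time-step vectors (e.g. recurrent layer outputs).
    u: hypothetical attention vector (assumed to be learned).
    Returns the pooled vector and the per-frame weights.
    """
    # Score every time step, then normalize the scores with a softmax.
    scores = [sum(ui * hi for ui, hi in zip(u, frame)) for frame in frames]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    # Weighted average over time replaces a plain mean over time.
    pooled = [
        sum(a * frame[d] for a, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights

# Toy input: two informative frames and one near-silent frame in the middle.
frames = [[2.0, 1.0], [0.0, 0.0], [2.0, 3.0]]
pooled, weights = attention_weighted_pooling(frames, u=[1.0, 1.0])
```

In this toy example the near-silent frame receives the smallest weight because its score under `u` is the lowest, so it contributes almost nothing to the pooled vector.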
Two auxiliary tasks, speaker identification and gender identification, are added to improve training. In the A‑LSTM variant, the bidirectional LSTM layer is replaced with a bidirectional A‑LSTM that combines three past states (t‑5, t‑3, t‑1) using attention‑style weights.
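Picking the three past states can be sketched as a small helper. The function name `select_past_states` is hypothetical, and padding with a zero vector near the start of the sequence is an assumption of this sketch; the article does not say how the first few time steps are handled.

```python
def select_past_states(history, t, offsets=(5, 3, 1)):
    """Return the states at t-5, t-3 and t-1 for the current step t.

    history: list of state vectors, history[k] being the state at step k.
    Steps that fall before the start of the sequence are replaced by a
    zero vector (an assumption, not something the article specifies).
    """
    dim = len(history[0])
    zero = [0.0] * dim
    return [history[t - k] if t - k >= 0 else zero for k in offsets]

# Toy 1-D states: the state at step k is simply [k].
history = [[float(k)] for k in range(10)]
late = select_past_states(history, t=8)   # uses steps 3, 5 and 7
early = select_past_states(history, t=2)  # steps -3 and -1 are padded
```

The selected states would then be weighted and summed before entering the recurrent update for step t.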
Experiments
The experiments use the IEMOCAP dataset with four emotion classes (happy, angry, sad, neutral), which gives 4,490 utterances. One male and one female speaker are held out for testing; the remaining speakers are used for training, with 10% of that data held out for validation. The evaluation metrics are the unweighted average F‑score (MAF), the unweighted average precision (MAP), and accuracy.
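The three metrics can be computed as in the sketch below. "Unweighted average" is read here as the plain mean of the per-class scores, so every class counts equally regardless of how many utterances it has. The function name and the toy labels are illustrative only.

```python
def macro_metrics(y_true, y_pred, labels):
    """Compute unweighted average precision (MAP), F-score (MAF), accuracy."""
    precisions, f_scores = [], []
    pairs = list(zip(y_true, y_pred))
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        precisions.append(precision)
        f_scores.append(f_score)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {
        "MAP": sum(precisions) / len(labels),  # mean of per-class precision
        "MAF": sum(f_scores) / len(labels),    # mean of per-class F-score
        "accuracy": accuracy,
    }

# Toy predictions (not results from the article).
y_true = ["happy", "happy", "sad", "angry"]
y_pred = ["happy", "sad", "sad", "angry"]
scores = macro_metrics(y_true, y_pred, labels=["happy", "sad", "angry"])
```

Because each class contributes equally to MAP and MAF, a system that ignores a rare emotion is penalized even if its overall accuracy stays high.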
Thirty‑six acoustic features (e.g., MFCC, zero‑crossing rate, energy, spectral centroid, chroma vectors, harmonic ratio, pitch) are extracted, normalized at the utterance level, and fed into the system.
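Utterance-level normalization can be sketched as follows. The article does not say which normalization is used; this sketch assumes per-feature z-scoring (zero mean, unit variance) computed over the frames of a single utterance, and the small `eps` term is added here only to avoid division by zero.

```python
import math

def normalize_utterance(frames, eps=1e-8):
    """Z-score each feature dimension using statistics of one utterance.

    frames: list of feature vectors for a single utterance, one per frame.
    The choice of z-scoring is an assumption of this sketch.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]

# Toy utterance with two frames and two features (the real system uses 36).
normalized = normalize_utterance([[1.0, 10.0], [3.0, 10.0]])
```

Normalizing per utterance removes offsets that are constant within a recording, such as overall loudness, while keeping the variation across frames.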
The network consists of a fully connected layer (256 ReLU units) followed by a bidirectional LSTM (256 units per direction) or a bidirectional A‑LSTM, then the attention‑based weighted‑pooling layer, and finally three task‑specific output layers. The three tasks are weighted 1.0 (emotion), 0.3 (speaker), and 0.6 (gender).
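One common way to apply such task weights is a weighted sum of the per-task losses, sketched below. The weights come from the article; treating them as loss weights and using cross-entropy for each task are assumptions of this sketch.

```python
import math

# Task weights as given in the article.
TASK_WEIGHTS = {"emotion": 1.0, "speaker": 0.3, "gender": 0.6}

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class (assumed loss)."""
    return -math.log(probs[target_index])

def multitask_loss(losses, weights=TASK_WEIGHTS):
    """Combine per-task losses into a single training objective."""
    return sum(weights[task] * losses[task] for task in weights)

# Toy per-task losses for a single utterance (illustrative values only).
losses = {"emotion": 1.0, "speaker": 2.0, "gender": 0.5}
total = multitask_loss(losses)
```

Giving the emotion task the largest weight keeps it as the main objective, while the speaker and gender tasks act as regularizing side signals.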
Results show that replacing the LSTM with A‑LSTM improves emotion recognition accuracy by 5.5% over the baseline while adding only a few hundred parameters.
Conclusion
The A‑LSTM‑enhanced system outperforms the traditional LSTM baseline, confirming that the more flexible temporal dependency modeling provided by A‑LSTM leads to better performance, without significant computational overhead.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
