
Enlarging Long‑time Dependencies via Reinforcement‑Learning‑Based Memory Network for Movie Affective Analysis

The authors introduce a reinforcement‑learning‑driven memory network that augments long‑range dependencies for continuous valence‑arousal emotion prediction in movies, integrating five multimodal features and a DDPG‑based update policy, which yields state‑of‑the‑art performance across multiple affective‑analysis and summarization benchmarks.

Youku Technology

This work, authored by Zhang Jie, Zhao Yin, and Qian Kai from the Alibaba Entertainment AI Brain (Beidouxing) team, was published at ACM MM 2022 under the title "Enlarging the Long‑time Dependencies via RL‑based Memory Network in Movie Affective Analysis".

Background

High‑scoring movies succeed because they evoke strong audience empathy. Predicting a film’s emotional impact before release would greatly aid evaluation, editing, and marketing. The authors therefore study movie affective‑effect prediction.

Introduction

The goal is to predict viewers’ continuous Valence‑Arousal (VA) emotions while watching a film. VA provides a finer‑grained description than discrete labels (e.g., happy, sad). Accurate prediction requires modeling long‑range contextual information, which traditional sequence models (LSTM, Transformer) struggle with on very long video sequences.

Proposed Solution

The authors propose a reinforcement‑learning‑driven memory network that stores historical information and learns an update policy for the memory. Advantages include:

Enhanced memory capacity via a dedicated memory module.

Reduced computation and storage: temporal‑difference RL updates replace backpropagation through the full history, sidestepping vanishing and exploding gradients.

Effective capture of long‑term dependencies using value and policy networks.

Feature Extraction

Five modalities are extracted for each video segment:

Audio features via VGGish.

Background‑music emotion features.

Scene features from Places365‑pretrained VGG16.

Human pose features from OpenPose.

Facial expression features from Xception pretrained on RAF.

All modality vectors are temporally aligned, concatenated, and fused with an LSTM; the final hidden state becomes the segment representation.
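The fusion step above can be sketched in NumPy. The per-modality feature dimensions, hidden size, and the single-layer LSTM cell below are assumptions for illustration; the paper only specifies that aligned modality vectors are concatenated and fused with an LSTM whose final hidden state becomes the segment representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-segment feature dimensions for the five modalities
# (audio, background music, scene, pose, face); actual sizes are assumptions.
DIMS = {"audio": 128, "music": 32, "scene": 512, "pose": 64, "face": 128}
HIDDEN = 256

def fuse_segment(modality_feats):
    """Concatenate temporally aligned modality features into (T, sum_of_dims)."""
    return np.concatenate([modality_feats[m] for m in DIMS], axis=-1)

class LSTMCell:
    """Minimal NumPy LSTM cell standing in for the paper's fusion LSTM."""
    def __init__(self, in_dim, hid):
        self.W = rng.standard_normal((4 * hid, in_dim + hid)) * 0.01
        self.b = np.zeros(4 * hid)
    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        c = sig(f) * c + sig(i) * np.tanh(g)   # gated cell-state update
        h = sig(o) * np.tanh(c)                # gated hidden output
        return h, c

T = 8                                    # frames in one segment (assumed)
feats = {m: rng.standard_normal((T, d)) for m, d in DIMS.items()}
x_seq = fuse_segment(feats)              # (T, 864)
cell = LSTMCell(x_seq.shape[1], HIDDEN)
h = c = np.zeros(HIDDEN)
for t in range(T):
    h, c = cell.step(x_seq[t], h, c)
segment_repr = h                         # final hidden state = segment vector
```

The final `segment_repr` is what the memory network consumes at each time step.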

Reinforcement‑Learning‑Based Memory Network

Memory Module

The memory is a read‑write matrix M ∈ ℝ^{N×d}, initialized as learnable parameters. At each time step the segment representation and the current memory form the state input to the policy network μ, which outputs actions that selectively update memory slots.
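A minimal sketch of this state construction and policy, assuming the state is simply the segment vector concatenated with the flattened memory and that μ is a one-layer network with bounded outputs; the paper's actual architecture and state encoding are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10, 64                             # memory slots x slot dimension
M = rng.standard_normal((N, d)) * 0.1     # learnable initial memory (toy values)
seg = rng.standard_normal(d)              # segment vector from the fusion LSTM

# State for the policy network mu: [segment ; flattened memory].
state = np.concatenate([seg, M.ravel()])

# One-layer stand-in for mu: maps the state to four continuous action
# vectors (erase, add, write, retain), each of slot dimension d.
W_mu = rng.standard_normal((4 * d, state.size)) * 0.01
actions = np.tanh(W_mu @ state)           # tanh keeps actions bounded
erase, add, write, retain = np.split(actions, 4)
```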

Update Strategy

Actions consist of four continuous vectors (erase, add, write, retain). The memory update follows a formula similar to Neural Turing Machines, using a softmax‑weighted combination of memory slots and a fully‑connected layer to produce the final prediction.
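An NTM-style erase-then-add update with soft addressing can be sketched as follows. The addressing-by-key, sigmoid erase gate, and output-layer weights are assumptions; the paper states only that the update resembles Neural Turing Machines and that a softmax-weighted combination of slots feeds a fully connected prediction layer.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 10, 64                       # 10 slots, per the paper's ablation

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_memory(M, key, erase, add):
    """NTM-style write: soft addressing over slots, then per-slot
    erase-then-add, so only strongly addressed slots change much."""
    w = softmax(M @ key)                   # (N,) addressing weights
    M = M * (1.0 - np.outer(w, erase))     # selectively erase
    M = M + np.outer(w, add)               # selectively add
    return M, w

M = rng.standard_normal((N, d)) * 0.1
key = rng.standard_normal(d)                        # from the policy action
erase = 1.0 / (1.0 + np.exp(-rng.standard_normal(d)))  # erase gate in (0,1)
add = rng.standard_normal(d)

M, w = update_memory(M, key, erase, add)
read = w @ M                                # softmax-weighted memory read
# Final VA prediction from a fully connected layer (weights hypothetical).
W_out, b_out = rng.standard_normal((2, d)) * 0.01, np.zeros(2)
va_pred = W_out @ read + b_out              # (valence, arousal)
```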

Reward and Value Network

Because the task is regression, the reward is defined as the negative MAE between prediction and ground‑truth VA. A post‑state value network Q(s′) estimates future expected reward, improving training stability compared with traditional action‑value networks.
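The reward itself is a one-liner:

```python
import numpy as np

def reward(pred_va, true_va):
    """Reward for the RL agent: negative mean absolute error between
    predicted and ground-truth valence-arousal (a regression task,
    so lower error means higher reward)."""
    return -np.mean(np.abs(np.asarray(pred_va) - np.asarray(true_va)))
```

A perfect prediction earns reward 0, and any error pushes the reward negative, which the post-state value network Q(s′) then learns to anticipate.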

Model Training

Objective Function

The authors follow the DDPG algorithm, iteratively updating the policy network μ and value network Q. Target networks μ′ and Q′ are updated with a soft‑update rule (θ′ ← τθ + (1−τ)θ′) to stabilize learning.
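The soft update is a simple exponential blend of online weights into target weights. The value τ = 0.005 below is a common DDPG default, not a figure taken from the paper:

```python
import numpy as np

def soft_update(online_params, target_params, tau=0.005):
    """DDPG target-network soft update: theta' <- tau*theta + (1-tau)*theta'.
    A small tau makes the targets track the online networks slowly,
    which stabilizes the bootstrapped value targets."""
    return [tau * th + (1.0 - tau) * th_t
            for th, th_t in zip(online_params, target_params)]

theta = [np.ones(3)]                     # online network weights (toy example)
theta_t = [np.zeros(3)]                  # target network weights
theta_t = soft_update(theta, theta_t)    # each entry moves a fraction tau
```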

Exploration Strategy

Gaussian noise N(0,0.05) is added to actions, followed by clipping and normalization, to encourage exploration.
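A sketch of that exploration step, assuming 0.05 is the noise standard deviation, a clip range of [−1, 1], and L1 renormalization; the article does not pin down these details.

```python
import numpy as np

rng = np.random.default_rng(2)

def explore(action, sigma=0.05, lo=-1.0, hi=1.0):
    """Perturb a continuous action with Gaussian noise, clip it back into
    a valid range, then renormalize so the action keeps a fixed scale."""
    noisy = action + rng.normal(0.0, sigma, size=action.shape)
    noisy = np.clip(noisy, lo, hi)
    return noisy / (np.abs(noisy).sum() + 1e-8)   # L1 normalization (assumed)

a = explore(np.zeros(4))   # pure noise around a zero action
```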

Results

Extensive experiments on multiple datasets (LIRIS‑ACCEDE for video emotion, PMemo for music emotion, TVSum and SumMe for video summarization) show consistent SOTA improvements.

Tables 1–4 (presented as images in the original post) compare the proposed method against recent baselines, demonstrating higher accuracy and lower error.

Ablation Studies

Memory size experiments reveal optimal performance at a memory capacity of 10 slots; larger sizes lead to over‑fitting.

Comparisons between the RL‑driven update, a vanilla memory network, and TBPTT show the RL approach yields the best results.

Visualization of memory updates demonstrates that certain memory slots correlate strongly with the valence dimension, confirming that the network stores emotion‑relevant information.

Modality ablation shows that scene features contribute the most among single modalities, while fusing all five modalities yields the best performance.

Future Work

The authors plan to explore alternative RL algorithms, design more effective multimodal fusion strategies, and apply the framework to other tasks such as action detection.

References

Key references include the original LSTM paper, the Transformer paper, DDPG, Neural Turing Machines, and several multimodal and affective‑computing datasets.

Tags: reinforcement learning, long‑term dependencies, memory network, movie emotion analysis, multimodal fusion, VA affect model