How BC‑GNN Improves Temporal Action Proposals with Boundary‑Content Graph Modeling

The paper introduces Boundary Content Graph Neural Network (BC‑GNN), a graph‑based approach that jointly models boundary and content predictions to generate more accurate temporal action proposals and reliable confidence scores, achieving state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS‑14.

ITPUB

Overview

Temporal action proposal generation aims to locate high‑quality action segments in long, untrimmed videos. Existing pipelines first predict start and end boundaries, then combine them into candidate proposals, and only afterwards estimate content confidence. This staged design ignores the interdependence between boundary and content predictions. Boundary‑Content Graph Neural Network (BC‑GNN) addresses the limitation by representing start points and end points as graph nodes and the video segment between them as an edge, allowing joint reasoning over boundaries and content.

Method

1. Feature Extraction

A two‑stream network (spatial RGB branch and temporal optical‑flow branch) processes each video snippet. For a video divided into T snippets, each snippet is encoded into a D‑dimensional feature vector by concatenating the final‑layer outputs of the two branches, producing a T×D feature matrix.
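The shapes involved can be illustrated with a minimal sketch. The function name and the toy vectors below are assumptions made for illustration; in practice the per‑snippet features come from the trained two‑stream network.

```python
def concat_two_stream(rgb_feats, flow_feats):
    """Concatenate per-snippet RGB and optical-flow features.

    rgb_feats and flow_feats are lists of T vectors (lists of floats).
    Returns a T x D matrix with D = D_rgb + D_flow.
    """
    if len(rgb_feats) != len(flow_feats):
        raise ValueError("both streams must cover the same T snippets")
    return [list(r) + list(f) for r, f in zip(rgb_feats, flow_feats)]


# Toy example (made-up numbers): T = 3 snippets, 2-dim features per stream.
rgb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
flow = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]
features = concat_two_stream(rgb, flow)
print(len(features), len(features[0]))  # 3 4
```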

2. Base Module

Two 1‑D convolutional layers enlarge the temporal receptive field and serve as the backbone for subsequent modules.
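To see how stacking two temporal convolutions widens the receptive field, here is a small pure‑Python sketch. The kernel size of 3, the zero padding, the ReLU activation, and the all‑ones weights are assumptions chosen for illustration, not values taken from the paper.

```python
def conv1d_relu(x, weight, bias):
    """Temporal 1-D convolution with zero 'same' padding followed by ReLU.

    x:      T x C_in matrix (list of lists)
    weight: C_out x K x C_in kernel, with K odd
    bias:   list of C_out floats
    Returns a T x C_out matrix.
    """
    T, k, c_in = len(x), len(weight[0]), len(x[0])
    half = k // 2
    out = []
    for t in range(T):
        row = []
        for w_o, b_o in zip(weight, bias):
            acc = b_o
            for dk in range(k):
                src = t + dk - half
                if 0 <= src < T:  # positions outside the video count as zeros
                    acc += sum(w_o[dk][c] * x[src][c] for c in range(c_in))
            row.append(max(0.0, acc))
        out.append(row)
    return out


# Toy input: T = 4 snippets with one channel; one output channel, K = 3.
x = [[1.0], [2.0], [3.0], [4.0]]
w = [[[1.0], [1.0], [1.0]]]
b = [0.0]
hidden = conv1d_relu(x, w, b)     # each output sees 3 neighbouring snippets
base = conv1d_relu(hidden, w, b)  # each output now sees up to 5 snippets
print(hidden)  # [[3.0], [6.0], [9.0], [7.0]]
print(base)    # [[9.0], [18.0], [22.0], [16.0]]
```

With kernel size 3, one layer mixes 3 adjacent snippets and two layers mix 5, which is the sense in which the base module enlarges the temporal receptive field.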

3. Graph Construction Module

A bipartite graph G = (U, V, E) is built, where U contains start‑point nodes n_{s,i} (at time t_{s,i}) and V contains end‑point nodes n_{e,j} (at time t_{e,j}). An edge d_{i,j} exists only if t_{e,j} > t_{s,i}. Edge features are obtained by linearly interpolating the content feature matrix F_c between rows i and j to sample N points, reshaping the resulting N×D' matrix into a vector, and passing it through a fully‑connected layer to produce a D'‑dimensional edge embedding. The graph is then made directed: each undirected edge is replaced by two opposite directed edges that share the same embedding.
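The construction can be sketched as follows, under some simplifying assumptions: every temporal position i provides one start node and one end node with time equal to its index, and the fully‑connected projection to D' dimensions is left out, so the edge feature is just the flattened sampled matrix. The helper names are hypothetical.

```python
def interpolate_rows(fc, i, j, n):
    """Sample n points, linearly spaced between rows i and j of fc."""
    samples = []
    for k in range(n):
        pos = (i + (j - i) * k / (n - 1)) if n > 1 else float(i)
        lo = int(pos)
        hi = min(lo + 1, len(fc) - 1)
        frac = pos - lo
        # Linear blend of the two neighbouring rows.
        samples.append([(1 - frac) * a + frac * b
                        for a, b in zip(fc[lo], fc[hi])])
    return samples


def build_edges(start_times, end_times, fc, n):
    """Return {(i, j): flattened sampled features} for every valid pair."""
    edges = {}
    for i, ts in enumerate(start_times):
        for j, te in enumerate(end_times):
            if te > ts:  # an edge exists only if the end follows the start
                sampled = interpolate_rows(fc, i, j, n)
                edges[(i, j)] = [v for row in sampled for v in row]
    return edges


# Toy content features: T = 4 positions, one channel, N = 3 samples per edge.
fc = [[0.0], [2.0], [4.0], [6.0]]
times = list(range(len(fc)))
edges = build_edges(times, times, fc, n=3)
print(len(edges))     # 6
print(edges[(1, 2)])  # [2.0, 3.0, 4.0]
```

Only pairs with the end after the start produce an edge, so a video with T positions yields T(T-1)/2 undirected edges in this simplified setting.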

4. Graph Reasoning Module

Reasoning proceeds in two alternating steps.

Edge update: For each directed edge e_{h→t}, the new edge feature e'_{h→t} is computed by aggregating the features of its two incident nodes: e'_{h→t} = σ(θ_{s2e}·n_h + θ_{e2s}·n_t), where n_h and n_t are the start‑ and end‑node features, θ_{s2e} and θ_{e2s} are learnable linear transforms, and σ is ReLU.

Node update: For each node n_h, the incoming edge features are first normalized and then aggregated as a weighted sum to update the node: n'_h = σ(θ_{node}·(∑_{e∈E_h} α_e·e)), where E_h is the set of edges entering n_h, α_e is the normalized weight of edge e, and θ_{node} is a learnable matrix. This step fuses boundary and content information.

The updates are applied iteratively, allowing boundary and content features to refine each other.
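A minimal sketch of one reasoning round, following the two formulas as written in this summary. The article does not say how the weights α_e are normalized, so uniform weights (1 / number of incoming edges) are used here purely as a placeholder assumption; the identity matrices stand in for learned parameters.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def relu(v):
    return [max(0.0, x) for x in v]


def edge_update(n_h, n_t, theta_s2e, theta_e2s):
    """e' = ReLU(theta_s2e . n_h + theta_e2s . n_t)."""
    a, b = matvec(theta_s2e, n_h), matvec(theta_e2s, n_t)
    return relu([x + y for x, y in zip(a, b)])


def node_update(incoming, theta_node, weights=None):
    """n' = ReLU(theta_node . sum_e alpha_e * e)."""
    if weights is None:
        # Assumption: uniform normalized weights, not the paper's scheme.
        weights = [1.0 / len(incoming)] * len(incoming)
    dim = len(incoming[0])
    agg = [sum(w * e[d] for w, e in zip(weights, incoming))
           for d in range(dim)]
    return relu(matvec(theta_node, agg))


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned transforms
new_edge = edge_update([1.0, -2.0], [0.5, 1.0], identity, identity)
new_node = node_update([new_edge, [0.5, 2.0]], identity)
print(new_edge)  # [1.5, 0.0]
print(new_node)  # [1.0, 1.0]
```

Running `edge_update` over all edges and then `node_update` over all nodes, and repeating, corresponds to the iterative refinement described above.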

5. Output Module

After reasoning, a linear classifier predicts:

Start‑point probability from updated start‑node features.

End‑point probability from updated end‑node features.

Content confidence from updated edge features.

Proposals are generated by pairing start and end nodes whose temporal order is valid and scoring them with the product of the three predictions.
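The pairing and scoring rule can be sketched as below. The probabilities are made‑up toy values, the function name is hypothetical, and any post‑processing of the ranked list is outside the scope of this sketch.

```python
def score_proposals(p_start, p_end, p_content):
    """Score each valid (start, end) pair by the product of three predictions.

    p_start[i]        : start probability at position i
    p_end[j]          : end probability at position j
    p_content[(i, j)] : content confidence of the segment from i to j
    Returns (i, j, score) tuples sorted by descending score.
    """
    proposals = []
    for (i, j), conf in p_content.items():
        if j > i:  # keep only pairs in valid temporal order
            proposals.append((i, j, p_start[i] * p_end[j] * conf))
    proposals.sort(key=lambda p: p[2], reverse=True)
    return proposals


p_start = [0.9, 0.2, 0.1]
p_end = [0.1, 0.3, 0.8]
p_content = {(0, 1): 0.5, (0, 2): 0.9, (1, 2): 0.4}
ranked = score_proposals(p_start, p_end, p_content)
best_start, best_end, best_score = ranked[0]
print(best_start, best_end)  # 0 2
```

Because the score is a product, a proposal ranks highly only when its start, its end, and its content are all predicted with high confidence.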

Experiments

BC‑GNN was evaluated on two public benchmarks:

ActivityNet‑1.3

THUMOS‑14

Both temporal action proposal generation and temporal action detection were tested.

Proposal Generation

BC‑GNN achieved state‑of‑the‑art results on both datasets, surpassing previous methods in average recall (AR) and in the area under the AR versus average‑number‑of‑proposals curve (AUC).

Action Detection

When the generated proposals were fed to an action classifier, the resulting detector attained top‑ranking mean average precision (mAP) on the detection task.

Ablation Study

Two design choices were validated:

Converting the undirected bipartite graph to a directed graph, which lets the model make use of temporal ordering information.

Adding explicit edge‑feature updates further boosts performance.

Both modifications contributed positively to the final results on ActivityNet‑1.3.

Key Contributions

Joint modeling of boundary and content via a boundary‑content graph, enabling mutual refinement.

A novel graph reasoning mechanism that updates node and edge features with fused boundary‑content information.

Demonstration that the graph‑based approach can be extended to related video‑understanding tasks.

References

ECCV 2020 proceedings: https://eccv2020.eu/accepted-papers

arXiv preprint: http://arxiv.org/abs/2008.01432
