How BC‑GNN Improves Temporal Action Proposals with Boundary‑Content Graph Modeling
The paper introduces Boundary Content Graph Neural Network (BC‑GNN), a graph‑based approach that jointly models boundary and content predictions to generate more accurate temporal action proposals and reliable confidence scores, achieving state‑of‑the‑art results on ActivityNet‑1.3 and THUMOS‑14.
Overview
Temporal action proposal generation aims to locate high‑quality action segments in long, untrimmed videos. Existing pipelines predict start/end boundaries first, then combine them into proposals and finally estimate content confidence, which ignores the interdependence between boundary and content predictions. Boundary‑Content Graph Neural Network (BC‑GNN) addresses this limitation by representing start points and end points as graph nodes and the corresponding video segment as an edge, allowing joint reasoning over boundaries and content.
Method
1. Feature Extraction
A two‑stream network (a spatial RGB branch and a temporal optical‑flow branch) processes each video snippet. For a video divided into T snippets, each snippet is encoded into a D‑dimensional feature vector by concatenating the final‑layer outputs of the two branches, producing a T×D feature matrix.
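The concatenation step can be sketched in a few lines of NumPy. This is only an illustration: the function name and the feature sizes (T = 100 snippets, 200 features per stream) are assumptions, and random arrays stand in for real two‑stream outputs.

```python
import numpy as np

def concat_two_stream(rgb_feats, flow_feats):
    """Concatenate per-snippet RGB and optical-flow features along the channel axis."""
    if rgb_feats.shape[0] != flow_feats.shape[0]:
        raise ValueError("both streams must cover the same T snippets")
    return np.concatenate([rgb_feats, flow_feats], axis=1)

# Hypothetical sizes: T = 100 snippets, 200 features per stream, so D = 400.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((100, 200))    # stand-in for the spatial branch output
flow = rng.standard_normal((100, 200))   # stand-in for the temporal branch output
features = concat_two_stream(rgb, flow)
print(features.shape)  # (100, 400)
```

Each row of the result describes one snippet, so the time axis stays intact for the temporal layers that follow.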
2. Base Module
Two 1‑D convolutional layers enlarge the temporal receptive field and serve as the backbone for subsequent modules.
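To make the receptive‑field point concrete, here is a minimal NumPy sketch of two stacked 1‑D convolutions over the time axis. The kernel size of 3, the hidden width of 128, and the ReLU after each layer are assumptions chosen for illustration; the summary does not state the actual hyperparameters.

```python
import numpy as np

def conv1d_same(x, weight, bias):
    """1-D convolution over time, zero-padded so the output keeps length T.

    x: (T, C_in), weight: (K, C_in, C_out) with odd K, bias: (C_out,)
    """
    k = weight.shape[0]
    pad = k // 2
    x_pad = np.pad(x, ((pad, pad), (0, 0)))  # pad the time axis only
    out = np.empty((x.shape[0], weight.shape[2]))
    for t in range(x.shape[0]):
        window = x_pad[t:t + k]              # (K, C_in) slice centred on t
        out[t] = np.tensordot(window, weight, axes=([0, 1], [0, 1])) + bias
    return out

def base_module(x, params):
    """Apply each (weight, bias) convolution followed by ReLU."""
    for weight, bias in params:
        x = np.maximum(conv1d_same(x, weight, bias), 0.0)
    return x

rng = np.random.default_rng(0)
T, D, hidden = 100, 400, 128  # hypothetical sizes
params = [
    (rng.standard_normal((3, D, hidden)) * 0.01, np.zeros(hidden)),
    (rng.standard_normal((3, hidden, hidden)) * 0.01, np.zeros(hidden)),
]
base_features = base_module(rng.standard_normal((T, D)), params)
print(base_features.shape)  # (100, 128)
```

With two kernel‑size‑3 layers, each output row depends on five consecutive snippets, which is the sense in which stacking convolutions enlarges the temporal receptive field.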
3. Graph Construction Module
A bipartite graph G = (U, V, E) is built, where U contains the start‑point nodes n_{s,i} (at time t_{s,i}) and V contains the end‑point nodes n_{e,j} (at time t_{e,j}). An edge d_{i,j} exists if t_{e,j} > t_{s,i}, that is, only when the end point comes after the start point. Edge features are obtained by linearly interpolating the content feature matrix F_c between rows i and j at N sample points, reshaping the resulting N×D' matrix into a vector, and passing it through a fully‑connected layer to produce a D'‑dimensional edge embedding. The graph is then made directed: each undirected edge is replaced by two opposite directed edges that share the same embedding.
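The construction can be sketched as follows. Several details here are assumptions rather than facts from the paper: node i is placed at snippet index i, the helper names are hypothetical, and the fully‑connected layer is a plain random weight matrix.

```python
import numpy as np

def build_edges(start_times, end_times):
    """Return (i, j) pairs for which end time j is later than start time i."""
    return [(i, j)
            for i, ts in enumerate(start_times)
            for j, te in enumerate(end_times)
            if te > ts]

def sample_segment(content, i, j, num_samples):
    """Linearly interpolate num_samples rows of `content` between rows i and j."""
    positions = np.linspace(i, j, num_samples)
    lo = np.floor(positions).astype(int)
    hi = np.minimum(lo + 1, content.shape[0] - 1)
    frac = (positions - lo)[:, None]
    return (1.0 - frac) * content[lo] + frac * content[hi]

def edge_embedding(content, i, j, fc_weight, fc_bias, num_samples):
    """Flatten the sampled N x D' block and project it back to D' dimensions."""
    sampled = sample_segment(content, i, j, num_samples)
    return sampled.reshape(-1) @ fc_weight + fc_bias

def to_directed(edges):
    """Replace each (start i, end j) edge with two opposite directed edges."""
    directed = []
    for i, j in edges:
        directed.append((("start", i), ("end", j)))
        directed.append((("end", j), ("start", i)))
    return directed

rng = np.random.default_rng(0)
T, D_prime, N = 6, 4, 3                      # hypothetical sizes
content = rng.standard_normal((T, D_prime))  # stand-in for F_c
times = list(range(T))                       # assumption: node i sits at snippet i
edges = build_edges(times, times)
fc_weight = rng.standard_normal((N * D_prime, D_prime)) * 0.1
fc_bias = np.zeros(D_prime)
embeddings = {(i, j): edge_embedding(content, i, j, fc_weight, fc_bias, N)
              for i, j in edges}
directed = to_directed(edges)
print(len(edges), len(directed))  # 15 30
```

Sampling a fixed number of points N is what lets segments of very different lengths share one fully‑connected layer.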
4. Graph Reasoning Module
Reasoning proceeds in two alternating steps.
Edge update: For each directed edge e_{h→t}, the new edge feature e'_{h→t} is computed from the features of its two incident nodes: e'_{h→t} = σ(θ_{s2e}·n_h + θ_{e2s}·n_t), where n_h and n_t are the features of the source node h and the target node t (the start and end nodes for a start‑to‑end edge), θ_{s2e} and θ_{e2s} are learnable linear transforms, and σ is ReLU.
Node update: For each node n_h, the features of its incoming edges are first normalized and then combined as a weighted sum to update the node: n'_h = σ(θ_{node}·∑_{e∈E_h} α_e·e), where E_h is the set of those edges, α_e is the normalized weight of edge e, and θ_{node} is a learnable matrix. This step fuses boundary and content information.
The updates are applied iteratively, allowing boundary and content features to refine each other.
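The two update rules can be sketched directly from the formulas as summarized above. Note the assumptions: the normalization of α_e is not specified here, so the sketch uses a softmax over the mean activation of each incoming edge, and all sizes, node keys, and weights are hypothetical. Because the summarized edge update depends only on node features, this sketch does not carry the interpolated content embedding through; the paper's exact update may differ.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def update_edges(nodes, pairs, theta_s2e, theta_e2s):
    """e'_{h->t} = ReLU(theta_s2e @ n_h + theta_e2s @ n_t) for every directed pair."""
    return {(h, t): relu(theta_s2e @ nodes[h] + theta_e2s @ nodes[t])
            for (h, t) in pairs}

def update_nodes(nodes, edges, theta_node):
    """n'_h = ReLU(theta_node @ sum_e alpha_e * e) over the edges pointing into h."""
    new_nodes = {}
    for h in nodes:
        incoming = [feat for (_, t), feat in edges.items() if t == h]
        if not incoming:
            new_nodes[h] = nodes[h]          # isolated node: leave unchanged
            continue
        stacked = np.stack(incoming)         # (num_edges, D)
        scores = stacked.mean(axis=1)        # assumed scalar score per edge
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                 # softmax, so the weights sum to 1
        new_nodes[h] = relu(theta_node @ (alpha[:, None] * stacked).sum(axis=0))
    return new_nodes

rng = np.random.default_rng(0)
D = 4
keys = [("start", 0), ("start", 1), ("end", 1), ("end", 2)]
nodes = {key: rng.standard_normal(D) for key in keys}
pairs = [(("start", 0), ("end", 1)),
         (("start", 0), ("end", 2)),
         (("start", 1), ("end", 2))]
pairs += [(t, h) for (h, t) in pairs]        # add the opposite direction
theta_s2e, theta_e2s, theta_node = (rng.standard_normal((D, D)) * 0.5
                                    for _ in range(3))

for _ in range(2):                           # two reasoning iterations
    edges = update_edges(nodes, pairs, theta_s2e, theta_e2s)
    nodes = update_nodes(nodes, edges, theta_node)
print(len(edges), len(nodes))  # 6 4
```

Alternating the two functions is what allows boundary features (nodes) and content features (edges) to influence each other across iterations.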
5. Output Module
After reasoning, a linear classifier predicts:
Start‑point probability from updated start‑node features.
End‑point probability from updated end‑node features.
Content confidence from updated edge features.
Proposals are generated by pairing start and end nodes whose temporal order is valid and scoring them with the product of the three predictions.
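The pairing and scoring rule is simple enough to show in plain Python. The probabilities below are made‑up values for illustration, and the function name is hypothetical.

```python
def score_proposals(start_prob, end_prob, content_conf, start_times, end_times):
    """Pair every start with every later end and score by the product of the three outputs."""
    proposals = []
    for i, ts in enumerate(start_times):
        for j, te in enumerate(end_times):
            if te > ts:  # keep only temporally valid pairs
                score = start_prob[i] * end_prob[j] * content_conf[(i, j)]
                proposals.append((ts, te, score))
    # Highest-scoring proposals first.
    return sorted(proposals, key=lambda p: p[2], reverse=True)

# Made-up predictions for two start nodes and two end nodes.
start_times = [0.0, 2.0]
end_times = [1.0, 3.0]
start_prob = [0.9, 0.5]
end_prob = [0.4, 0.8]
content_conf = {(0, 0): 0.5, (0, 1): 0.75, (1, 1): 0.5}

ranked = score_proposals(start_prob, end_prob, content_conf, start_times, end_times)
for ts, te, score in ranked:
    print(f"[{ts}, {te}] score={score:.2f}")
```

The pair (start 2.0, end 1.0) is dropped because its end does not come after its start, so only three proposals are scored.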
Experiments
BC‑GNN was evaluated on two public benchmarks:
ActivityNet‑1.3
THUMOS‑14
Both temporal action proposal generation and temporal action detection were tested.
Proposal Generation
The paper reports state‑of‑the‑art average recall (AR) and area under the AR‑versus‑number‑of‑proposals curve (AUC) on both datasets.
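For readers unfamiliar with the metric, the following is a simplified sketch of how recall is averaged over temporal IoU (tIoU) thresholds. It is not the official evaluation code: it ignores the limit on the average number of proposals per video, and the segments are made‑up values. The threshold range 0.50 to 0.95 in steps of 0.05 is the one commonly used on ActivityNet‑1.3.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(proposals, ground_truth, threshold):
    """Fraction of ground-truth segments matched by at least one proposal."""
    hits = sum(1 for gt in ground_truth
               if any(tiou(p, gt) >= threshold for p in proposals))
    return hits / len(ground_truth)

def average_recall(proposals, ground_truth, thresholds):
    """Mean of recall over a list of tIoU thresholds."""
    return sum(recall_at(proposals, ground_truth, t) for t in thresholds) / len(thresholds)

thresholds = [k / 100 for k in range(50, 100, 5)]  # 0.50, 0.55, ..., 0.95
ground_truth = [(0.0, 10.0), (20.0, 30.0)]          # made-up annotations
proposals = [(0.0, 8.3), (21.2, 30.0)]              # made-up proposals
ar = average_recall(proposals, ground_truth, thresholds)
print(round(ar, 3))
```

Averaging over increasingly strict thresholds rewards proposals whose boundaries are precise, not merely overlapping.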
Action Detection
When the generated proposals are fed to an action classifier, BC‑GNN attains top‑ranking mean average precision (mAP) on the detection task.
Ablation Study
Two design choices were validated:
Converting the undirected bipartite graph to a directed graph, which lets the model encode temporal ordering, improves performance.
Adding explicit edge‑feature updates further boosts performance.
Both modifications contributed positively to the final results on ActivityNet‑1.3.
Key Contributions
Joint modeling of boundary and content via a boundary‑content graph, enabling mutual refinement.
A novel graph reasoning mechanism that updates node and edge features with fused boundary‑content information.
Demonstration that the graph‑based approach can be extended to related video‑understanding tasks.
References
ECCV 2020 proceedings: https://eccv2020.eu/accepted-papers
arXiv preprint: http://arxiv.org/abs/2008.01432
