How Multiple‑Instance Learning Boosts Context Understanding in Video Anomaly Detection

The article reviews the CVPR 2021 MIST framework, explaining how a multiple‑instance pseudo‑label generator and a self‑guided attention encoder work together with sparse continuous sampling to improve context awareness and detection accuracy in weakly‑supervised video anomaly detection.

Network Intelligence Research Center (NIRC)

Paper overview – The CVPR 2021 paper MIST: Multiple Instance Self‑Training Framework for Video Anomaly Detection proposes a weakly‑supervised VAD system that combines a multiple‑instance pseudo‑label generator (G) with a self‑guided attention‑enhanced feature encoder (Esga). Video‑level anomaly labels are converted into clip‑level pseudo‑labels, which, together with normal clips, are used to train Esga. The method builds on the earlier "Real‑world Anomaly Detection in Surveillance Videos" work (CVPR 2018) and adds an attention module.

Overall pipeline – A schematic (see first image) shows the flow from raw video to feature extraction, pseudo‑label generation, and attention‑guided encoding. The authors highlight a “sparse continuous sampling strategy” that forces the network to focus on the context surrounding the most abnormal segments.

Part 2 – MIL pseudo‑label generation – The first stage follows the CVPR 2018 approach but adds sparsity. An encoder E (I3D or C3D) extracts clip features; these are fed to G to produce clip‑level scores. A diagram (second image) illustrates the process.

Part 3 – Sparse sampling for context focus – The paper distinguishes fine‑grained and coarse‑grained segmentation. Coarse segmentation can miss short anomalies; fine segmentation isolates peaks but ignores surrounding context. To capture the whole abnormal segment, the authors define a minimum abnormal duration and uniformly sample L subsets (SubBags), each containing T consecutive clips. L provides coarse coverage of the whole video, while T (set to the assumed shortest abnormal duration) provides fine granularity.

The procedure:

1. From the feature sequence F(1‑N), uniformly sample L subsets, each with T consecutive clips, forming L SubBags.

2. Feed each F(L, t) into G to obtain scores S(L, t).

3. Apply average pooling within each SubBag to smooth its T scores, ensuring temporal consistency.
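The steps above can be sketched in a few lines of NumPy; the function names, array shapes, and start‑index arithmetic are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sample_subbags(features, L, T):
    """Uniformly sample L SubBags, each holding T consecutive clip
    features, from an (N, D) feature sequence F(1..N).
    The start-index arithmetic is an illustrative assumption."""
    N = features.shape[0]
    starts = np.linspace(0, N - T, L).astype(int)         # L gives coarse coverage
    return np.stack([features[s:s + T] for s in starts])  # (L, T, D)

def smooth_scores(subbag_scores):
    """Average-pool the T scores inside each SubBag so that one smoothed
    score per SubBag enforces temporal consistency."""
    return subbag_scores.mean(axis=1)                     # (L, T) -> (L,)
```

Here T would be set to the assumed shortest abnormal duration, so even a brief anomaly falls entirely inside at least one SubBag.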

Part 4 – Training the pseudo‑label generator G – G is a three‑layer MLP with 512, 32, and 1 units, using dropout 0.6 between layers. Each SubBag is treated as an instance during training. The loss combines a max(0,·) term with a sparse regularization (ε = 1) to prevent over‑fitting, based on the assumption that anomalies are sparse. After training, G generates pseudo‑labels S(1‑N) for all clips, which are then smoothed with a moving‑average filter to reduce temporal jitter.
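As a rough sketch of that objective, the hinge term and sparsity regularizer, together with the moving‑average smoothing of the pseudo‑labels, might look like the following; the exact weighting and any additional terms of the paper's loss are not reproduced here, and eps mirrors the article's ε = 1:

```python
import numpy as np

def mil_loss(scores_abn, scores_nrm, eps=1.0):
    """Hinge-style MIL ranking term plus a sparsity regularizer (a sketch,
    not the paper's exact loss). scores_abn / scores_nrm: smoothed
    per-SubBag scores from one abnormal and one normal video."""
    hinge = max(0.0, 1.0 - scores_abn.max() + scores_nrm.max())
    sparsity = eps * scores_abn.sum()   # anomalies are assumed sparse
    return hinge + sparsity

def moving_average(pseudo_labels, k=3):
    """Smooth the clip-level pseudo-labels S(1..N) to reduce temporal jitter."""
    kernel = np.ones(k) / k
    return np.convolve(pseudo_labels, kernel, mode="same")
```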

Part 5 – Self‑guided attention feature encoder (Esga) – Esga has a two‑branch design: an attention‑map branch and a classification branch. The attention branch consists of three convolutional blocks (F1, F2, F3) as described in the paper, producing an attention map from the 4th block of encoder E (Mb‑4). The classification branch uses a head Hg, global average pooling, softmax, and cross‑entropy loss to guide the attention map.
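A minimal NumPy sketch of the two‑branch idea, assuming simple linear projections in place of the paper's convolutional blocks F1–F3 (all weight names and shapes here are hypothetical):

```python
import numpy as np

def attention_map(feat4, w1, w2):
    """Attention branch: project the stage-4 features Mb-4 (T, C) twice and
    squash with a sigmoid, yielding one attention weight per time step.
    Linear projections stand in for the paper's conv blocks F1-F3."""
    h = np.maximum(feat4 @ w1, 0.0)           # ReLU projection
    return 1.0 / (1.0 + np.exp(-(h @ w2)))    # (T, 1) attention map

def guided_logits(feat4, att, w_hg):
    """Classification branch: attention-weighted global average pooling
    followed by head Hg; its cross-entropy loss guides the attention map."""
    pooled = (feat4 * att).mean(axis=0)       # (C,)
    return pooled @ w_hg                      # class logits for softmax
```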

Part 6 – Anomaly detection branch – The final 5th‑stage features from E are combined with the attention map, passed through global average pooling, and fed to a classification head Hc that outputs anomaly scores. Each branch is trained with its own cross‑entropy loss (L1 and L2) plus regularization, but the final detection relies solely on the Hc scores.
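A corresponding sketch of the detection branch, again with hypothetical shapes and weight names; only the score from Hc matters at inference time:

```python
import numpy as np

def anomaly_score(feat5, att, w_hc):
    """Detection branch: reweight the stage-5 features (T, C) by the
    attention map (T, 1), apply global average pooling, and score with
    head Hc (modeled here as a single linear weight vector, an assumption)."""
    pooled = (feat5 * att).mean(axis=0)      # attention-weighted GAP -> (C,)
    logit = pooled @ w_hc                    # scalar anomaly logit
    return 1.0 / (1.0 + np.exp(-logit))      # anomaly score in [0, 1]
```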

The article concludes that the sparse continuous sampling and the self‑guided attention encoder together enable the model to capture richer contextual semantics around abnormal events, improving video anomaly detection performance under weak supervision.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Computer Vision · Weak Supervision · Self‑Training · Video Anomaly Detection · Attention Encoder · Multiple Instance Learning
Written by

Network Intelligence Research Center (NIRC)

NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—and is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and contributing to the advancement of China's network technology.
