Detecting Time‑Series Anomalies with the Anomaly Transformer’s Association Discrepancy
This article explains how the Anomaly Transformer uses the discrepancy between prior- and series-associations, a learnable Gaussian kernel, and a minimax training strategy to distinguish normal from abnormal points in time-series data, achieving state-of-the-art results on five benchmark datasets.
Introduction
Real‑world systems generate massive continuous time‑series data. Detecting anomalous points is essential for safety and cost avoidance. The paper “Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy” proposes a criterion that distinguishes normal from abnormal points by comparing two types of associations for each time step.
Anomaly Judgment
For a given time step t, the prior‑association measures similarity to its neighboring points, while the series‑association measures similarity to all points in the series via standard Transformer attention. When an anomaly occurs, nearby points are also likely anomalous, so both associations concentrate locally and become similar, yielding a small discrepancy. In normal regions the series‑association spreads globally, producing a larger discrepancy. The discrepancy is quantified as the mean Kullback‑Leibler (KL) divergence between the two association distributions and serves as the anomaly criterion: smaller values indicate likely anomalies.
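The point-wise criterion above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the two association matrices are already available as row-stochastic arrays, and the threshold is a hypothetical choice.

```python
import numpy as np

def anomaly_scores(P_prior, P_series, eps=1e-8):
    # Per-point discrepancy: KL(P_prior_t || P_series_t) for each time step t.
    # eps avoids log(0) on sparse distributions.
    return (P_prior * (np.log(P_prior + eps) - np.log(P_series + eps))).sum(axis=1)

def flag_anomalies(scores, threshold):
    # A *small* discrepancy means both associations concentrate locally,
    # which is the signature of an anomalous point.
    return scores < threshold
```

Note the inverted decision rule: unlike reconstruction-error detectors, low discrepancy (not high) marks the anomaly.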
Prior‑association is modeled with a learnable Gaussian kernel K_σ(i,j)=exp(-(i-j)^2/(2σ^2)) whose scale σ is learned. Series‑association is obtained from the softmax attention weights of a Transformer encoder. The loss term for association discrepancy is
ℒ_KL = (1/T) · Σ_{t=1}^{T} KL(P^{prior}_t ‖ P^{series}_t), where P^{prior}_t and P^{series}_t are the normalized prior and series attention distributions for time step t.
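The two associations and the discrepancy loss can be written down directly from these definitions. The sketch below is a simplified stand-in for the real attention layers: `Q` and `K` are assumed query/key matrices of shape (T, d), and the prior depends only on the learned scale σ.

```python
import numpy as np

def prior_association(T, sigma):
    # K_sigma(i, j) = exp(-(i - j)^2 / (2 sigma^2)), row-normalized so each
    # time step gets a probability distribution over its neighbors.
    idx = np.arange(T)
    kernel = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum(axis=1, keepdims=True)

def series_association(Q, K):
    # Standard softmax attention weights: one distribution over all T points per step.
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

def kl_discrepancy(P_prior, P_series, eps=1e-8):
    # L_KL = (1/T) * sum_t KL(P_prior_t || P_series_t)
    kl = (P_prior * (np.log(P_prior + eps) - np.log(P_series + eps))).sum(axis=1)
    return kl.mean()
```

For example, `kl_discrepancy(prior_association(50, sigma=2.0), series_association(Q, K))` with random (50, 8) matrices yields a positive scalar loss.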
Anomaly Transformer Architecture
The overall architecture follows the standard Transformer encoder stack, but replaces each self‑attention block with an Anomaly‑Attention block that computes prior‑ and series‑associations in parallel. Directly maximizing the association discrepancy, however, would drive σ → 0, collapsing the Gaussian prior into a trivial point mass at each time step. To prevent this, a minimax training strategy is introduced.
Minimax Training Strategy
The training alternates between two phases:
Minimize phase: the prior‑association is encouraged to approximate the series‑association learned from the raw sequence, allowing the Gaussian kernel to adapt to diverse temporal patterns.
Maximize phase: the series‑association is optimized to enlarge the KL discrepancy with the prior, forcing attention toward non‑adjacent horizons.
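One alternating round can be sketched in a deliberately simplified setting: the prior is parameterized only by a scalar σ, and the series-association directly by a logit matrix S. Both are illustrative assumptions; in the actual model these quantities are produced by the full Transformer and updated by backpropagation rather than the toy updates below.

```python
import numpy as np

def prior_assoc(T, sigma):
    idx = np.arange(T)
    kernel = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum(axis=1, keepdims=True)

def series_assoc(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(P, Q, eps=1e-8):
    return (P * (np.log(P + eps) - np.log(Q + eps))).sum(axis=1).mean()

def minimax_round(sigma, S, lr=0.05, h=1e-4):
    T = S.shape[0]
    # Minimize phase: adapt sigma so the prior approaches the *frozen*
    # series-association (scalar finite-difference gradient descent).
    Q = series_assoc(S)
    grad_sigma = (mean_kl(prior_assoc(T, sigma + h), Q)
                  - mean_kl(prior_assoc(T, sigma - h), Q)) / (2.0 * h)
    sigma = max(sigma - lr * grad_sigma, 1e-2)  # keep sigma positive
    # Maximize phase: push the series-association away from the *frozen* prior.
    # Per row, the gradient of KL(P_t || softmax(S_t)) w.r.t. S_t is
    # softmax(S_t) - P_t, so gradient *ascent* enlarges the discrepancy.
    P = prior_assoc(T, sigma)
    S = S + lr * (series_assoc(S) - P)
    return sigma, S
```

Freezing one side in each phase mirrors the stop-gradient behavior of the alternating schedule: each association only "sees" a fixed opponent.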
Unsupervised representation learning uses a reconstruction loss ℒ_rec (e.g., mean‑squared error between input and decoder output). The total objective combines reconstruction and discrepancy terms:
ℒ = ℒ_rec + λ·ℒ_KL, where λ balances the reconstruction and anomaly‑specific objectives.
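The combined objective is a one-liner given the two terms. In this sketch `lam = 0.5` is an illustrative value, not a setting from the paper, and the association matrices are assumed to be precomputed row-stochastic arrays.

```python
import numpy as np

def total_loss(x, x_hat, P_prior, P_series, lam=0.5, eps=1e-8):
    # L = L_rec + lambda * L_KL
    l_rec = np.mean((x - x_hat) ** 2)  # reconstruction term (MSE)
    l_kl = (P_prior * (np.log(P_prior + eps)
                       - np.log(P_series + eps))).sum(axis=1).mean()
    return l_rec + lam * l_kl
```

Both terms are non-negative, so the loss bottoms out at zero only when the reconstruction is perfect and the two associations coincide.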
Experimental Evaluation
The model was evaluated on five benchmark datasets covering service monitoring, aerospace telemetry, and other domains. Across all datasets the Anomaly Transformer achieved state‑of‑the‑art performance, outperforming prior methods on standard metrics such as precision, recall, and F1‑score.
Ablation studies compared (i) the full Minimax strategy versus training without the maximize phase, (ii) using a fixed Gaussian kernel versus the learnable prior, and (iii) removing the KL‑based anomaly criterion. Each ablation reduced performance, confirming the contribution of the training schedule, the adaptive prior, and the association‑discrepancy score.
References
Paper: https://arxiv.org/abs/2110.02642
Code repository: https://github.com/thuml/Anomaly-Transformer
Network Intelligence Research Center (NIRC)
NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology stack spanning four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems) and is dedicated to solving real-world problems, building top-tier systems, publishing high-impact papers, and advancing China's network technology.