Detecting Time‑Series Anomalies in Embedding Space: A Practical AI Approach

This article presents an embedding‑based method for time‑series anomaly detection in security and anti‑cheat scenarios, explains how to vectorise logs, sample and compute distribution features, details implementation code, and validates the approach with four synthetic experiments showing precision‑recall improvements at day and hour granularity.

Baidu Geek Talk

Background

In security and anti‑cheat scenarios, detecting abnormal traffic and user behavior is a basic requirement. Traditional methods aggregate metrics such as page views (PV), unique visitors (UV), and failure rate over time, then model the distribution of each metric to spot deviations.

Traditional Dimensional Monitoring

One common practice is to split monitoring dimensions (e.g., source channel × operation type) into many discrete metrics, collect 30‑day historical data, and train models like EllipticEnvelope for each dimension. This improves sensitivity but suffers from two limitations: dimensions must be discrete and enumerated, and the granularity must balance sensitivity against noise.
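
The per‑dimension setup described above can be sketched roughly as follows. This is an illustration rather than the article's code: to stay dependency‑free it swaps EllipticEnvelope for a plain mean and standard‑deviation check, and the channel names, operation names, history values and `z_limit` threshold are all made up.

```python
from itertools import product
from statistics import mean, stdev

# Hypothetical dimension values; a real system must enumerate its own.
CHANNELS = ["web", "app"]
OPERATIONS = ["login", "search"]

def fit_baselines(history):
    """history maps (channel, operation) -> list of daily PV counts."""
    return {key: (mean(values), stdev(values)) for key, values in history.items()}

def is_anomalous(baselines, key, value, z_limit=3.0):
    """Flag a value more than z_limit standard deviations from the mean."""
    mu, sigma = baselines[key]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit

# One baseline per enumerated combination: 2 x 2 = 4 models here.
history = {
    key: [1000 + 10 * (day % 7) for day in range(30)]  # synthetic 30-day history
    for key in product(CHANNELS, OPERATIONS)
}
baselines = fit_baselines(history)
print(is_anomalous(baselines, ("web", "login"), 5000))
```

The number of models grows with the product of the dimension sizes, which is the enumeration burden the embedding approach tries to avoid.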

Embedding‑Based Anomaly Detection

The article proposes treating each log entry as an independent sample and mapping it into a high‑dimensional embedding space via vectorisation (for example, a one‑hot‑like encoding that counts the characters of the UserAgent string and normalises the result), then analysing how the samples are distributed in that space.

If the current sample’s distribution differs from historical distributions, it is considered anomalous. The approach avoids explicit business‑dimension splitting.
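
One simple way to picture "the distribution differs from history" is to measure how many current samples have no historical neighbour nearby. The sketch below is only an illustration under assumptions: the source does not specify this statistic, and `novelty_rate`, the toy 2‑D points and the `radius` value are hypothetical.

```python
import math

def nearest_distance(point, reference):
    """Euclidean distance from point to its closest neighbour in reference."""
    return min(math.dist(point, other) for other in reference)

def novelty_rate(current, history, radius):
    """Fraction of current samples with no historical sample within radius."""
    far = sum(1 for p in current if nearest_distance(p, history) > radius)
    return far / len(current)

# Toy 2-D embeddings; the article's vectors are 128-dimensional.
history = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0)]
normal_slice = [(0.05, 0.05), (1.05, 1.0)]
odd_slice = [(0.05, 0.05), (5.0, 5.0), (5.1, 5.0)]

print(novelty_rate(normal_slice, history, radius=0.5))
print(novelty_rate(odd_slice, history, radius=0.5))
```

A slice whose samples all sit close to historical points scores zero, while a slice containing a new cluster scores higher, without any business dimension being named.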

Embedding Construction

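import numpy as np

def to_vector(ua):
    """Map a UA string (or a list/tuple of them) to a unit-length 128-dim character-count vector."""
    if isinstance(ua, (list, tuple)):
        # Vectorise each element of a batch recursively.
        return [to_vector(c) for c in ua]
    vec = np.zeros(128)
    for c in ua:
        vec[ord(c) % 128] += 1  # most characters are ASCII; others wrap around
    l2 = np.sqrt(np.sum(vec * vec))
    if l2 != 0:
        vec /= l2  # L2-normalise so UAs of different lengths are comparable
    return vec.tolist()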
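
To see what this encoding buys, the following sketch compares a few user agents by cosine similarity. It restates the vectoriser without NumPy so the snippet stands alone; `char_vector`, `cosine` and the three example UA strings are illustrative and do not come from the source.

```python
import math

def char_vector(ua, dims=128):
    """L2-normalised character-count vector, mirroring to_vector above."""
    vec = [0.0] * dims
    for c in ua:
        vec[ord(c) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

chrome_119 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/119.0.0.0 Safari/537.36"
chrome_120 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36"
script_ua = "python-requests/2.31.0"

print(cosine(char_vector(chrome_119), char_vector(chrome_120)))
print(cosine(char_vector(chrome_119), char_vector(script_ua)))
```

Two browser UAs that differ only in a version number end up almost parallel, while a scripting client lands much further away. Because the encoding ignores character order, it is a coarse signal rather than a semantic embedding.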

Data Preparation

Approximately 1.6 million user‑agent logs from the last 30 days are vectorised (128‑dimensional) and stored in a vector database (e.g., Chroma) keyed by day and hour.

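# days, hours, name_prefix and chroma_client are assumed to be defined earlier.
for day in days:
    for hour in hours:
        # One collection per (day, hour) slice, named <prefix>_<YYYYMMDD>_<HH>.
        collection = chroma_client.get_or_create_collection(
            name=f"{name_prefix}_{day.strftime('%Y%m%d')}_{hour:02d}"
        )
        # upsert ids, documents, metadatas, embeddings …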
Sampling and Feature Extraction

For each time slice, 100 random UA samples are drawn, their nearest neighbours are queried, and statistical features such as PV sum, mean distance, and density metrics are computed. These features are aggregated per day and per UA.

AREA_EXP = [0, 2, 8]
MODEL_FIELDS = ["pv", "dist"] + [f"dens_{i}" for i in AREA_EXP] + ["dens_s"]
# aggregation code …
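
The sampling step described above might look roughly like the sketch below. It is written under several assumptions: a brute‑force search stands in for the vector‑database query, the `radii` values are arbitrary and are not claimed to correspond to `AREA_EXP` (whose meaning the source does not spell out), and `knn` and `slice_features` are hypothetical names.

```python
import math
import random

def knn(query, points, k):
    """Brute-force k nearest neighbours as a list of (distance, index)."""
    ranked = sorted((math.dist(query, p), i) for i, p in enumerate(points))
    return ranked[:k]

def slice_features(points, pvs, n_samples=100, k=5, radii=(0.2, 1.0), seed=0):
    """Average neighbourhood features over randomly drawn query points."""
    rng = random.Random(seed)
    picks = rng.sample(range(len(points)), min(n_samples, len(points)))
    totals = {"pv": 0.0, "dist": 0.0, **{f"dens_{r}": 0.0 for r in radii}}
    for idx in picks:
        # The query point is itself in `points`, so it appears at distance 0.
        neighbours = knn(points[idx], points, k)
        totals["pv"] += sum(pvs[i] for _, i in neighbours)
        totals["dist"] += sum(d for d, _ in neighbours) / len(neighbours)
        for r in radii:
            # Density proxy: share of the k neighbours lying within radius r.
            totals[f"dens_{r}"] += sum(d <= r for d, _ in neighbours) / len(neighbours)
    return {name: value / len(picks) for name, value in totals.items()}

# Toy slice: three clustered points and one outlier, with made-up PV counts.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (3.0, 3.0)]
pvs = [10, 20, 30, 1]
print(slice_features(points, pvs, n_samples=4, k=2))
```

Each time slice is thereby reduced to a short, fixed‑length feature vector that can be compared against the same features computed on historical slices.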

Experiments

Four synthetic anomaly scenarios were created:

1. Random UA with PV spikes (10%–1000%).
2. Partially similar UA with PV growth.
3. Generated similar UA with proportional PV increase.
4. Generated similar UA with no overall PV growth.
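
A loose sketch of how such synthetic anomalies could be injected is shown below. The source does not give its generation procedure, so `random_ua`, `similar_ua`, `inject`, the alphabet and the mutation rule are all assumptions made for illustration.

```python
import random
import string

def random_ua(rng, length=40):
    """Random-UA style anomaly: a string of random printable characters."""
    alphabet = string.ascii_letters + string.digits + " /.;()"
    return "".join(rng.choice(alphabet) for _ in range(length))

def similar_ua(rng, base, n_edits=3):
    """Similar-UA style anomaly: overwrite a few characters of an existing UA."""
    chars = list(base)
    for pos in rng.sample(range(len(chars)), min(n_edits, len(chars))):
        chars[pos] = rng.choice(string.digits)
    return "".join(chars)

def inject(pv_by_ua, new_uas, growth, keep_total=False):
    """Add new UAs carrying growth x the original total PV.

    With keep_total=True the original counts are scaled down so the overall
    PV is unchanged, as in the no-growth scenario.
    """
    if keep_total and growth >= 1:
        raise ValueError("growth must be < 1 when keep_total=True")
    total = sum(pv_by_ua.values())
    extra = total * growth
    scale = (total - extra) / total if keep_total else 1.0
    result = {ua: pv * scale for ua, pv in pv_by_ua.items()}
    for ua in new_uas:
        result[ua] = result.get(ua, 0.0) + extra / len(new_uas)
    return result

rng = random.Random(1)
base = {"Mozilla/5.0 A": 100.0, "Mozilla/5.0 B": 300.0}
spiked = inject(base, [random_ua(rng)], growth=0.5)
flat = inject(base, [similar_ua(rng, "Mozilla/5.0 A")], growth=0.5, keep_total=True)
print(sum(spiked.values()), sum(flat.values()))
```

The `keep_total` case is the interesting one: total PV stays flat, so a volume‑only monitor sees nothing, while the composition of the traffic has changed.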

For each scenario the method was evaluated at day‑level and hour‑level detection, and precision‑recall curves were plotted. The article includes several charts illustrating confidence‑score distributions and PR results.
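
For readers who want to reproduce this kind of evaluation, precision and recall at a given confidence threshold can be computed as in the sketch below. The scores and labels are made‑up placeholders, not the article's results, and the convention of returning 1.0 when a denominator is zero is a choice made for this example.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold are flagged anomalous."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    # Convention for this sketch: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def pr_curve(scores, labels):
    """(threshold, precision, recall) for every distinct score, ascending."""
    return [(t, *precision_recall(scores, labels, t)) for t in sorted(set(scores))]

scores = [0.1, 0.2, 0.4, 0.7, 0.9]  # made-up confidence scores
labels = [0, 0, 1, 0, 1]            # 1 = injected anomaly
for row in pr_curve(scores, labels):
    print(row)
```

Sweeping the threshold this way for the day‑level and hour‑level scores separately gives one curve per granularity.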

Key Findings

Embedding‑based detection achieves clear separation between anomalous and normal samples in both day and hour granularity.

Precision‑recall improves with higher thresholds, and hour‑level detection is more sensitive to short‑term spikes.

The approach works without explicit dimension enumeration, simplifying monitoring pipelines.

Conclusion and Outlook

Experiments confirm the feasibility of using high‑dimensional embeddings for time‑series anomaly detection in security contexts. Future work should address practical engineering concerns such as sampling‑point selection, embedding quality, distance metrics, and real‑time deployment.

Open challenges include rapid localisation of anomalous samples, automated impact assessment, and generation of actionable defence strategies.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
