Detecting Time‑Series Anomalies in Embedding Space: A Practical AI Approach
This article presents an embedding-based method for time-series anomaly detection in security and anti-cheat scenarios. It explains how to vectorise logs, sample them, and compute distribution features; walks through the implementation code; and validates the approach with four synthetic experiments, reporting precision-recall results at day and hour granularity.
Background
In security and anti‑cheat scenarios, detecting abnormal traffic and user behavior is a basic requirement. Traditional methods aggregate metrics such as PV, UV, and failure rate over time and model their distribution to spot deviations.
Traditional Dimensional Monitoring
One common practice is to split monitoring dimensions (e.g., source channel × operation type) into many discrete metrics, collect 30‑day historical data, and train models like EllipticEnvelope for each dimension. This improves sensitivity but suffers from two limitations: dimensions must be discrete and enumerated, and the granularity must balance sensitivity against noise.
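The per-dimension practice can be sketched as follows. This is a minimal illustration only: the channel and operation names, the synthetic history, and the threshold are hypothetical, and a plain sample covariance with a Mahalanobis distance stands in for scikit-learn's EllipticEnvelope, which fits a robust covariance estimate instead.

```python
import math
from itertools import product

# Hypothetical monitoring dimensions; every combination needs its own model.
CHANNELS = ["app", "web"]
OPERATIONS = ["login", "pay"]

def fit_gaussian_2d(rows):
    """Return the mean and inverse sample covariance of 2-D points."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def mahalanobis(point, mean, inv):
    """Mahalanobis distance of a 2-D point from the fitted Gaussian."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)

# Synthetic 30-day history of (PV, failure rate) per dimension, for illustration.
history = {
    dim: [(1000 + 10 * (d % 7), 0.02 + 0.001 * (d % 5)) for d in range(30)]
    for dim in product(CHANNELS, OPERATIONS)
}
models = {dim: fit_gaussian_2d(rows) for dim, rows in history.items()}

THRESHOLD = 3.0  # assumed cut-off on the Mahalanobis distance

def is_anomalous(dim, point):
    """Flag a (PV, failure rate) observation for one monitoring dimension."""
    mean, inv = models[dim]
    return mahalanobis(point, mean, inv) > THRESHOLD
```

The number of models grows with the product of the dimension sizes, which is the enumeration burden described above.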
Embedding‑Based Anomaly Detection
The article proposes treating each log entry as an independent sample, mapping it into a high-dimensional embedding space via vectorisation (e.g., a normalised character-count encoding of the User-Agent string, similar in spirit to one-hot encoding), and then analysing the distribution of samples.
If the distribution of the current samples differs from the historical distributions, the current time slice is considered anomalous. The approach avoids explicit business-dimension splitting.
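As a rough sketch of the comparison step, the snippet below scores one summary feature of the current slice against its values on earlier days. It assumes the distribution has already been reduced to a single number per slice (here, a mean neighbour distance); the function name, the sample values, and the use of a median/MAD robust z-score are illustrative choices, not the article's actual model.

```python
import statistics

def deviation_score(history, current):
    """Robust z-score of `current` against `history` using median and MAD."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        # All historical values are (nearly) identical; any change is extreme.
        return 0.0 if current == med else float("inf")
    # 1.4826 rescales the MAD to match the standard deviation of a normal distribution.
    return abs(current - med) / (1.4826 * mad)

# Hypothetical mean neighbour distances for ten historical slices.
past_mean_dist = [0.21, 0.19, 0.20, 0.22, 0.18, 0.20, 0.21, 0.19, 0.20, 0.23]

normal_score = deviation_score(past_mean_dist, 0.21)
shifted_score = deviation_score(past_mean_dist, 0.45)
```

A slice whose score exceeds a chosen cut-off (3 is a common starting point) would be treated as a candidate anomaly.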
Embedding Construction
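import numpy as np

def to_vector(ua):
    # A list or tuple of UA strings is vectorised element by element.
    if isinstance(ua, (list, tuple)):
        return [to_vector(c) for c in ua]
    else:
        # Count characters into 128 buckets, then L2-normalise the counts.
        vec = np.zeros(128)
        for c in ua:
            vec[ord(c) % 128] += 1  # most characters are ASCII
        l2 = np.sqrt(np.sum(vec * vec))
        if l2 != 0:
            vec /= l2
        return vec.tolist()
Data Preparation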
Approximately 1.6 million user‑agent logs from the last 30 days are vectorised (128‑dimensional) and stored in a vector database (e.g., Chroma) keyed by day and hour.
for day in days:
    for hour in hours:
        collection = chroma_client.get_or_create_collection(
            name="{}_{}_{}".format(name_prefix, day.strftime("%Y%m%d"), f"{hour:02d}")
        )
        # upsert ids, documents, metadatas, embeddings …
Sampling and Feature Extraction
For each time slice, 100 random UA samples are drawn, their nearest neighbours are queried, and statistical features such as PV sum, mean distance, and density metrics are computed. These features are aggregated per day and per UA.
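The sampling step described above might look roughly like the sketch below, written in plain Python so it runs without a vector database. Several things here are assumptions rather than the article's definitions: neighbours are looked up in a historical slice by brute force, `embed` mirrors the article's character-count vectorisation, and "density" is taken to mean the number of historical points within a fixed radius. The names `slice_features`, `k`, and `radius` are hypothetical.

```python
import math
import random

def embed(ua, dim=128):
    """L2-normalised character-count vector for one UA string."""
    vec = [0.0] * dim
    for ch in ua:
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def slice_features(history, current, n_samples=100, k=3, radius=0.5, seed=0):
    """Compute PV sum, mean k-nearest-neighbour distance, and mean density.

    `history` and `current` are non-empty lists of (ua, pv) pairs.
    """
    rng = random.Random(seed)
    sample = rng.sample(current, min(n_samples, len(current)))
    hist_vecs = [embed(ua) for ua, _ in history]
    pv_sum, dists, dens = 0, [], []
    for ua, pv in sample:
        v = embed(ua)
        d = sorted(math.dist(v, h) for h in hist_vecs)  # brute-force neighbour search
        nearest = d[:k]
        pv_sum += pv
        dists.append(sum(nearest) / len(nearest))
        dens.append(sum(1 for x in d if x <= radius))
    n = len(sample)
    return {"pv": pv_sum, "dist": sum(dists) / n, "dens": sum(dens) / n}

# Tiny illustrative slices, not real traffic.
history = [
    ("Mozilla/5.0 (Windows NT 10.0)", 10),
    ("Mozilla/5.0 (Macintosh)", 5),
    ("curl/7.88.1", 2),
]
normal = slice_features(history, history, k=1)
odd = slice_features(history, [("zzzzzzzzqqqq", 50)], k=1)
```

A slice made of previously seen UAs sits at distance zero from its neighbours, while an unfamiliar UA lands far from every historical point and in a sparse region.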
AREA_EXP = [0, 2, 8]
MODEL_FIELDS = ["pv", "dist"] + [f"dens_{i}" for i in AREA_EXP] + ["dens_s"]
# aggregation code …
Experiments
Four synthetic anomaly scenarios were created:
1. Random UA with PV spikes (10%–1000%).
2. Partially similar UA with PV growth.
3. Generated similar UA with proportional PV increase.
4. Generated similar UA with no overall PV growth.
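To make the first scenario concrete, here is one way such an anomaly could be injected into a slice. This is a sketch under assumptions: the spike is interpreted as extra PV equal to a fraction of the slice's total PV, carried by a single randomly generated UA string, and the helper name and baseline data are hypothetical. The article does not state how its synthetic data was produced.

```python
import random
import string

def inject_random_ua_spike(slice_pv, growth, seed=0, ua_len=40):
    """Return a copy of `slice_pv` plus one random UA carrying growth * total PV.

    `slice_pv` maps UA string -> PV; `growth` of 0.1 means +10%, 10.0 means +1000%.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + " /.;()"
    fake_ua = "".join(rng.choice(alphabet) for _ in range(ua_len))
    extra_pv = round(sum(slice_pv.values()) * growth)
    injected = dict(slice_pv)
    injected[fake_ua] = injected.get(fake_ua, 0) + extra_pv
    return injected, fake_ua

# Illustrative baseline slice with a total PV of 1000.
baseline = {"Mozilla/5.0 (Windows NT 10.0)": 700, "curl/7.88.1": 300}
small, small_ua = inject_random_ua_spike(baseline, growth=0.1)
large, large_ua = inject_random_ua_spike(baseline, growth=10.0)
```

Fixing the seed keeps the generated UA reproducible, which makes it easier to compare detection results across growth levels.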
For each scenario the method was evaluated at day‑level and hour‑level detection, and precision‑recall curves were plotted. The article includes several charts illustrating confidence‑score distributions and PR results.
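For readers who want to reproduce this kind of evaluation, a precision-recall point at a given threshold can be computed as in the sketch below. The scores and labels are made-up values for illustration, and returning a precision of 1.0 when nothing is flagged is a convention chosen here, not something taken from the article.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as anomalous."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical confidence scores and ground-truth labels (1 = injected anomaly).
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.65]
labels = [0, 0, 1, 1, 1, 0]

# Sweep thresholds to trace out the curve.
curve = [(t, *precision_recall(scores, labels, t)) for t in (0.3, 0.5, 0.7)]
```

Sweeping the threshold over the observed scores and plotting recall against precision gives the kind of curve the article refers to.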
Key Findings
Embedding-based detection achieves clear separation between anomalous and normal samples at both day and hour granularity.
Precision‑recall improves with higher thresholds, and hour‑level detection is more sensitive to short‑term spikes.
The approach works without explicit dimension enumeration, simplifying monitoring pipelines.
Conclusion and Outlook
Experiments confirm the feasibility of using high‑dimensional embeddings for time‑series anomaly detection in security contexts. Future work should address practical engineering concerns such as sampling‑point selection, embedding quality, distance metrics, and real‑time deployment.
Open challenges include rapid localisation of anomalous samples, automated impact assessment, and generation of actionable defence strategies.
