Detecting API Anomalous Traffic with Big Data and Machine Learning

This article outlines a comprehensive approach to API anomaly detection, covering background, objectives, a two‑layer framework with offline and real‑time feature pipelines, threshold profiling, detection methods, strategy types, and operational practices to mitigate data leakage and abuse.

Background

APIs are critical to enterprise data flow, but they have become a major attack surface as attackers exploit them to disrupt systems and steal data. This article approaches API anomaly detection from a traffic‑analysis perspective.

Objectives

Asset discovery: inventory all APIs, map risk scenarios, and understand overall threat posture.

Capability building: develop detection and mitigation abilities tailored to business contexts and API risk categories.

Risk prevention: protect sensitive data and business‑critical interfaces from leakage, scraping, and other abuses.

The focus is on sensitive data and business‑related scenarios, using big‑data analytics, machine learning, and statistical methods to build capabilities such as API asset management, sensitive‑data leakage detection, business threat protection, and security event forensics.

Solution Overview

The proposed solution consists of an overall framework and an event‑response SOP (see Fig. 1 and Fig. 2). The practice workflow (Fig. 3) follows four main steps:

Scenario mining – identify and prioritize high‑risk APIs based on data sensitivity.

Feature engineering – construct offline and real‑time feature pipelines.

Threshold profiling – generate per‑API usage baselines and anomaly thresholds.

Daily operation – detect, respond, and remediate anomalies.

Feature Engineering

Two parallel pipelines are built (minimal sketches of both follow this list):

Offline pipeline: Batch processing on a Spark‑based data platform (Hive) to aggregate, clean, and extract features for downstream analysis and model training. Suitable for long‑term trends and low‑frequency anomalies.

Real‑time pipeline: Stream processing with Flink reading from Kafka, extracting real‑time features, and feeding them to a rule engine for fast decision‑making. Ideal for high‑frequency, short‑term anomalies.
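To make the offline pipeline concrete, here is a minimal PySpark sketch. The table and column names (api_access_log, api_path, user_id, client_ip, resp_bytes, ts) are illustrative assumptions, not the actual schema:

```python
# Offline feature pipeline sketch (PySpark over Hive). Table and column
# names are illustrative assumptions, not the authors' actual schema.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("api-offline-features")
    .enableHiveSupport()
    .getOrCreate()
)

logs = spark.table("api_access_log")  # hypothetical raw access-log table

# Aggregate per-API, per-user daily features for model training.
features = (
    logs.withColumn("day", F.to_date("ts"))
    .groupBy("api_path", "user_id", "day")
    .agg(
        F.count("*").alias("req_count"),                 # request volume
        F.countDistinct("client_ip").alias("ip_count"),  # distinct source IPs
        F.avg("resp_bytes").alias("avg_resp_bytes"),     # mean response size
    )
)

features.write.mode("overwrite").saveAsTable("api_features_daily")
```

The real-time side in the article runs on Flink consuming from Kafka; the sketch below is a simplified stand-in using kafka-python and an in-memory sliding window to show the same per-API counting logic. The topic name, broker address, and JSON message format are all assumptions:

```python
# Real-time feature sketch: a kafka-python stand-in for the Flink job.
import json
import time
from collections import defaultdict, deque

from kafka import KafkaConsumer

WINDOW_SECONDS = 300                 # 5-minute sliding window
windows = defaultdict(deque)         # api_path -> timestamps of recent hits

consumer = KafkaConsumer("api-access-log", bootstrap_servers="localhost:9092")
for msg in consumer:
    event = json.loads(msg.value)    # assumed JSON payload with an api_path field
    now = time.time()
    q = windows[event["api_path"]]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:   # evict expired events
        q.popleft()
    rate = len(q)                    # requests on this API in the last 5 minutes
    # `rate` would be handed to the rule engine for threshold comparison
```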

Threshold Profiling

For each API, short‑term access frequency (e.g., request counts within an N‑minute window) is measured and clustered using unsupervised algorithms such as DBSCAN or OneClassSVM. The minimum value among the detected outliers defines the anomaly threshold for that interval, and the maximum of these thresholds across intervals serves as the reference threshold for the period, improving robustness against transient traffic spikes.
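A minimal sketch of this profiling step, assuming scikit-learn's DBSCAN and toy per-interval counts; the eps and min_samples values are illustrative, not tuned:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def interval_threshold(counts, eps=5.0, min_samples=5):
    """Smallest DBSCAN outlier among one interval's observed counts, or None."""
    X = np.asarray(counts, dtype=float).reshape(-1, 1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    outliers = X[labels == -1]          # DBSCAN labels noise points as -1
    if outliers.size == 0:
        return None                     # no outliers observed in this interval
    # In practice, low-side outliers (traffic dips) would be filtered out first.
    return float(outliers.min())

# Toy data: observed request counts for two window settings of one API.
per_interval_counts = [
    [12, 15, 11, 14, 13, 90],
    [10, 13, 12, 11, 120, 14],
]
thresholds = [t for t in map(interval_threshold, per_interval_counts) if t is not None]
period_threshold = max(thresholds) if thresholds else None  # reference for the period
print(period_threshold)                 # -> 120.0
```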

Daily Operation

Detection techniques include log and traffic analysis, user‑behavior modeling, time‑series analysis, and threat‑intelligence cross‑validation, among others. Typical anomaly types are:

Crawler traffic (pricing bots, low‑frequency captcha‑bypass bots).

Parameter‑forgery traffic (fake credentials or devices).

Medium/high‑frequency traffic deviating from normal baselines.

Strategies are categorized as:

Rule‑based: static thresholds, statistical checks.

Model‑based: unsupervised models (Isolation Forest, OneClassSVM) and clustering (K‑Means, DBSCAN); see the sketch after this list.

Baseline: historical behavior modeling to detect deviations.
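As an illustration of the model‑based category, here is a minimal Isolation Forest sketch over toy per‑user‑day features. The feature layout is an assumption; real features would come from the pipelines above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Toy training data: [req_count, distinct_ip_count, avg_resp_bytes] per user-day.
normal = rng.normal(loc=[100.0, 3.0, 2048.0], scale=[20.0, 1.0, 300.0], size=(500, 3))

model = IsolationForest(contamination=0.01, random_state=42).fit(normal)

suspect = np.array([[5000.0, 1.0, 128.0]])   # heavy scraping from a single source
print(model.predict(suspect))                # -1 flags an anomaly, 1 means normal
```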

Disposition mechanisms operate in two latency tiers: near‑real‑time (5–10 minutes) and offline (≈2 hours). Automated actions are executed through WAF rule engines, while semi‑automatic and manual cases require analysts to issue policies by hand.
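A hypothetical dispatcher showing how the two tiers might be wired together; block_via_waf and open_ticket are placeholder stubs standing in for the WAF rule engine and the manual‑review workflow, not a real API:

```python
# Placeholder stubs for the WAF rule engine and manual policy issuance.
def block_via_waf(ip: str, ttl_minutes: int) -> None:
    print(f"WAF: blocking {ip} for {ttl_minutes} minutes")

def open_ticket(anomaly: dict) -> None:
    print(f"ticket opened for manual policy issuance: {anomaly}")

def dispose(anomaly: dict) -> None:
    # Near-real-time, high-confidence hits go straight to the WAF;
    # everything else is routed to an analyst for manual handling.
    if anomaly["tier"] == "near_real_time" and anomaly["confidence"] >= 0.9:
        block_via_waf(anomaly["client_ip"], ttl_minutes=30)
    else:
        open_ticket(anomaly)

dispose({"tier": "near_real_time", "confidence": 0.95, "client_ip": "203.0.113.7"})
```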

FAQ Highlights

Can all APIs be covered? In practice, full coverage is impossible due to scale, model error, and resource limits, but continuous monitoring of new APIs and iterative model improvement can maximize coverage.

How are anomalies judged? Offline analysis captures low‑frequency, stealthy attacks; near‑real‑time/real‑time layers detect significant deviations using threshold profiling and unsupervised learning.

Where does most manpower go? Early stages require heavy effort in data cleaning and feature engineering; later stages focus on strategy operation and scenario mining.

Tags: big data, real-time processing, anomaly detection, threshold modeling
Written by the Huolala Safety Emergency Response Center, the official public account of LLSRC.