How We Built a Self‑Learning API Classification System for Security

This article details a real‑world case study of how a large logistics platform created an automated, self‑evolving API asset‑classification pipeline—covering data collection, feature engineering, model training with CatBoost, confidence‑based label feedback, and lessons learned—to improve API security monitoring and reduce manual labeling effort.

Background (Why)

Huolala's production environment serves millions of domains and API calls across public, internal, and third-party scenarios. Every API is a potential security risk, so the security team needed not only to protect APIs but first to identify *what* to protect by assigning business tags such as registration, login, or order processing.

Current Situation (What)

Given an API's metadata (URL, method, request/response headers and bodies), the goal is to predict its business label. This is a multi-label classification problem that suffers from sparse labels and class imbalance. Initial labeling relied on manual effort, which became unsustainable as the number of APIs grew. A typical API record looks like this:

url: https://uapi.example.com/v1/user/login
method: POST
host: uapi.example.com
api: uapi.example.com/?_m=login
request_headers: {"Content-Type":"application/json",...}
request_body: {"username":"xxx","password":"xxx"}
response_body: {"code":0,"msg":"success","token":"..."}
response_headers: {"Set-Cookie":"...","Server":"nginx"}

Typical challenges include scarce labeled samples and severe class imbalance. To address this, an automated, self-iterating classification system was built, combining iterative feature engineering with a self-learning label-feedback loop, ultimately tagging ~10,000 APIs at >95% confidence and gaining five percentage points of accuracy.

Practice (How)

Feature Engineering: From "Data Talks" to "Knowledge Guides"

v1 – Brute-Force Key Enumeration

All unique JSON keys from URL, headers, and bodies were extracted and one‑hot encoded, producing a high‑dimensional sparse vector for each API.
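
A minimal sketch of that v1 encoding, assuming the per-API key sets have already been collected from the URL, headers, and bodies (the sample keys are illustrative):

from sklearn.feature_extraction import DictVectorizer

# Each API is reduced to the set of distinct keys it uses; DictVectorizer
# then one-hot encodes the union of keys across all APIs into one sparse
# column per key, which is where the huge dimensionality comes from.
apis = [
    {"username": 1, "password": 1, "token": 1},   # login-like API
    {"order_id": 1, "status": 1, "code": 1},      # order-like API
]
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(apis)                # shape: (n_apis, n_distinct_keys)
print(vectorizer.get_feature_names_out())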

Problems with v1:

Feature dimension equals the total number of distinct keys (tens of thousands), many of which are noise (e.g., generic keys like data or code).

New APIs introduce unseen keys, causing poor generalization.

Model accuracy exceeded 80%, but the F1 score remained low.

v2 – Handcrafted Features + Business Keyword List

To reduce dimensionality and improve generalization, handcrafted features were added:

URL length, depth, number of query parameters.

Counts of special characters ( - _ ? = & / .).

Boolean flags for the presence of business‑related keywords in URL, headers, or body.

import re
from urllib.parse import urlparse, parse_qs

url = "https://uapi.example.com/v1/user/login?invite_code=abc"
parsed = urlparse(url)
path = [p for p in parsed.path.split("/") if p]   # non-empty path segments
query_params = parse_qs(parsed.query)
url_lower = url.lower()

# URL basic statistics
url_length = len(url)                   # total URL length
url_depth = len(path)                   # path depth (number of segments)
url_params_count = len(query_params)    # number of query params

# Special character counts
special_chars = ["-", "_", "?", "=", "&", "/", "."]
special_chars_count = [url.count(c) for c in special_chars]

# Keyword exact-boundary match
keyword = "login"
pattern = r"\b" + re.escape(keyword) + r"\b"
exists = 1 if re.search(pattern, url_lower) else 0
counts = len(re.findall(pattern, url_lower))

A dictionary of business keywords was defined for nine categories (e.g., register: ["register","join","signup",...], login: ["login","auth","logout",...]). This reduced the feature vector to a few hundred dimensions and lifted the minority‑class F1 from ~0.25 to >0.80, with overall accuracy around 90%.
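
A minimal sketch of that keyword table and the per-category boolean features it yields; the register and login word lists come from the article, everything else is illustrative:

import re

# "register" and "login" follow the article; the remaining seven
# categories existed in the real system but are not listed here.
BUSINESS_KEYWORDS = {
    "register": ["register", "join", "signup"],
    "login": ["login", "auth", "logout"],
}

def keyword_flags(url_lower):
    # One boolean feature per business category: does any of its keywords
    # appear in the URL with exact word boundaries?
    flags = {}
    for category, words in BUSINESS_KEYWORDS.items():
        pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
        flags["has_" + category + "_kw"] = int(bool(re.search(pattern, url_lower)))
    return flags

print(keyword_flags("https://uapi.example.com/v1/user/login"))
# {'has_register_kw': 0, 'has_login_kw': 1}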

v3 – Text + Statistical Feature Fusion

To avoid the maintenance cost of keyword lists, each API's request data were serialized into a single text string and passed to CatBoost through its native text_features parameter, which tokenizes the text and automatically builds dictionary-based bag-of-words and n-gram features from it.

Example generated text for a registration API:

v1 user register username password invite_code code msg data token

CatBoost processes this text, while request‑header features that were not indicative of business logic (e.g., is_browser, has_auth) were dropped.
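
A minimal sketch of v3 training, assuming a pandas frame train_df whose api_text column holds the serialized string and whose remaining columns carry the v2 statistics (all column names are illustrative):

from catboost import CatBoostClassifier, Pool

# "api_text" holds the serialized request string; the other columns are
# the statistical features kept from v2.
train_pool = Pool(
    data=train_df[["api_text", "url_length", "url_depth", "url_params_count"]],
    label=train_df["label"],
    text_features=["api_text"],   # CatBoost tokenizes and vectorizes this column itself
)

model = CatBoostClassifier(
    iterations=500,
    loss_function="MultiClass",
    eval_metric="TotalF1",        # F1 mattered more than raw accuracy here
    verbose=100,
)
model.fit(train_pool)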

Self‑Learning Loop: Model Evolution

The loop consists of three steps:

Offline Feature Extraction & Initial Training: Prepare historical labeled data, extract text, keyword, and statistical features, train the model, and deploy offline batch jobs.

Confidence Tiering: Periodically score unlabeled APIs; tags predicted with ≥95% confidence are fed back into the training set as new samples (a sketch follows this list).

High-Precision Label Feedback: For high-confidence predictions, a small random sample (≈5%) is manually verified before being added; low-confidence samples are examined with SHAP to understand feature contributions and correct errors.
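
A minimal sketch of steps 2 and 3, assuming the v3 model above plus a pandas frame unlabeled_df and a feature-column list FEATURES that mirror the training setup (both names are illustrative):

from catboost import Pool

# Score the not-yet-labeled APIs with the current model.
unlabeled_pool = Pool(unlabeled_df[FEATURES], text_features=["api_text"])
proba = model.predict_proba(unlabeled_pool)   # shape: (n_samples, n_classes)
pred = proba.argmax(axis=1)
conf = proba.max(axis=1)

# Step 2: keep only predictions at or above the confidence bar.
confident = conf >= 0.95
pseudo = unlabeled_df[confident].assign(label=pred[confident])

# Step 3: route ~5% of the confident set to manual verification before
# the rest is merged into the training data for the next round.
spot_check = pseudo.sample(frac=0.05, random_state=42)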

Key safeguards include monitoring confusion matrices after each retraining round and pausing label feedback for categories that show increasing misclassification.
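
A minimal sketch of that safeguard, assuming a held-out validation Pool val_pool with true labels y_val, and a per-class error vector prev_err saved from the previous round (all three names are illustrative):

import numpy as np
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predictions; normalizing each row
# turns the off-diagonal mass into a per-class misclassification rate.
cm = confusion_matrix(y_val, model.predict(val_pool).ravel())
per_class_err = 1 - np.diag(cm) / cm.sum(axis=1)

# Pause label feedback for any class whose error grew since last round
# (prev_err is the same vector saved from the previous retraining).
paused = [c for c in range(len(per_class_err)) if per_class_err[c] > prev_err[c]]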

Lessons Learned

1. Pseudo‑Label Trap – Systematic bias in a class can be amplified by high‑confidence pseudo‑labels; solution: stop label feedback for that class when confusion rises.

2. Fixed‑Threshold Illusion – A uniform 0.95 threshold works well for some classes (e.g., login) but not for others (e.g., config_setting); solution: use class‑specific dynamic thresholds.
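
A minimal sketch of such class-specific thresholds (the 0.95 login value is from the article; the other numbers are illustrative):

import numpy as np

# Per-class acceptance thresholds; stricter for classes that mislabel
# easily. Only the login value follows the article.
CLASS_THRESHOLDS = {"login": 0.95, "config_setting": 0.99}
DEFAULT_THRESHOLD = 0.97

def accept_mask(pred_labels, conf):
    # Keep only predictions that clear their own class's threshold.
    bars = np.array([CLASS_THRESHOLDS.get(c, DEFAULT_THRESHOLD) for c in pred_labels])
    return conf >= bars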

3. Too Many Hard Examples – Over‑focusing on difficult samples caused over‑fitting; solution: keep hard‑example proportion ≤10% of the training set.

Outlook

Future work includes automating the retraining trigger, exploring AI‑driven label‑feedback to reduce manual effort, and fine‑tuning lightweight LLMs on the platform’s data.
