How to Auto‑Label 10K APIs with 95% Confidence Using Self‑Learning Feature Engineering

This article presents a detailed case study of how a large‑scale API security team built an automated, self‑learning classification system that tags tens of thousands of APIs with business labels, improves model accuracy by five percentage points, and maintains high precision through a confidence‑driven feedback loop.

Huolala Safety Emergency Response Center

Background (Why)

Huolala’s production environment serves massive numbers of domains and API calls across public, internal, and third‑party scenarios. Each API is a potential security risk, so the security team needed to first identify *what* to protect before deciding *how* to protect it. The goal was to assign business tags (e.g., registration, login, order transaction) to APIs, enabling targeted anomaly detection and baseline alerts.

Current Situation (What)

Given an API’s metadata (URL, method, request/response headers and bodies), the task is to predict its business label—a multi‑label classification problem. A typical API example is shown below:

url: https://uapi.example.com/v1/user/login
method: POST
host: uapi.example.com
api: uapi.example.com/?_m=login
request_headers: {"Content-Type":"application/json",...}
request_body: {"username":"xxx","password":"xxx"}
response_body: {"code":0,"msg":"success","token":"..."}
response_headers: {"Set-Cookie":"...","Server":"nginx"}

The main challenges are sparse labels and class imbalance. Early efforts relied on manual labeling, which is unsustainable for the ever‑growing API catalog.

To address this, the team built an automated, self‑iterating classification system that now tags about 10,000 APIs with confidence ≥ 95 % and improves model accuracy by five percentage points.

Practice (How)

Feature Engineering: From "Data Talks" to "Knowledge Guidance"

v1 – Brutal Key Enumeration

All unique JSON keys from URLs, headers, and bodies were extracted and one‑hot encoded, producing a high‑dimensional sparse vector for each API.

This approach suffered from massive dimensionality, noisy common keys (e.g., data, code), and poor generalization to unseen keys.
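For concreteness, the v1 scheme can be sketched in a few lines of pure Python. The record fields follow the example above; the helper names are ours, not the team's actual code:

```python
import json

def extract_keys(api_record):
    """Collect every JSON key seen in an API's headers and bodies."""
    keys = set()
    for field in ("request_headers", "request_body",
                  "response_headers", "response_body"):
        try:
            obj = json.loads(api_record.get(field, "{}"))
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(obj, dict):
            keys.update(obj.keys())
    return keys

def one_hot(apis):
    """One column per unique key across the catalog; 1 if the API has it."""
    vocab = sorted(set().union(*(extract_keys(a) for a in apis)))
    return vocab, [[1 if k in extract_keys(a) else 0 for k in vocab]
                   for a in apis]
```

With thousands of APIs, `vocab` explodes into tens of thousands of columns, which is exactly the dimensionality problem described above.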

v2 – Manual Construction + Business Keyword Lists

To reduce dimensionality, the team added engineered features such as URL length, depth, number of query parameters, presence of special characters, and boolean/count features. A curated dictionary of business keywords for nine categories was also introduced, with exact‑boundary matching across URL, headers, and bodies.

import re
from urllib.parse import urlparse, parse_qs

parsed = urlparse(url)
path = parsed.path                      # e.g. "/v1/user/login"
query_params = parse_qs(parsed.query)
url_lower = url.lower()

# URL basic statistics
url_length = len(url)                                     # total length
url_depth = len([seg for seg in path.split("/") if seg])  # path depth (non-empty segments)
url_params_counts = len(query_params)                     # number of query params

# Special character counts
special_chars = ["-", "_", "?", "=", "&", "/", "."]
special_chars_count = [url.count(c) for c in special_chars]

# Keyword matching (exact word boundary)
pattern = r'\b' + re.escape(keyword) + r'\b'
exists = 1 if re.search(pattern, url_lower) else 0
counts = len(re.findall(pattern, url_lower))

Keyword dictionary example:

business_keywords = {
    "register": ["register","join","signup","enroll",...],
    "login":    ["login","auth","logout","account"],
    "sms_code": ["send","sms","verification","phone"],
    ...
}
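Putting the dictionary and the boundary matching together, the per‑category features might look like the following sketch. The two‑category mini‑dictionary here is illustrative only; the production lists cover nine categories and far more keywords:

```python
import re

# Illustrative subset of the real dictionary (hypothetical values).
business_keywords = {
    "login": ["login", "auth", "logout"],
    "register": ["register", "signup"],
}

def keyword_features(text):
    """Per-category exists/count features via exact word-boundary matching."""
    text = text.lower()
    feats = {}
    for category, words in business_keywords.items():
        hits = sum(len(re.findall(r"\b" + re.escape(w) + r"\b", text))
                   for w in words)
        feats[f"{category}_exists"] = int(hits > 0)
        feats[f"{category}_count"] = hits
    return feats
```

Applied to the example URL, `keyword_features("https://uapi.example.com/v1/user/login")` lights up the `login` features and leaves `register` at zero.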

Dimensionality dropped from tens of thousands to a few hundred, raising minority‑class F1 from ~0.25 to >0.80 and overall accuracy to ~90 %.

v3 – Mixed Text + Statistical Features

To overcome the maintenance cost of keyword lists, the team leveraged CatBoost’s native text_features to treat serialized API fields as a single text column. Example generated text for a registration API:

v1 user register username password invite_code code msg data token

CatBoost automatically tokenizes, builds TF‑IDF and n‑gram features, and learns semantic patterns without manual dictionaries. Request‑header features that did not convey business semantics (e.g., is_browser, has_auth) were removed.
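The serialization step that produces that text column can be sketched as below (field and function names are our assumptions about the preprocessing, not the team's code); the resulting column is then declared via `text_features` when fitting a `CatBoostClassifier`:

```python
import json

def serialize_api(record):
    """Flatten an API record into one space-separated text field:
    URL path segments followed by request/response JSON keys, in order."""
    tokens = [seg for seg in record.get("url_path", "").split("/") if seg]
    for field in ("request_body", "response_body"):
        try:
            obj = json.loads(record.get(field, "{}"))
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(obj, dict):
            tokens += list(obj.keys())
    return " ".join(tokens)
```

For a registration API like the example above, this yields exactly the "v1 user register username password invite_code code msg data token" string that CatBoost then tokenizes internally.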

Self‑Learning Closed Loop

The iterative loop consists of three stages:

Offline Feature Extraction & Initial Training: Prepare historical labeled data, extract textual, keyword, and statistical features, train the model, and deploy offline batch jobs.

Confidence‑Based Filtering: Periodically predict unlabeled APIs; only predictions with confidence ≥ 95 % are fed back as new training samples. This threshold balances label quality and coverage.

High‑Precision Label Feedback: For high‑confidence samples, a small random subset (≈5 %) is manually verified before inclusion. Low‑confidence samples are examined with SHAP to understand feature contributions and correct mislabeled cases.
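The filtering stage can be sketched as follows. This is a minimal sketch of the loop's shape, assuming per‑class probability rows from the model; the function names and the way the 5 % review sample is drawn are our assumptions:

```python
import random

CONF_THRESHOLD = 0.95
REVIEW_RATE = 0.05  # fraction of accepted pseudo-labels sent for manual check

def filter_pseudo_labels(api_ids, proba_rows, classes, seed=0):
    """Keep only predictions whose top-class probability clears the
    threshold; flag a small random subset of those for manual review."""
    rng = random.Random(seed)
    accepted, review_queue = [], []
    for api_id, probs in zip(api_ids, proba_rows):
        conf = max(probs)
        if conf >= CONF_THRESHOLD:
            label = classes[probs.index(conf)]
            accepted.append((api_id, label, conf))
            if rng.random() < REVIEW_RATE:
                review_queue.append(api_id)
    return accepted, review_queue
```

Accepted samples join the next training round; everything below the threshold stays out of the training set and goes to SHAP‑assisted inspection instead.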

Images illustrating the pipeline and feature flows are omitted for brevity.

Lessons Learned

Pseudo‑Label Trap: Systematic bias (e.g., misclassifying config_setting as login) can be amplified if high‑confidence pseudo‑labels are blindly added. Monitoring confusion matrices and pausing label flow for problematic classes mitigates this.

Fixed Threshold Illusion: A uniform 0.95 threshold works well for strong‑signal classes like login but underperforms for diverse classes like config_setting. Introducing per‑class dynamic thresholds improves overall precision.
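One way to derive such per‑class thresholds is to calibrate them on a held‑out validation set: for each class, take the lowest cutoff whose pseudo‑label precision still meets a target. This is our sketch of the idea, not the team's actual calibration code:

```python
def per_class_thresholds(val_probs, val_true, classes,
                         target_precision=0.95,
                         grid=(0.80, 0.85, 0.90, 0.95, 0.99)):
    """For each class, pick the lowest threshold whose pseudo-label
    precision on the validation set still meets the target."""
    thresholds = {}
    for j, cls in enumerate(classes):
        chosen = max(grid)  # fall back to the strictest cutoff
        for t in sorted(grid):
            picked = [true == cls
                      for probs, true in zip(val_probs, val_true)
                      if probs[j] >= t]
            if picked and sum(picked) / len(picked) >= target_precision:
                chosen = t
                break
        thresholds[cls] = chosen
    return thresholds
```

Strong‑signal classes end up with lower cutoffs (more pseudo‑labels accepted), while noisy classes keep strict ones.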

Too Many Hard Examples: Over‑emphasizing difficult samples caused them to dominate >20 % of the training set, leading to over‑fitting and reduced accuracy on easy cases. Limiting hard‑example proportion to ≤10 % restored balance.
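Enforcing such a cap is a one‑liner worth of arithmetic: with `n_easy` easy samples and a target fraction `f`, at most `f * n_easy / (1 - f)` hard samples may be kept. A minimal sketch (function name and sampling strategy are our assumptions):

```python
import random

def cap_hard_examples(easy, hard, max_hard_frac=0.10, seed=0):
    """Subsample hard examples so they make up at most max_hard_frac
    of the combined training set."""
    rng = random.Random(seed)
    # n_hard / (n_easy + n_hard) <= f  =>  n_hard <= f * n_easy / (1 - f)
    limit = int(max_hard_frac * len(easy) / (1 - max_hard_frac))
    kept = hard if len(hard) <= limit else rng.sample(hard, limit)
    return easy + kept
```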

Takeaways

1. Feature engineering sets the lower bound of model performance, while the self‑learning mechanism defines the upper bound.

2. Domain expertise is crucial: understanding which keywords are strong signals and which request fields are discriminative often yields bigger gains than exhaustive hyper‑parameter tuning.

Future Outlook

The next steps include automating model retraining triggers, exploring AI‑driven workflows to reduce manual intervention in label feedback, and fine‑tuning lightweight LLMs on the company’s platform for richer semantic understanding.

Tags: machine learning, feature engineering, API Security, classification, SHAP, self‑learning, CatBoost
Written by

Huolala Safety Emergency Response Center

Official public account of the Huolala Safety Emergency Response Center (LLSRC)
