Machine Learning Practices for Web Attack Detection in Ctrip's Nile System
This article describes how Ctrip's security team replaced rule‑based web attack detection with a Spark‑powered machine‑learning pipeline, detailing the system architecture, feature engineering using TF‑IDF, model training, evaluation, online deployment, and future enhancements to improve detection accuracy and performance.
Background : Traditional web attack detection relies on rule‑based blacklists, which are hard to maintain, cause false positives/negatives, and degrade performance when rule sets grow.
Problem : Rules are hard to maintain. Overly broad rules raise false alarms, overly narrow rules miss attacks, and slow regex matching lets messages back up in streaming queues such as Kafka.
Nile Architecture Introduction : The initial Nile system filtered >97% of traffic with a whitelist before applying a regex engine, then passed suspicious traffic to an automated vulnerability verification system (Hulk). In version 5, a Spark MLlib machine‑learning engine was added before the regex engine to pre‑filter traffic, improving throughput and detection speed.
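The stage ordering described above can be sketched in a few lines. This is only an illustration under stated assumptions: every name here (`is_whitelisted`, `toy_model`, `regex_engine`, `detect`, the prefixes and the pattern) is hypothetical, the "model" is a stand-in heuristic rather than a trained classifier, and the sketch assumes that requests the ML stage scores as normal skip the regex stage.

```python
import re
from typing import Callable, Iterable, List

# All names below are hypothetical placeholders, not Nile's real components.
WHITELISTED_PREFIXES = ("/static/", "/health")
ATTACK_REGEX = re.compile(r"(union\s+select|<script|eval\()", re.IGNORECASE)


def is_whitelisted(url: str) -> bool:
    """Stage 1: cheap whitelist check that drops known-benign traffic."""
    return url.startswith(WHITELISTED_PREFIXES)


def toy_model(url: str) -> bool:
    """Stage 2 stand-in for the ML engine: flags URLs with unusual characters."""
    return any(ch in url for ch in "'\"<>()")


def regex_engine(url: str) -> bool:
    """Stage 3: the slower regex rules, run only on traffic the model flagged."""
    return ATTACK_REGEX.search(url) is not None


def detect(urls: Iterable[str], model: Callable[[str], bool] = toy_model) -> List[str]:
    """Return the URLs that would be handed to the verification stage."""
    to_verify = []
    for url in urls:
        if is_whitelisted(url):
            continue  # most traffic stops here
        if not model(url):
            continue  # ML pre-filter: scored as normal, skip the regex stage
        if regex_engine(url):
            to_verify.append(url)  # would be queued for Hulk
    return to_verify


sample = [
    "/static/app.js",
    "/search?q=hotels",
    "/search?q=1' UNION SELECT password FROM users--",
    "/item?id=(42)",
]
suspicious = detect(sample)
print(suspicious)
```

Only the third sample reaches the verification list: the first is whitelisted, the second is scored as normal, and the fourth is flagged by the stand-in model but matches no regex rule.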
Benefits of Adding Machine Learning :
Fast processing of most traffic, reducing Kafka backlog.
Ability to compare ML and regex results, allowing rule refinement.
Reduced reliance on regex for feature extraction, improving efficiency.
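One way to picture the second benefit is to tally where the two engines disagree. The sketch below is an assumption about how such a comparison could look, not the team's actual tooling; `compare_verdicts` and the sample records are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# (request, ml_says_attack, regex_says_attack)
Record = Tuple[str, bool, bool]


def compare_verdicts(records: List[Record]):
    """Tally agreement and collect the requests where the engines disagree."""
    summary: Counter = Counter()
    ml_only: List[str] = []
    regex_only: List[str] = []
    for request, ml_hit, regex_hit in records:
        if ml_hit and regex_hit:
            summary["both"] += 1
        elif ml_hit:
            summary["ml_only"] += 1
            ml_only.append(request)  # possible missing rule, or ML false positive
        elif regex_hit:
            summary["regex_only"] += 1
            regex_only.append(request)  # possible over-broad rule, or ML miss
        else:
            summary["neither"] += 1
    return summary, ml_only, regex_only


records = [
    ("/search?q=hotels", False, False),
    ("/p?id=1 union select 1", True, True),
    ("/p?x=%3Cscript%3E", True, False),
    ("/docs/select-a-union-plan", False, True),
]
summary, ml_only, regex_only = compare_verdicts(records)
print(dict(summary))
```

Requests in `ml_only` are candidates for a new or widened rule, while requests in `regex_only` are worth reviewing as possibly over-broad rules. Either list can also expose model errors, so both need manual review.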
Defining the Target Problem : The task is binary classification (attack vs. normal) with two requirements: a miss rate below 10%, and fast prediction, which rules out algorithms such as K‑NN whose prediction cost grows with the size of the training set.
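The miss-rate requirement can be written as a small check. Miss rate here means the share of real attacks the model fails to flag, i.e. false negatives divided by all attacks, which equals one minus recall. The function names and the counts are made up for illustration.

```python
def miss_rate(true_positives: int, false_negatives: int) -> float:
    """Fraction of real attacks that were not detected (1 - recall)."""
    total_attacks = true_positives + false_negatives
    if total_attacks == 0:
        raise ValueError("no attack samples to evaluate")
    return false_negatives / total_attacks


def meets_target(true_positives: int, false_negatives: int,
                 max_miss_rate: float = 0.10) -> bool:
    """True when the miss rate is strictly below the allowed maximum."""
    return miss_rate(true_positives, false_negatives) < max_miss_rate


# Hypothetical counts: 1000 attacks in the test set, 60 of them missed.
rate = miss_rate(true_positives=940, false_negatives=60)
print(f"miss rate = {rate:.2f}, recall = {1 - rate:.2f}")
```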
Data Collection and Feature Engineering : Labeled data were gathered from Elasticsearch logs, dynamic IP blacklists, and custom WAF alerts. The first version used regex‑based count features on request parameters (e.g., occurrences of "eval", quotes, brackets). Example feature extraction code:
import re
def get_evil_eval(url):
    # Count case-insensitive occurrences of "eval" in the URL.
    return len(re.findall("(eval)", url, re.IGNORECASE))
These regex‑based features were later replaced by TF‑IDF because of coverage and performance issues. Data cleaning steps included optimizing existing regexes, adding dynamic IP blacklists, using raw traffic as white samples, deduplication, removing encrypted requests, and applying a custom blacklist to filter mislabeled samples.
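To make the TF‑IDF step concrete, here is a minimal standard-library sketch. It is not the article's code: the tokenizer, the function names, and the sample parameters are assumptions, and the weighting uses one common smoothed variant (term count divided by document length, times ln((1 + N) / (1 + df)) + 1). A production pipeline would more likely rely on a library implementation such as scikit-learn's `TfidfVectorizer` or Spark MLlib's `HashingTF` and `IDF`.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Words, plus every punctuation character as its own token, so that quotes
# and brackets in a payload become features too.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(param: str) -> List[str]:
    """Lower-case the parameter string and split it into tokens."""
    return TOKEN_RE.findall(param.lower())


def fit_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Term frequency (count / length) times idf, for tokens seen when fitting."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical request parameters: two benign queries and one injected payload.
params = [
    "q=hotels in shanghai",
    "q=flights to beijing",
    "q=eval(alert(1))",
]
docs = [tokenize(p) for p in params]
idf = fit_idf(docs)
weights = tfidf(docs[2], idf)
for token, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {weight:.3f}")
```

Tokens that occur in every sample, such as `q` and `=`, get the minimum idf of 1, while tokens that appear only in the payload, such as `eval` and the brackets, are weighted higher. No hand-written pattern per keyword is needed, which is the coverage advantage over regex count features.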
Model Training and Evaluation : Models were prototyped with sklearn and trained for production with Spark MLlib. The labeled data were split 50% / 50% into training and test sets, and hyper‑parameters were tuned with GridSearchCV, which cross‑validates each candidate setting. Evaluation used confusion matrices and classification reports. Sample evaluation code:
from sklearn import metrics
# expected: true labels of the test set; predicted: the model's output for it
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))
Results showed a recall of 0.94 (≈6% miss rate) and acceptable precision, indicating the model met the defined goals.
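For readers unfamiliar with the output of the call above, the sketch below rebuilds a 2x2 confusion matrix by hand in the layout scikit-learn uses for labels 0 and 1 (rows are true labels, columns are predicted labels, so the matrix reads `[[tn, fp], [fn, tp]]`) and derives the attack-class precision and recall from it. The label lists are invented for illustration and do not reproduce the article's results.

```python
from typing import List


def confusion_matrix_2x2(expected: List[int], predicted: List[int]) -> List[List[int]]:
    """Return [[tn, fp], [fn, tp]] for labels where 1 = attack and 0 = normal."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    matrix = [[0, 0], [0, 0]]
    for truth, guess in zip(expected, predicted):
        matrix[truth][guess] += 1  # row = true label, column = predicted label
    return matrix


# Hypothetical test-set labels: 4 attacks and 6 normal requests.
expected = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

matrix = confusion_matrix_2x2(expected, predicted)
(tn, fp), (fn, tp) = matrix
precision = tp / (tp + fp)  # share of flagged requests that are real attacks
recall = tp / (tp + fn)     # share of real attacks that were flagged
print("Confusion matrix:\n%s" % matrix)
print("precision=%.2f recall=%.2f miss rate=%.2f" % (precision, recall, 1 - recall))
```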
Online Deployment and Continuous Optimization : The trained model was integrated into the Nile framework with toggle switches for the ML engine and regex engine. Continuous monitoring, automatic rule generation, and periodic retraining are performed to maintain performance.
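The article does not say how the toggle switches behave, so the following is only one plausible reading: each engine can be turned off independently, and when both are on the ML engine runs first as a pre-filter. `EngineSwitches`, `is_suspicious`, and the two stand-in engines are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

Engine = Callable[[str], bool]


@dataclass
class EngineSwitches:
    """Hypothetical runtime flags for the two detection engines."""
    ml_enabled: bool = True
    regex_enabled: bool = True


def is_suspicious(request: str, switches: EngineSwitches,
                  ml_engine: Engine, regex_engine: Engine) -> bool:
    """Apply whichever engines are switched on; ML runs first as a pre-filter."""
    if switches.ml_enabled and not ml_engine(request):
        return False  # the model scored the request as normal
    if switches.regex_enabled:
        return regex_engine(request)
    # Regex is off: trust the ML verdict if ML is on; with both off, flag nothing.
    return switches.ml_enabled


# Stand-in engines for demonstration only.
def ml(request: str) -> bool:
    return "'" in request or "<" in request


def regex(request: str) -> bool:
    return "union select" in request.lower()


request = "/p?id=1' or '1'='1"
print(is_suspicious(request, EngineSwitches(), ml, regex))
print(is_suspicious(request, EngineSwitches(regex_enabled=False), ml, regex))
```

Being able to switch the ML stage off means a misbehaving model can be bypassed without a redeploy, at the cost of sending all non-whitelisted traffic back through the regex engine.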
Future Outlook :
Improve handling of non‑standard JSON/XML payloads.
Introduce multi‑class classification to identify specific attack types.
Extend the approach to other domains such as malicious comments, image detection, and random domain detection.
Migrate from the RDD‑based Spark MLlib API to the DataFrame‑based Spark ML API for better tooling.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.
