Sliding Window and SVM Based Web Crawler Detection System Design
This article describes a flexible web crawler identification approach that combines sliding‑window data collection with Support Vector Machine classification, detailing the underlying concepts, feature extraction, system architecture, client‑server interaction, and deployment steps for practical use.
Background : Traditional single‑strategy crawler detection has become ineffective: crawlers now rotate proxy IPs, randomize request intervals, and simulate real user behavior. To address this, a flexible and intelligent detection method based on sliding windows and machine learning is proposed.
Design Idea : A sliding window aggregates user requests over time or count dimensions. When the window reaches a predefined size, the collected data triggers a detection cycle; after processing, the oldest half of the data is discarded and the window continues to slide forward.
The article illustrates three stages: (1) initial accesses that do not fill the window, (2) window filling that initiates a detection run, and (3) post‑detection cleanup and continuation.
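The window mechanics above can be sketched in a few lines of Java. This is a minimal illustration, not the production implementation; the class and method names (`SlidingWindow`, `add`) are invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Count-based sliding window: collects requests until a predefined size is
 * reached, hands a snapshot to the caller for a detection run, then discards
 * the oldest half so the window keeps sliding forward.
 */
public class SlidingWindow<T> {
    private final int capacity;
    private final Deque<T> buffer = new ArrayDeque<>();

    public SlidingWindow(int capacity) {
        this.capacity = capacity;
    }

    /** Add a request; returns a snapshot for detection when the window fills, else null. */
    public List<T> add(T request) {
        buffer.addLast(request);
        if (buffer.size() < capacity) {
            return null; // stage 1: window not yet full
        }
        // stage 2: window full, trigger a detection cycle on a snapshot
        List<T> snapshot = new ArrayList<>(buffer);
        // stage 3: discard the oldest half and continue sliding
        for (int i = 0; i < capacity / 2; i++) {
            buffer.pollFirst();
        }
        return snapshot;
    }

    public int size() {
        return buffer.size();
    }
}
```

Discarding only half the window (rather than clearing it) keeps some history in scope, so behavior that straddles a window boundary still contributes to the next detection run.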
SVM Overview : Support Vector Machine (SVM) is employed as a generic classification model. After training on labeled user and crawler logs, the SVM predicts whether new request sequences belong to a crawler.
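At prediction time, a trained linear SVM reduces to scoring a feature vector against learned weights: sign(w·x + b). The sketch below shows only that scoring step; the weights and bias are placeholders standing in for values the article says come from training on labeled user and crawler logs, and the class name `LinearSvm` is invented here.

```java
/**
 * Minimal linear-SVM decision function: classifies a feature vector as
 * crawler or user via sign(w . x + b). Weights/bias would come from an
 * offline training step on labeled logs.
 */
public class LinearSvm {
    private final double[] weights;
    private final double bias;

    public LinearSvm(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    /** Returns true if the window's feature vector falls on the crawler side. */
    public boolean isCrawler(double[] features) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score > 0;
    }
}
```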
Combining Sliding Window and SVM : The system first extracts a sequence of requests from the sliding window, derives feature vectors (e.g., request counts, UA counts, referer ratios, success status ratios, cookie ratios, trace URL ratios, etc.), and feeds these vectors into the trained SVM for prediction.
Feature Set includes:
- size – window size, used for average response time calculations
- urlCount – number of valid URLs in the window
- uaCount – number of distinct User‑Agent strings
- referRatio – proportion of non‑empty, valid referers
- successStatusRatio – proportion of successful responses
- cookieIdentityRatio – proportion of requests carrying cookies (future work: validate cookie content)
- traceUrlRatio – proportion of trace URLs generated by front‑end JS (crawlers typically do not execute JS)
- userPath – business‑specific path matching (not implemented due to time constraints)
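A subset of these features can be derived from a window of request records as follows. This is an illustrative sketch: the `Req` record, the `FeatureExtractor` class, and the `/trace` URL prefix used to spot front‑end trace calls are all assumptions, not details from the source.

```java
import java.util.List;

/** Sketch: derive a feature vector from one sliding window of requests. */
public class FeatureExtractor {
    /** Simplified request record mirroring a few client-log fields. */
    public static class Req {
        final String url, userAgent, referer, statusCode;
        final boolean hasCookie;
        public Req(String url, String ua, String ref, String status, boolean cookie) {
            this.url = url; this.userAgent = ua; this.referer = ref;
            this.statusCode = status; this.hasCookie = cookie;
        }
    }

    /** Order: size, uaCount, referRatio, successStatusRatio, cookieIdentityRatio, traceUrlRatio. */
    public static double[] extract(List<Req> window) {
        double n = window.size();
        long uaCount = window.stream().map(r -> r.userAgent).distinct().count();
        double referRatio = window.stream()
                .filter(r -> r.referer != null && !r.referer.isEmpty()).count() / n;
        double successRatio = window.stream()
                .filter(r -> r.statusCode.startsWith("2")).count() / n;
        double cookieRatio = window.stream().filter(r -> r.hasCookie).count() / n;
        // Crawlers rarely execute JS, so they rarely hit JS-generated trace URLs.
        double traceRatio = window.stream()
                .filter(r -> r.url.startsWith("/trace")).count() / n;
        return new double[]{n, uaCount, referRatio, successRatio, cookieRatio, traceRatio};
    }
}
```

The resulting vector is what would be handed to the trained SVM for prediction.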
System Architecture : The solution follows a client‑server model.
Client side :
- Fetches the remote blacklist from cache and applies configurable interception strategies.
- Standardizes log output to a unified format (example Java fields shown below).
- Registers with the server at startup for policy synchronization.
Server side :
- SVM learning module trained on labeled user and crawler logs.
- Online extraction and interception module that periodically pulls logs from ELK, builds sliding windows per user, and triggers detection when windows fill.
- Dashboard displaying real‑time detection results and a ranking of suspected crawlers.
- Configuration module for distributing common interception policies to clients.
- Cache layer storing large volumes of sliding‑window data and shared black/white lists.
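The client-side interception path described above reduces to a fast membership check against a locally cached blacklist that the server keeps in sync. A minimal sketch, assuming an in-memory cache; the `BlacklistCache` name and its methods are invented for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Client-side cache of the server-distributed blacklist. The request filter
 * consults it on every request; the server refreshes it via policy sync.
 */
public class BlacklistCache {
    private final Set<String> blockedIps = ConcurrentHashMap.newKeySet();

    /** Called when the server pushes an updated blacklist. */
    public void refresh(java.util.Collection<String> ips) {
        blockedIps.clear();
        blockedIps.addAll(ips);
    }

    /** The client filter calls this per request; true means intercept. */
    public boolean shouldBlock(String ip) {
        return blockedIps.contains(ip);
    }
}
```

Keeping the check local means interception adds no network hop per request; only the periodic blacklist refresh talks to the server.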
Client Log Data Structure (Java) :
private String ip;
private Date requestTime;
private String requestUrl;
private String requestMethod;
private String statusCode;
private String referer;
private String userAgent;
private String qunarGlobal;
private String sessionId;
private String previous;
private String current;
private List<Entity> cookies;
private List<Entity> headers;
How to Use : Integration is low cost: add the client JAR and a filter, ensure logs are shipped to ELK, and provide the log endpoint to the server for configuration. After registration, the system begins real‑time crawler detection with minimal operational overhead.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.