Sliding Window and SVM Based Web Crawler Detection System Design
This article describes a flexible web crawler identification approach that combines sliding‑window data collection with Support Vector Machine classification, detailing the underlying concepts, feature extraction, system architecture, client‑server interaction, and deployment steps for practical use.
Background : Traditional single‑strategy crawler detection has become ineffective: crawlers now rotate proxy IPs, randomize request intervals, and simulate real user behavior. To address this, a flexible and intelligent detection method based on sliding windows and machine learning is proposed.
Design Idea : A sliding window aggregates user requests over time or count dimensions. When the window reaches a predefined size, the collected data triggers a detection cycle; after processing, the oldest half of the data is discarded and the window continues to slide forward.
The article illustrates three stages: (1) initial accesses that do not fill the window, (2) window filling that initiates a detection run, and (3) post‑detection cleanup and continuation.
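The window mechanics above can be sketched in a few lines of Java. This is a minimal illustration, not the production implementation; the class and method names (`SlidingWindow`, `add`) are invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Count-based sliding window: collects requests until a predefined size is
 * reached, hands a snapshot to the caller for a detection run, then discards
 * the oldest half so the window keeps sliding forward.
 */
public class SlidingWindow<T> {
    private final int capacity;
    private final Deque<T> buffer = new ArrayDeque<>();

    public SlidingWindow(int capacity) {
        this.capacity = capacity;
    }

    /** Add a request; returns a snapshot for detection when the window fills, else null. */
    public List<T> add(T request) {
        buffer.addLast(request);
        if (buffer.size() < capacity) {
            return null; // stage 1: window not yet full
        }
        // stage 2: window full, trigger a detection cycle on a snapshot
        List<T> snapshot = new ArrayList<>(buffer);
        // stage 3: discard the oldest half and continue sliding
        for (int i = 0; i < capacity / 2; i++) {
            buffer.pollFirst();
        }
        return snapshot;
    }

    public int size() {
        return buffer.size();
    }
}
```

Discarding only half the window (rather than clearing it) keeps some history in scope, so behavior that straddles a window boundary still contributes to the next detection run.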
SVM Overview : Support Vector Machine (SVM) is employed as a generic classification model. After training on labeled user and crawler logs, the SVM predicts whether new request sequences belong to a crawler.
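At prediction time, a trained linear SVM reduces to scoring a feature vector against learned weights: sign(w·x + b). The sketch below shows only that scoring step; the weights and bias are placeholders standing in for values the article says come from training on labeled user and crawler logs, and the class name `LinearSvm` is invented here.

```java
/**
 * Minimal linear-SVM decision function: classifies a feature vector as
 * crawler or user via sign(w . x + b). Weights/bias would come from an
 * offline training step on labeled logs.
 */
public class LinearSvm {
    private final double[] weights;
    private final double bias;

    public LinearSvm(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    /** Returns true if the window's feature vector falls on the crawler side. */
    public boolean isCrawler(double[] features) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score > 0;
    }
}
```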
Combining Sliding Window and SVM : The system first extracts a sequence of requests from the sliding window, derives feature vectors (e.g., request counts, UA counts, referer ratios, success status ratios, cookie ratios, trace URL ratios, etc.), and feeds these vectors into the trained SVM for prediction.
Feature Set includes:
- size – window size, used for average response time calculations
- urlCount – number of valid URLs in the window
- uaCount – number of distinct User‑Agent strings
- referRatio – proportion of non‑empty, valid referers
- successStatusRatio – proportion of successful responses
- cookieIdentityRatio – proportion of requests carrying cookies (future work: validate cookie content)
- traceUrlRatio – proportion of trace URLs generated by front‑end JS (crawlers typically do not execute JS)
- userPath – business‑specific path matching (not implemented due to time constraints)
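A subset of these features can be derived from a window of request records as follows. This is an illustrative sketch: the `Req` record, the `FeatureExtractor` class, and the `/trace` URL prefix used to spot front‑end trace calls are all assumptions, not details from the source.

```java
import java.util.List;

/** Sketch: derive a feature vector from one sliding window of requests. */
public class FeatureExtractor {
    /** Simplified request record mirroring a few client-log fields. */
    public static class Req {
        final String url, userAgent, referer, statusCode;
        final boolean hasCookie;
        public Req(String url, String ua, String ref, String status, boolean cookie) {
            this.url = url; this.userAgent = ua; this.referer = ref;
            this.statusCode = status; this.hasCookie = cookie;
        }
    }

    /** Order: size, uaCount, referRatio, successStatusRatio, cookieIdentityRatio, traceUrlRatio. */
    public static double[] extract(List<Req> window) {
        double n = window.size();
        long uaCount = window.stream().map(r -> r.userAgent).distinct().count();
        double referRatio = window.stream()
                .filter(r -> r.referer != null && !r.referer.isEmpty()).count() / n;
        double successRatio = window.stream()
                .filter(r -> r.statusCode.startsWith("2")).count() / n;
        double cookieRatio = window.stream().filter(r -> r.hasCookie).count() / n;
        // Crawlers rarely execute JS, so they rarely hit JS-generated trace URLs.
        double traceRatio = window.stream()
                .filter(r -> r.url.startsWith("/trace")).count() / n;
        return new double[]{n, uaCount, referRatio, successRatio, cookieRatio, traceRatio};
    }
}
```

The resulting vector is what would be handed to the trained SVM for prediction.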
System Architecture : The solution follows a client‑server model.
Client side :
- Fetches the remote blacklist from cache and applies configurable interception strategies.
- Standardizes log output to a unified format (example Java fields shown below).
- Registers with the server at startup for policy synchronization.
Server side :
- SVM learning module trained on labeled user and crawler logs.
- Online extraction and interception module that periodically pulls logs from ELK, builds sliding windows per user, and triggers detection when windows fill.
- Dashboard displaying real‑time detection results and a ranking of suspected crawlers.
- Configuration module for distributing common interception policies to clients.
- Cache layer storing large volumes of sliding‑window data and shared black/white lists.
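The client-side interception path described above reduces to a fast membership check against a locally cached blacklist that the server keeps in sync. A minimal sketch, assuming an in-memory cache; the `BlacklistCache` name and its methods are invented for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Client-side cache of the server-distributed blacklist. The request filter
 * consults it on every request; the server refreshes it via policy sync.
 */
public class BlacklistCache {
    private final Set<String> blockedIps = ConcurrentHashMap.newKeySet();

    /** Called when the server pushes an updated blacklist. */
    public void refresh(java.util.Collection<String> ips) {
        blockedIps.clear();
        blockedIps.addAll(ips);
    }

    /** The client filter calls this per request; true means intercept. */
    public boolean shouldBlock(String ip) {
        return blockedIps.contains(ip);
    }
}
```

Keeping the check local means interception adds no network hop per request; only the periodic blacklist refresh talks to the server.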
Client Log Data Structure (Java) :
private String ip;
private Date requestTime;
private String requestUrl;
private String requestMethod;
private String statusCode;
private String referer;
private String userAgent;
private String qunarGlobal;
private String sessionId;
private String previous;
private String current;
private List<Entity> cookies;
private List<Entity> headers;
How to Use : Integration is low cost: add the client JAR and a filter, ensure logs are shipped to ELK, and provide the log endpoint to the server for configuration. After registration, the system begins real‑time crawler detection with minimal operational overhead.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.