Evolution and Architecture of Ctrip's Online Risk Control System (Aegis)
This article traces the design, major revisions, and performance optimizations of Ctrip's online risk control platform. It describes the shift from a .NET and SQL Server monolith to a modular Java architecture with a custom TSDB-like counter service, big-data services, and distributed rule execution, built to handle more than 100 million risk events per day.
Ctrip launched its online risk control system in 2011 to combat growing payment fraud. Today the platform processes over 100 million risk events per day and more than 100 billion near-real-time data points, and it supports over 10 000 rules and 20+ models covering payment fraud, resource-grabbing, and other business risks.
The overall architecture consists of a decision engine, Counter service, blacklist, user‑profile, offline processing, offline analytics, and monitoring modules; the system has undergone three major redesigns.
**First version (2011)** – built with .NET services and SQL Server; all decision logic, blacklist, and traffic calculation were implemented directly in the database, leading to severe latency as traffic queries grew.
**Second revision – traffic query performance optimization** – introduced sharding and table partitioning, separating traffic and business databases and using hash‑based data distribution, which dramatically improved query throughput.
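Hash-based data distribution of this kind can be sketched as follows. This is a minimal illustration, not Ctrip's implementation: the class name `ShardRouter`, the use of CRC32, the key format, and the `traffic_N` table naming are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical router that maps a traffic key to one of N database shards. */
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Returns a shard index in [0, shardCount) that is stable for a given key. */
    int shardFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative 32-bit checksum held in a long.
        return (int) (crc.getValue() % shardCount);
    }

    /** Assumed naming scheme for the physical table behind a shard. */
    String tableFor(String key) {
        return "traffic_" + shardFor(key);
    }
}
```

Because the same key always maps to the same shard, a traffic query for one user or card touches a single shard instead of scanning one large table. A plain modulo scheme like this one reshuffles most keys when the shard count changes, which is one reason production systems often prefer consistent hashing or a fixed number of logical partitions.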
New business demands followed: faster integration, access to external data sources, richer rule logic, capacity for 10× traffic growth while keeping responses under one second, and a migration from .NET to Java.
**Third revision – Aegis 3.0** – migrated the core to Java, modularized services, and adopted Drools scripts for rule authoring, enabling rule deployment within minutes. The engine now has synchronous (real‑time) and asynchronous (validation, data distribution) components, both stateless for easy scaling.
The custom Counter Server, a TSDB‑like service, provides arbitrary‑precision, arbitrary‑window queries in ≤5 ms, handling billions of daily queries and boosting performance by orders of magnitude.
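The idea behind a TSDB-like counter can be illustrated with time-bucketed counts that are summed over whatever window a rule asks for. The sketch below is an assumption-laden, in-memory illustration only: the class `WindowCounter`, its methods, and the key format are hypothetical, and nothing here reflects how the real Counter Server stores data or reaches its latency target.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-memory counter: events are counted per key in fixed-size time buckets. */
final class WindowCounter {
    private final long bucketMillis;
    private final ConcurrentHashMap<String, ConcurrentSkipListMap<Long, LongAdder>> series =
            new ConcurrentHashMap<>();

    /** bucketMillis sets the precision: smaller buckets give finer window boundaries. */
    WindowCounter(long bucketMillis) {
        if (bucketMillis <= 0) {
            throw new IllegalArgumentException("bucketMillis must be positive");
        }
        this.bucketMillis = bucketMillis;
    }

    private long bucketStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, bucketMillis);
    }

    /** Records one event for the key at the given timestamp. */
    void record(String key, long timestampMillis) {
        series.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
              .computeIfAbsent(bucketStart(timestampMillis), b -> new LongAdder())
              .increment();
    }

    /**
     * Counts events whose bucket falls in the window. The window start is rounded
     * down to a bucket boundary, so results are exact only at bucket granularity.
     */
    long count(String key, long fromMillis, long toMillis) {
        ConcurrentSkipListMap<Long, LongAdder> buckets = series.get(key);
        if (buckets == null || toMillis < fromMillis) {
            return 0;
        }
        long total = 0;
        for (LongAdder bucket : buckets.subMap(bucketStart(fromMillis), true, toMillis, true).values()) {
            total += bucket.sum();
        }
        return total;
    }
}
```

A rule such as "more than N payments on this card in the last 10 minutes" then becomes a single `count` call with the matching window. The sketch never expires old buckets; a real service would need retention and persistence.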
Additional services include Risk Portrait (real‑time user and order profiling), DataProxy (unified external API with caching guaranteeing 99.9 % of requests under 10 ms), blacklist service, configuration service, event handling platform, and performance & business monitoring services.
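The caching layer in front of external data sources can be pictured as a read-through cache with a time-to-live. The following is a minimal sketch under stated assumptions: `CachingProxy`, the loader function, and the injected clock are hypothetical names for illustration, and the article does not say how DataProxy actually caches or invalidates entries.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical read-through cache in front of a slow external lookup. */
final class CachingProxy<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // stands in for the external API call
    private final long ttlMillis;
    private final LongSupplier clock;      // injected so expiry can be tested

    CachingProxy(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value while it is fresh; otherwise reloads and caches it. */
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = cache.get(key);
        if (entry != null && now < entry.expiresAtMillis) {
            return entry.value;
        }
        V fresh = loader.apply(key);
        cache.put(key, new Entry<>(fresh, now + ttlMillis));
        return fresh;
    }
}
```

Only cache misses pay the cost of the external call, which is how a proxy like this can keep the bulk of requests fast. Concurrent misses for the same key may each call the loader in this sketch; a production version would likely coalesce them.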
Two further subsystems, Sessionizer and DeviceID, built on Ctrip’s proprietary big‑data engine **Chloro**, provide real‑time session aggregation and device fingerprinting, further improving rule accuracy.
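Session aggregation is commonly done by splitting a user's events wherever the gap between consecutive events exceeds an inactivity threshold. The sketch below shows that general technique on a sorted batch of timestamps. It is an assumption for illustration, not the Sessionizer's algorithm or Chloro's API, neither of which the article describes; the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical gap-based sessionization over one user's sorted event timestamps. */
final class GapSessionizer {
    static final class Session {
        final long startMillis;
        long endMillis;
        int events;

        Session(long firstEventMillis) {
            this.startMillis = firstEventMillis;
            this.endMillis = firstEventMillis;
            this.events = 1;
        }
    }

    /**
     * Groups timestamps (ascending order assumed) into sessions. A new session
     * starts whenever the time since the previous event exceeds gapMillis.
     */
    static List<Session> sessionize(long[] sortedTimestamps, long gapMillis) {
        List<Session> sessions = new ArrayList<>();
        Session current = null;
        for (long t : sortedTimestamps) {
            if (current == null || t - current.endMillis > gapMillis) {
                current = new Session(t);
                sessions.add(current);
            } else {
                current.endMillis = t;
                current.events++;
            }
        }
        return sessions;
    }
}
```

Per-session aggregates such as event count and duration can then feed rules, for example flagging a session with an unusually high number of payment attempts. A streaming engine would maintain the same state incrementally per user instead of processing a batch.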
Ongoing performance optimizations include distributed parallel rule execution (reducing average latency to ~200 ms), compiling Drools scripts to Java classes (doubling rule execution speed and bringing overall latency below 100 ms), and implementing a Java model execution engine for random‑forest and logistic‑regression models, delivering an order‑of‑magnitude speedup over Python implementations.
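The latency benefit of parallel rule execution comes from evaluating independent rules concurrently rather than one after another. The sketch below shows the idea inside a single JVM with `CompletableFuture`; the platform described here distributes execution across servers, which this example does not attempt. `ParallelRuleRunner`, `Rule`, and the event-as-map representation are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Predicate;

/** Hypothetical runner that evaluates independent rules concurrently. */
final class ParallelRuleRunner {
    static final class Rule {
        final String name;
        final Predicate<Map<String, Object>> condition;

        Rule(String name, Predicate<Map<String, Object>> condition) {
            this.name = name;
            this.condition = condition;
        }
    }

    /** Returns the names of the rules that fired, in the order the rules were given. */
    static List<String> firedRules(List<Rule> rules, Map<String, Object> event, ExecutorService pool) {
        // Submit every rule first so they run concurrently on the pool.
        List<CompletableFuture<Boolean>> results = new ArrayList<>();
        for (Rule rule : rules) {
            results.add(CompletableFuture.supplyAsync(() -> rule.condition.test(event), pool));
        }
        // Then collect results; total wait is roughly the slowest rule, not the sum.
        List<String> fired = new ArrayList<>();
        for (int i = 0; i < rules.size(); i++) {
            if (results.get(i).join()) {
                fired.add(rules.get(i).name);
            }
        }
        return fired;
    }
}
```

This only works when rules do not depend on each other's output and treat the event as read-only. Rules compiled to plain Java classes, as opposed to interpreted scripts, fit naturally into such a runner because each one is just a callable predicate.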
These enhancements allow the platform to scale simply by adding servers while maintaining sub‑second response times even under 10× traffic growth, and the roadmap now focuses on further platformization and productization of the risk‑control services.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.
