Evolution of Ctrip's Data Platform: From Version 1.0 to 2.0 for Risk Control
This article describes how Ctrip's information security team redesigned its data platform from a simple RabbitMQ‑MySQL pipeline to a scalable, real‑time and offline big‑data architecture using Kafka, Storm, Hadoop, Spark, and a custom count server, dramatically improving processing capacity and supporting risk‑control operations.
In recent years, with the rapid growth of e‑commerce and internet finance, Ctrip has faced a rising volume of malicious activity such as fake registrations, login attacks, and order brushing, which created the need for a robust risk‑control data platform.
Data Platform 1.0 relied on RabbitMQ to collect business data, a data engine for cleaning and calculation, and MySQL for storage. Processing logic was embedded in SQL statements within the code and scheduled via Quartz jobs. This approach worked while data volume was low but soon encountered bottlenecks as traffic exploded.
Key pain points of version 1.0 were:
Single data source heavily dependent on RabbitMQ, causing memory crashes when the engine could not keep up.
SQL logic hard‑coded in the application, requiring code redeployment for any update.
Data Platform 2.0 was rebuilt to address these issues by focusing on three areas: data collection & integration, real‑time & offline computation, and task scheduling with hot‑updates.
1. Data Sources expanded beyond the risk‑control feed to include business logs, behavior logs, and HTTP logs from various BUs. Data is streamed via Kafka or MQ, normalized into unified models, and persisted to HDFS.
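As a rough illustration of normalizing records from different sources into one unified model, here is a minimal Python sketch. The `UnifiedEvent` fields, the source names, and the per-source field mappings are all hypothetical; the article does not describe Ctrip's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class UnifiedEvent:
    """Hypothetical unified model shared by all data sources."""
    source: str
    event_type: str
    ip: str
    user_id: str
    timestamp_ms: int


# Assumed mapping from unified field name to the raw field name per source.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "business_log": {
        "event_type": "action", "ip": "client_ip",
        "user_id": "uid", "timestamp_ms": "ts",
    },
    "http_log": {
        "event_type": "path", "ip": "remote_addr",
        "user_id": "user", "timestamp_ms": "time_ms",
    },
}


def normalize(source: str, raw: Dict[str, Any]) -> UnifiedEvent:
    """Convert one raw record into the unified model, coercing field types."""
    mapping = FIELD_MAPS[source]  # raises KeyError for an unknown source
    return UnifiedEvent(
        source=source,
        event_type=str(raw.get(mapping["event_type"], "")),
        ip=str(raw.get(mapping["ip"], "")),
        user_id=str(raw.get(mapping["user_id"], "")),
        timestamp_ms=int(raw.get(mapping["timestamp_ms"], 0)),
    )
```

Keeping the mapping in data rather than in per-source code means a new source only needs a new `FIELD_MAPS` entry, which fits the goal of integrating many BUs behind one model.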
2. Streaming & Real‑Time Computation uses Storm to consume Kafka/MQ streams, applying statistical rules in bolts, followed by a custom count server that shards data for further processing. Results are cached and stored in Redis for rapid incremental calculations, achieving sub‑second response times. A timeout monitor handles data back‑pressure.
The count server stores events in time slots, enabling fast aggregation such as counting IP accesses within a 10‑minute window without costly DB queries.
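The time-slot idea can be sketched in a few lines of Python. This is an in-memory illustration only, not the count server's actual design: the class name `SlotCounter`, the one-minute slot width, and the ten-slot window are assumptions chosen to match the 10-minute example.

```python
from collections import defaultdict
from typing import Dict


class SlotCounter:
    """Counts events per key in fixed-width time slots (illustrative only)."""

    def __init__(self, slot_seconds: int = 60, window_slots: int = 10) -> None:
        self.slot_seconds = slot_seconds
        self.window_slots = window_slots
        # key -> {slot_index: number of events that fell into that slot}
        self._counts: Dict[str, Dict[int, int]] = defaultdict(dict)

    def _slot_of(self, ts: float) -> int:
        return int(ts // self.slot_seconds)

    def add(self, key: str, ts: float) -> None:
        """Record one event for `key` at Unix time `ts` (seconds)."""
        slot = self._slot_of(ts)
        per_key = self._counts[key]
        per_key[slot] = per_key.get(slot, 0) + 1

    def count(self, key: str, now: float) -> int:
        """Sum the slots inside the window that ends at the slot holding `now`."""
        current = self._slot_of(now)
        oldest = current - self.window_slots + 1
        per_key = self._counts.get(key, {})
        # Drop expired slots so memory stays bounded per key.
        for slot in [s for s in per_key if s < oldest]:
            del per_key[slot]
        return sum(c for s, c in per_key.items() if s <= current)


counter = SlotCounter(slot_seconds=60, window_slots=10)
for ts in (0, 30, 300):
    counter.add("1.2.3.4", ts)
hits = counter.count("1.2.3.4", now=599)  # all three events are still in the window
```

Because the window is aligned to slot boundaries, the count covers between nine and ten minutes of data rather than an exact sliding interval. The trade-off is that each query sums at most ten integers per key instead of scanning raw events in a database.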
3. Offline Computation leverages the Hadoop‑Spark stack. As data volume grew, the number of MapReduce and Spark jobs increased; their outputs are written to MySQL, Redis, HBase, and Elasticsearch for downstream services.
4. Task Scheduling & Hot‑Updates decouples tasks into units and dynamic rules. Rules are packaged via Zookeeper and pushed to task units, allowing automatic scheduling and hot‑updates without redeploying the whole service. A testing unit validates rule changes before deployment, improving reliability.
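The separation between a long-running task unit and rules that can be replaced at runtime might look like the Python sketch below. It is a sketch under stated assumptions: the `Rule` shape, the `validate` checks, and the `TaskUnit.publish` method are hypothetical, and the Zookeeper distribution step is stood in for by a direct method call.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: fires when a metric exceeds a threshold."""
    name: str
    metric: str
    threshold: int

    def matches(self, metrics: Dict[str, int]) -> bool:
        return metrics.get(self.metric, 0) > self.threshold


def validate(rules: Sequence[Rule]) -> None:
    """Stand-in for the testing unit: reject a bad rule set before it goes live."""
    names = [r.name for r in rules]
    if len(names) != len(set(names)):
        raise ValueError("duplicate rule names")
    for r in rules:
        if r.threshold < 0:
            raise ValueError(f"negative threshold in rule {r.name}")


class TaskUnit:
    """Keeps running while its rule set is swapped underneath it."""

    def __init__(self, rules: Sequence[Rule]) -> None:
        validate(rules)
        self._lock = threading.Lock()
        self._rules: Tuple[Rule, ...] = tuple(rules)
        self.version = 1

    def publish(self, rules: Sequence[Rule]) -> None:
        """Hot-update: validate first, then swap the whole rule set at once."""
        validate(rules)
        with self._lock:
            self._rules = tuple(rules)
            self.version += 1

    def evaluate(self, metrics: Dict[str, int]) -> List[str]:
        """Return the names of all rules that fire for these metrics."""
        with self._lock:
            rules = self._rules  # take a consistent snapshot
        return [r.name for r in rules if r.matches(metrics)]


unit = TaskUnit([Rule("ip_burst", "ip_hits_10m", 100)])
before = unit.evaluate({"ip_hits_10m": 150})
unit.publish([Rule("ip_burst", "ip_hits_10m", 200)])  # no restart needed
after = unit.evaluate({"ip_hits_10m": 150})
```

Swapping an immutable tuple in one step means an in-flight evaluation sees either the old rule set or the new one, never a mix. A rule set that fails validation leaves the running version untouched.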
5. Results: The new platform processes nearly 3 billion records daily, a 30‑fold increase over version 1.0, while intercepting a large volume of malicious requests and reducing developer involvement through service‑orientation.
Conclusion: Building a data platform that supports business needs requires close alignment with use cases, scalable architecture, and the ability to evolve with growing data demands. Ctrip's experience demonstrates that a well‑designed, extensible platform can deliver both operational efficiency and robust risk‑control capabilities.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.
