Design and Architecture of Ctrip's Real‑Time User Behavior Service
The article describes how Ctrip rebuilt its real‑time user behavior platform using a Java‑based stack (Kafka, Storm, Redis, MySQL) to achieve millisecond‑level latency, high availability, scalable performance, and robust handling of traffic spikes, failures, and data back‑pressure.
Ctrip's real‑time user behavior service is a foundational component used in scenarios such as recommendation, dynamic ads, user profiling, and browsing history. The original system suffered from incomplete data coverage, inconsistent output formats, and a web‑service‑based log processor that could not easily scale for traffic spikes.
To meet the rapidly growing travel market and stricter real‑time, stability, and performance requirements, a new architecture was designed. Data flows in two directions: a processing stream and an output stream. Client logs (App/Online/H5) are sent to a Collector Service, which forwards messages to a distributed queue (Kafka). A stream‑processing layer (Storm) reads from Kafka, processes the events, and writes results to a data layer composed of Redis caches and a MySQL cluster.
The output stream simply pulls processed data from the data layer and serves it to downstream services such as recommendation engines or front‑end history displays. The technology stack includes Java, Kafka, Storm, Redis, MySQL, Tomcat, and Spring.
Key real‑time capabilities are achieved with Storm, which handles traffic spikes by scaling out workers, supports at‑least‑once delivery, and provides simple deployment (jar upload and topology restart). A dual‑queue design separates fresh data (Queue1) from exceptional data (Queue2); workers consume fresh data while a RetryWorker processes failed records, applying back‑off and retry policies.
Availability is ensured through full‑stack clustering (Kafka, Storm, stateless web services, Redis master‑slave, MySQL master‑slave), eliminating single points of failure. When components become unavailable, the system degrades gracefully: for example, if MySQL is down, Storm continues writing to Redis and a retry queue; once MySQL recovers, the retry queue is consumed to restore consistency. Circuit‑breaker patterns (Netflix Hystrix) and multi‑dimensional rate limiting further protect downstream services.
Performance and scalability requirements demand support for a ten‑fold capacity increase. Redis uses consistent hashing for seamless scaling, while MySQL is horizontally sharded with power‑of‑two shard counts, allowing rapid promotion of replicas during expansion. Migration leverages MySQL replication to minimize manual steps and downtime.
Deployment is straightforward: Storm topologies are updated by uploading new JARs and restarting; Kafka offsets allow the system to resume processing from the correct position. Multi‑version operation is supported via backup jobs for legacy behavior record formats.
Overall, the redesign delivers sub‑300 ms end‑to‑end latency, processes around 2 billion events per day, and serves roughly 80 million queries daily with average response times of 6 ms, meeting Ctrip's stringent real‑time, availability, and scalability goals.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
