How Ctrip Built a Scalable Real‑Time User Behavior System with Kafka, Storm, and Redis
Ctrip’s real‑time user behavior service, a core foundation for recommendations, ads, and user profiling, was redesigned with a Java‑based stack (Kafka, Storm, Redis, MySQL) to achieve millisecond‑level latency, high availability, and ten‑fold scalability across processing and output flows.
1. Architecture
To meet the growing demand of the travel market, Ctrip redesigned its real‑time user behavior system using a Java‑based stack: Kafka, Storm, Redis, MySQL, Tomcat, and Spring.
The new architecture separates data into two flows: a processing flow and an output flow.
In the processing flow, client logs (App/Online/H5) are sent to a Collector Service, which forwards messages to a distributed queue. A stream‑processing framework (Storm) reads from the queue, processes the data, and writes it to a data layer composed of a distributed cache (Redis) and a MySQL cluster.
The output flow is simpler: a web service pulls data from the data layer and serves it to downstream callers such as recommendation systems or front‑end history displays.
Figure 1: Logical view of the real‑time user behavior system
2. Real‑time Performance
The system must handle traffic spikes, failure retries, data backlogs, and bug‑induced reprocessing. Storm addresses spikes with its ability to scale out by adjusting worker counts and restarting topologies.
Storm provides at‑least‑once delivery, which Ctrip adopts to minimize data loss while supporting idempotent processing.
For retry scenarios, a dual‑queue design is used: Queue1 holds fresh data, while Queue2 stores failed records. Workers consume from Queue1; on failure they write to Queue2, allowing a separate RetryWorker to handle retries without blocking fresh data processing.
Figure 2: Dual‑queue design
When data backlog occurs, workers can reset their consumption cursor to the latest offset, while a BackupWorker processes the missed range and then stops.
Figure 3: Backlog resolution strategy
3. Availability
High availability is critical because many downstream services depend on this foundation. The system eliminates single points of failure through full‑stack clustering: Kafka and Storm run in clustered mode, web services are stateless behind load balancers, and Redis and MySQL use primary‑secondary replication.
When a component becomes unavailable, graceful degradation is applied. For example, if MySQL fails, a DB‑degrade switch stops writes to MySQL while Storm continues writing to Redis. Data written to Redis remains queryable, and failed writes are queued in a Kafka retry topic to be replayed once MySQL recovers.
Figure 4: DB degradation flow
Redis degradation follows a similar pattern, with reduced throughput but continued service. Circuit‑breaker protection (Netflix Hystrix) prevents cascading failures by short‑circuiting calls after a failure threshold is reached.
Figure 5: Circuit breaker pattern
4. Performance & Scalability
To support ten‑fold traffic growth, Ctrip prepared both Redis and MySQL for horizontal scaling. Redis uses consistent hashing, allowing new nodes to be added with minimal data movement.
MySQL is sharded using a power‑of‑two scheme. Expansion involves promoting a replica to primary, updating DNS, and synchronizing data, all completed within minutes with minimal downtime thanks to the built‑in DB‑degrade mechanism.
5. Deployment
Storm deployments are straightforward: upload the new JAR and restart the topology. Offsets stored in Kafka enable the system to resume processing from the correct position after a restart.
When multiple versions of behavior records must run concurrently, a backup job is created to handle the legacy version alongside the primary pipeline.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
