Ctrip’s Scalable Real‑Time User Behavior System with Kafka, Storm, Redis
This article details Ctrip’s redesign of its real‑time user behavior service, covering the new architecture, data flow, use of Java, Kafka, Storm, Redis, and MySQL, and how it achieves high real‑time performance, availability, scalability, and fault‑tolerance to support massive travel‑industry traffic.
1. Architecture
Ctrip’s real‑time user behavior service is a foundational service used in scenarios such as recommendation, dynamic ads, user profiling, and browsing history. To meet growing data volume and stricter real‑time, stability, and performance requirements, the system was redesigned with a two‑stream architecture: a processing stream and an output stream.
In the processing stream, client logs (App/Online/H5) are uploaded to a Collector Service, which forwards messages to a distributed queue. A stream‑processing framework (Storm) reads from the queue, processes the data, and writes it to a data layer composed of distributed cache (Redis) and a MySQL cluster.
The output stream is simpler: a Web Service pulls data from the data layer and serves it to callers, either internal services (e.g., recommendation) or front‑end components (e.g., browsing history). The technology stack includes Java, Kafka, Storm, Redis, MySQL, Tomcat, and Spring.
Java – mature big‑data components and strong internal adoption.
Kafka/Storm – proven distributed messaging and stream‑processing with solid operational support.
Redis – HA, SortedSet, and expiration features meet system needs.
MySQL – stable and performant at billion‑row scale, with horizontal scalability.
The system processes roughly 2 billion events per day with a latency of about 300 ms from ingestion to availability, and handles around 80 million query requests daily with an average latency of 6 ms.
2. Real‑Time Performance
Storm addresses traffic spikes by scaling out workers; it processes messages at millisecond granularity and supports at‑least‑once delivery with idempotent handling. A dual‑queue design separates fresh data (Queue1) from exception data (Queue2), allowing normal processing to continue while retries are managed independently.
Compensation strategies include retry workers that consume from Queue2, applying back‑off policies, and mechanisms to reset consumer cursors for backlog clearance.
3. Availability
The system eliminates single points of failure through full‑stack clustering: Kafka and Storm are natively clustered, Web services are stateless behind load balancers, and Redis/MySQL use primary‑backup deployments. When components become unavailable, graceful degradation routes data to Redis or MySQL fallback paths, and a circuit‑breaker (Hystrix) prevents cascading failures.
Rate limiting is applied at multiple levels (AppId/IP, service, interface) to protect the system from abusive calls.
4. Performance & Scalability
To support ten‑fold capacity growth, the data layer employs consistent‑hashing for Redis expansion and horizontal sharding for MySQL. Sharding follows powers‑of‑two for easier scaling, and MySQL master‑slave promotion is automated via DNS switching, reducing migration time to minutes.
5. Deployment
Storm deployments are straightforward: uploading a new JAR and restarting the topology. Kafka offsets allow the system to resume processing from the correct position after restarts. For versioned behavior records, a backup job runs older versions alongside the current one.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
