Big Data 13 min read

Design and Implementation of Ctrip's Real-Time User Behavior System

The article describes how Ctrip redesigned its real-time user behavior service using a Java‑Kafka‑Storm stack with Redis and MySQL, detailing the architecture, real‑time processing, availability, performance, scalability, and deployment strategies to handle billions of events daily.

Ctrip Technology

Apr 13, 2017

Design and Implementation of Ctrip's Real-Time User Behavior System

Ctrip's real‑time user behavior service is a foundational component used in scenarios such as recommendation, dynamic ads, user profiling, and browsing history.

Using the "Guess You Like" feature as an example, the article explains the need for cross‑business real‑time recommendations and the necessity of aggregating user behavior data across services.

The original system suffered from incomplete data coverage, inconsistent output formats, and a web‑service‑based log processing module that was hard to scale for traffic spikes.

To meet growing data volume and higher real‑time, stability, and performance requirements, a new architecture was designed.

In the new design, data flows through two streams: a processing stream and an output stream.

In the processing stream, client logs (App/Online/H5) are sent to a Collector Service, which forwards messages to a distributed queue; a stream‑processing framework reads from the queue, processes the data, and writes it to a data layer composed of distributed cache and database clusters.

In the output stream, a web service pulls data from the data layer and serves it to internal services (e.g., recommendation) or front‑end clients (e.g., browsing history). The technology stack includes Java, Kafka, Storm, Redis, MySQL, Tomcat, and Spring.

Java – widely used internally with mature big‑data components.

Kafka/Storm – Kafka provides a distributed message queue; Storm offers a mature stream‑processing framework with good operational support.

Redis – HA, SortedSet, and expiration features meet system needs.

MySQL – Chosen for stability and performance at billion‑record scale, with designed horizontal scalability.

The system processes about 2 billion events per day with ~300 ms latency from ingestion to availability, and serves roughly 80 million queries daily with an average 6 ms response time.

Real‑time Guarantees

The system addresses traffic spikes, failure retries, data back‑pressure, and bug‑induced reprocessing using Storm's scaling and at‑least‑once delivery semantics, combined with idempotent processing.

A dual‑queue design separates fresh data (Queue1) from exceptional data (Queue2). Workers consume fresh data from Queue1, while a RetryWorker handles Queue2, applying back‑off or re‑enqueue strategies as needed.

When data backlog occurs, workers can reset their consumption cursor to the latest offset, while a backupWorker processes the missed range and then stops.

Availability

High availability is ensured through full‑stack clustering (Kafka, Storm, stateless web services behind load balancers, Redis and MySQL master‑slave setups) eliminating single points of failure.

Graceful degradation is employed: when MySQL is down, data is written only to Redis and a retry queue; once MySQL recovers, the retry queue is consumed to achieve eventual consistency. Similar logic applies to Redis degradation.

The query service uses Netflix Hystrix for circuit‑breaker patterns, opening a circuit when failure thresholds are exceeded, then entering half‑open and closed states to recover safely.

Rate limiting at multiple dimensions (AppId/IP, service, interface) protects against abusive calls.

Performance & Scalability

To support ten‑fold capacity growth, the data layer (Redis and MySQL) is horizontally scalable. Redis uses consistent hashing for easy expansion; MySQL shards are split using powers‑of‑two, with master‑slave promotion and DNS switch‑over procedures.

The migration leverages MySQL replication to avoid manual sync, completing in minutes with minimal downtime.

Deployment

Storm deployments are simple: upload the new JAR and restart the topology. Offsets stored in Kafka allow the system to resume processing from the correct position after restarts.

For scenarios requiring multiple versions (e.g., different behavior record formats), a backup job runs the historic version alongside the current one.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

system architecture Real-time Processing mysql Storm

Written by

Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.