Building a Real‑Time Computing Platform with Spark Streaming at Ctrip: Design, Implementation, and Lessons Learned
This article describes how Ctrip migrated its large‑scale real‑time platform from JStorm to Spark Streaming, detailing the architectural design, the Muise Spark Core encapsulation, operational metrics, encountered pitfalls, and future plans to adopt Flink and Beam for streaming workloads.
With the rapid development of Internet technologies, users demand higher timeliness, accuracy, and stability for data processing, making the construction of a stable, easy‑to‑use real‑time computing platform with comprehensive monitoring and alerting a major challenge for many companies.
Since Ctrip's first real‑time platform was built in 2015, continuous technical evolution has grown the cluster to over a hundred nodes, supporting hundreds of real‑time applications across various business units. The platform now primarily relies on JStorm and Spark Streaming; an earlier write‑up can be found in the Ctrip real‑time big data platform practice article.
This article focuses on how Ctrip builds a real‑time computing platform based on Spark Streaming, covering the following aspects:
Spark Streaming vs JStorm
Spark Streaming design and encapsulation
Spark Streaming practice at Ctrip
Pitfalls encountered
Future outlook
Spark Streaming vs JStorm
Before adopting Spark Streaming, Ctrip had run JStorm stably for a year and a half. The main reasons for switching include JStorm’s low level of abstraction, lack of built‑in support for windows, state, and SQL, and the higher development complexity for real‑time applications. Spark Streaming provides high‑level operators, tight integration with Spark SQL, and a unified API for both streaming and batch processing, reducing development and maintenance costs.
Additionally, compared with Flink 1.2 at the time, Spark offered better SQL and MLlib support, which suited the many departments already using Spark SQL and MLlib for offline tasks.
The following table summarizes the comparison:
| | Spark Streaming | JStorm |
|---|---|---|
| Guarantee | Exactly Once | At Least Once |
| Latency | High | Very Low |
| Throughput | High | Low |
| Processing Model | Micro‑batch | Streaming |
| Stateful Operation | Yes | No |
| Window Operation | Yes | No |
| SQL Support | Yes | No |
| Level of Abstraction | High Level | Low Level |
Spark Streaming Design and Encapsulation
To embed Spark Streaming seamlessly into the existing platform, Ctrip wrapped it into a suite called Muise Spark Core, which provides metadata management, monitoring, and alerting features.
3.1 Muise Spark Core
Muise Spark Core is an in‑house wrapper layer around Spark Streaming that supports multiple message queues, including Hermes Kafka, native Kafka (Direct Approach), Hermes MySQL, and Qmq (Receiver approach). Its key features are:
Automatic Kafka offset management
Support for Exactly‑Once and At‑Least‑Once semantics
Metric registration system allowing custom metrics
Alerting based on system and user‑defined metrics
Long‑running jobs on YARN with fault‑tolerance mechanisms
3.1.1 Automatic Kafka Offset Management
Muise Spark Core automatically reads and stores Kafka offsets, relieving users from offset handling. It validates offset validity and, when an offset becomes invalid after a long pause, resets it to the oldest valid offset. The following diagram shows a minimal Spark Streaming job using Muise Spark Core:
By default, jobs continue from the last stored offset, but users can also specify the consumption start point in three ways (see the second diagram).
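The offset‑reset rule described above can be sketched as follows. This is an illustrative stand‑in, not Muise Spark Core's internal code (which is not public); the function name and signature are assumptions.

```python
def resolve_start_offset(stored, earliest, latest):
    """Pick a valid starting offset for one Kafka partition.

    Mirrors the rule described above: resume from the stored offset
    when it is still inside the broker's retained range, otherwise
    fall back to the oldest valid offset.
    """
    if stored is None:
        return earliest   # first run: no stored offset, start from the oldest data
    if earliest <= stored <= latest:
        return stored     # stored offset still valid: resume where we left off
    return earliest       # invalidated after a long pause: reset to oldest valid

# A long pause let Kafka retention move `earliest` past the stored offset:
assert resolve_start_offset(stored=120, earliest=500, latest=900) == 500
# Normal restart: resume from the last stored offset.
assert resolve_start_offset(stored=700, earliest=500, latest=900) == 700
```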
3.1.2 Exactly‑Once Implementation
Achieving end‑to‑end Exactly‑Once semantics requires guaranteeing it at the source, processing, and storage stages. Using Kafka Direct API together with Spark RDD operators provides Exactly‑Once guarantees for the source and processing stages. For storage, idempotent or transactional writes (e.g., HDFS append, HBase, Elasticsearch, Redis) are required.
Muise Spark Core modifies Spark’s source code so that DirectKafkaInputStream can accept both From‑Offset and End‑Offset, and stores both offsets for each batch. This design ensures that after a crash or manual restart, the first batch after restart processes exactly the same data as the last batch before the crash.
Depending on how users handle the first batch, they can achieve:
At‑Most‑Once (ignore the first batch)
At‑Least‑Once (process the first batch without deduplication)
Exactly‑Once (use the generated UID based on topic, partition, and offset to deduplicate)
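The deduplication option above keys each record by a UID derived from topic, partition, and offset, so a replayed first batch writes nothing twice. A minimal sketch, assuming an in‑memory set in place of the real external store (Redis, HBase, etc.):

```python
def record_uid(topic, partition, offset):
    """Deterministic UID for one Kafka record: same record, same UID on replay."""
    return f"{topic}-{partition}-{offset}"

class DedupSink:
    """Idempotent sink sketch: skip records whose UID was already written.
    A production deployment would keep `seen` in a shared store such as
    Redis or HBase; a local set stands in here for illustration."""
    def __init__(self):
        self.seen = set()
        self.written = []

    def write(self, topic, partition, offset, value):
        uid = record_uid(topic, partition, offset)
        if uid in self.seen:
            return False          # replayed record from the re-run first batch
        self.seen.add(uid)
        self.written.append(value)
        return True

sink = DedupSink()
assert sink.write("orders", 0, 42, "a") is True
# After a crash, the first batch replays the same From/End offsets:
assert sink.write("orders", 0, 42, "a") is False
assert sink.written == ["a"]
```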
3.1.3 Metrics System
Muise Spark Core extends Spark’s native metrics system with custom metrics and exposes a registration interface for users. Metrics are periodically flushed to Graphite and trigger alerts via the company’s monitoring platform, visualized in Grafana.
Three built‑in metrics are provided:
Fail: alerts when task failures exceed 4 within a batch interval
Ack: alerts when the volume of processed data drops to zero
Lag: alerts when consumption delay exceeds a configured threshold
Because most jobs enable Back‑Pressure, batch processing time is usually within limits, but Kafka may still accumulate data. Therefore, the average difference between processing time and event time is calculated; if it exceeds the batch interval, a lag alert is raised.
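The lag rule above boils down to comparing the average gap between processing time and event time against the batch interval. A minimal sketch (in production the metric flows to Graphite and alerts via the monitoring platform; timestamps here are plain seconds for illustration):

```python
def lag_alert(event_times, processing_times, batch_interval_s):
    """Return True when the average difference between processing time
    and event time exceeds the batch interval, per the rule above."""
    gaps = [p - e for e, p in zip(event_times, processing_times)]
    avg_gap = sum(gaps) / len(gaps)
    return avg_gap > batch_interval_s

# Batch interval of 30s; records are processed ~50s after their event time,
# so data is accumulating in Kafka even though batches finish on time:
assert lag_alert([100, 110], [150, 160], 30) is True
# Records processed ~10s after event time: no alert.
assert lag_alert([100, 110], [110, 120], 30) is False
```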
3.1.4 Fault Tolerance
Beyond Exactly‑Once, running Spark Streaming long‑term on YARN requires additional configurations for stability. Details are covered in the "Long‑running Spark Streaming Jobs on YARN Cluster" article.
Muise Portal also provides periodic health checks for running Spark Streaming jobs. Every five minutes, it queries YARN’s REST API for each job’s Application ID; if a job is not running, it attempts a restart.
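The health check queries YARN's ResourceManager REST API (GET `/ws/v1/cluster/apps/{appId}`), whose JSON body reports the application state. A sketch of the restart decision, with the HTTP call elided and a sample response dict standing in; the function name is an assumption:

```python
def needs_restart(app_response):
    """Decide whether a streaming job should be restarted, given the JSON
    body YARN's ResourceManager REST API returns for one application.
    NEW/SUBMITTED/ACCEPTED are transitional startup states; anything else
    besides RUNNING means a long-running streaming job has stopped."""
    state = app_response.get("app", {}).get("state")
    return state not in ("NEW", "SUBMITTED", "ACCEPTED", "RUNNING")

assert needs_restart({"app": {"state": "RUNNING"}}) is False
assert needs_restart({"app": {"state": "FAILED"}}) is True
# A FINISHED state also means a streaming job terminated unexpectedly:
assert needs_restart({"app": {"state": "FINISHED"}}) is True
```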
Muise Portal
Muise Portal is a management console that supports both Storm and Spark Streaming jobs, offering features such as job creation, JAR upload, start/stop controls, and configuration management. The screenshot below shows the job creation UI.
Spark Streaming jobs run in YARN cluster mode and are submitted via the Spark client in Muise Portal. The following diagram illustrates the job execution flow.
Overall Architecture
The diagram below presents the current overall architecture of Ctrip’s real‑time platform.
Spark Streaming in Ctrip’s Practice
Spark Streaming is applied in four main scenarios at Ctrip:
ETL
Real‑time reporting
Personalized recommendation
Risk control and security
ETL
Compared with tools like Camus or Flume, Spark Streaming supports more complex logic, flexible resource scheduling on YARN, and seamless integration with Spark SQL. An example is the Data Lake of the Vacation department, which uses Spark Streaming for ETL and stores results in Alluxio, while custom metrics monitor data quality.
Real‑time Reporting
Spark Streaming jobs aggregate data per batch and push results to external systems such as Elasticsearch for downstream visualization. The following image shows a real‑time dashboard built for Ctrip IBU.
Personalized Recommendation & Risk Control
Both use cases require online feature extraction and model inference. After integrating Spark Streaming with Spark MLlib, the security team achieved a ten‑fold performance improvement over JStorm and reduced false‑positive rates by 20%.
Pitfalls Encountered
Running Spark Streaming jobs on the same YARN cluster as batch jobs introduces stability risks, especially during cluster upgrades. Future plans include separating the real‑time and batch clusters.
Typical issues observed:
Kafka leader loss causing `java.lang.RuntimeException: No leader found for partition xx`. The fix is to set `spark.streaming.kafka.maxRetries` to a value greater than 1 and configure `refresh.leader.backoff.ms` for the retry interval.
Out‑of‑memory errors (PermGen space) when Spark SQL generates code. Resolved by adding `-XX:MaxPermSize=1024m -XX:PermSize=512m` to `spark.driver.extraJavaOptions`.
Permission denied when Spark SQL tries to create the warehouse directory on HDFS. Resolved by setting `spark.sql.warehouse.dir` to a local path, e.g., `config("spark.sql.warehouse.dir", "file:${system:user.dir}/spark-warehouse")`.
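Two of the fixes above can be applied at submit time; a sketch of the relevant flags (master/deploy mode, class name, and JAR path are placeholders, and `refresh.leader.backoff.ms` is a Kafka consumer property that belongs in the job's kafkaParams map rather than on the command line):

```shell
# Sketch: spark-submit with the retry and PermGen settings mentioned above.
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.streaming.kafka.maxRetries=2 \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=1024m -XX:PermSize=512m" \
  --class com.example.StreamingJob \
  streaming-job.jar
```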
Future Outlook
Current limitations of Spark Streaming include a simple window API, cumbersome Event‑Time aggregations, and a lack of new features in recent releases. Consequently, Ctrip is exploring Flink, which offers advanced watermarks and richer streaming semantics, as well as Apache Beam for unified batch/stream processing.
Flink 1.4 is about to be released, and Structured Streaming in Spark 2.2 adds more capabilities. The team plans to evaluate Flink integration later this year and continue researching Beam and Stream SQL for complex real‑time scenarios.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.