
How Qunar Built Prism: A Real‑Time Data Platform That Halves Deployment Time

This article describes how Qunar’s Prism platform combines ELK, Kafka, Spark, Docker and other open‑source tools to create a real‑time data pipeline that speeds up problem localization, reduces deployment time, and improves resource utilization across development and operations teams.


1. Introduction

Qunar's online real‑time data platform, Prism, is a key tool for locating production problems. Its development was driven by operational needs, and this article is adapted from a talk given at a global operations conference.

2. What is Prism

Prism was built with data visualization as its starting point. It aims to lower the cost of acquiring and operating data and analytics software by providing a single real‑time analysis platform. The red points in the diagram mark its main goals.

Prism provides several services: a real‑time log service (ELK) for data collection, analysis, and visualization; a data bus (Kafka) offering high‑throughput distributed publish‑subscribe messaging shared across departments; a real‑time big‑data analysis system (Spark Streaming / Storm / Flink); and data storage (Elasticsearch as a Service).

It also offers an OLAP/experiment platform (Zeppelin + Spark/Flink) where algorithm engineers can test their models on a Spark cluster before handing the results to engineers for production.

2.1 Data Flow in Prism

Brief overview of Prism’s data‑flow architecture.

The leftmost layer is the collection layer: Rsyslog gathers system logs; QFlume, a custom agent, runs on every physical and virtual machine; Heka previously collected Docker logs; Packetbeat captures network packets; and "Other" represents additional business lines.
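As a concrete sketch of this collection layer, the fragment below shows how an rsyslog instance could forward system logs straight into Kafka via rsyslog's omkafka output module. The broker addresses, topic name, and file path are placeholders for illustration, not Qunar's actual configuration.

```
# /etc/rsyslog.d/50-prism.conf -- hypothetical forwarding sketch
module(load="omkafka")

# Ship every syslog line to the Kafka data bus for downstream ETL.
action(type="omkafka"
       broker=["kafka01:9092", "kafka02:9092", "kafka03:9092"]
       topic="syslog_raw"
       template="RSYSLOG_FileFormat")
```

In this style of setup, the collectors stay dumb and fast; all parsing and enrichment is deferred to the ETL layer behind Kafka.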

All data are aggregated into Kafka, then enter the ETL layer, where Logstash performs initial processing and pushes structured data into Elasticsearch for indexing. Logs are visualized in Kibana for engineers and product managers.
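Logstash's grok filters do structurally the same job as the tiny parser below: turn a raw log line into a structured record before it is indexed into Elasticsearch. The access‑log format and field names here are illustrative assumptions, not Qunar's actual schema.

```python
import re

# Regex for a common access-log shape: client, timestamp, request, status, bytes.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_access_log(line):
    """Parse one access-log line into a dict, or return None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    record = m.groupdict()
    # Cast numeric fields so Elasticsearch can aggregate on them.
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record

line = '10.0.0.1 - - [12/Mar/2016:06:25:24 +0800] "GET /hotel/search HTTP/1.1" 200 512'
print(parse_access_log(line))
```

Once every line is a dict like this, "find all 5xx responses in the last fifteen minutes" becomes a cheap Kibana query instead of a grep across hundreds of machines.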

Data not meant for direct visualization are also sent back to Kafka for further processing by big‑data frameworks such as Spark Streaming and Flink, with some results stored in Elasticsearch.
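As a framework‑free sketch of what a Spark Streaming or Flink job in this layer might compute, the snippet below counts ERROR events per service in fixed 60‑second tumbling windows. The event shape (timestamp, service, level) is an assumption for illustration only.

```python
from collections import Counter, defaultdict

def errors_per_window(events, window_seconds=60):
    """Group (ts, service, level) events into tumbling windows; count ERRORs per service."""
    windows = defaultdict(Counter)
    for ts, service, level in events:
        if level == "ERROR":
            # Align the timestamp down to the start of its window.
            window_start = ts - (ts % window_seconds)
            windows[window_start][service] += 1
    return dict(windows)

events = [
    (5,  "hotel",  "ERROR"),
    (12, "hotel",  "INFO"),
    (30, "flight", "ERROR"),
    (65, "hotel",  "ERROR"),
]
# Two windows: [0, 60) with one error each for hotel and flight, [60, 120) with one for hotel.
print(errors_per_window(events))
```

A real streaming job adds watermarking, checkpointing, and distributed state, but the per‑window aggregation logic is the same shape.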

Other data are written to the storage/database layer. After being loaded by Spark, Flink, or Zeppelin, they are visualized through tools such as Kibana and Zeppelin.

The overall topology may evolve as new tools are added.

2.2 DevOps as the Starting Point

When product bugs appeared, the team's post‑mortem discussions repeatedly highlighted problem localization as the bottleneck. Developers were responsible for both coding and operations, which drove the need for a unified platform.

The team realized that only a handful of engineers knew how to query logs; without a dedicated Ops role, developers had to handle both development and operations, making problem localization painful.

2.3 Solving the Problem with ELK

To lower the cost of data collection and troubleshooting, the team introduced the ELK stack.

ELK consists of Logstash (ETL), Elasticsearch (search and storage), and Kibana (visualization). Kibana provides an easy‑to‑use dashboard for viewing data distributions and detecting faults. The system originated in a hotel project with hundreds of machines, where overnight deployments used to take six to seven hours. After ELK was adopted, deployment time was roughly halved, allowing staff to finish by early morning.
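To illustrate the kind of troubleshooting query ELK makes cheap, the Elasticsearch request body below counts recent ERROR lines per host during a deployment window. The field names (`level`, `hostname`, `@timestamp`) are assumed for illustration and would depend on the actual index mapping.

```
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-15m" } } }
      ]
    }
  },
  "aggs": {
    "errors_per_host": {
      "terms": { "field": "hostname" }
    }
  }
}
```

A query like this answers "which machines started erroring after the rollout?" in seconds, replacing a manual scan of per‑host log files.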

ELK also accelerated issue detection, making troubleshooting faster and more accurate compared with manual log scanning.

3. New Frontiers in Architecture

As new requirements such as complex deployments and real‑time business adjustments emerged, the initial architecture became insufficient. After learning about Docker, the team incorporated Docker, Marathon, and Mesos into the overall design.

These solutions enabled rapid scaling of applications, better hardware utilization, and reduced the cost of data‑software deployment. For example, previously scarce real‑time resources became elastic and centrally managed.
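A Marathon application definition roughly like the sketch below is how such a Docker workload would be declared for Mesos to schedule. The image name, resource figures, and instance count are illustrative placeholders, not Qunar's production values.

```
{
  "id": "/prism/logstash-etl",
  "cpus": 0.5,
  "mem": 1024,
  "instances": 4,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "registry.example.com/prism/logstash:latest",
      "network": "BRIDGE"
    }
  }
}
```

Scaling then becomes a single change to `instances`, which Marathon reconciles against the Mesos cluster, instead of a manual per‑machine deployment.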

Developers no longer need to handle their own releases; the platform starts clusters and abstracts operations away from them.

4. Higher‑Level Data Analysis

Following ELK’s success, more real‑time data needs arose. The team introduced a Spark scheduler on Mesos (Spark on Mesos) and later migrated to Marathon for better cluster management and Docker‑based distribution.

The new setup runs each task as its own containerized cluster, eliminating the need to pull images repeatedly and simplifying stateless deployments.

5. Summary

Qunar built a real‑time system that instantly detects failures, aggregates analysis tools, and resolves issues, effectively lowering the barrier to data‑software deployment and improving resource utilization in a Mesos environment.

Remaining challenges include load imbalance and slow anomaly detection; the team plans to address them with accumulated experience and, in the future, by integrating GPU‑based neural‑network workloads.
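One simple direction for the anomaly‑detection problem is a trailing z‑score check, sketched below: flag any metric point that deviates sharply from the mean of a recent baseline window. The window size and threshold are illustrative choices, not tuned production values.

```python
from statistics import mean, stdev

def zscore_anomalies(series, baseline=10, threshold=3.0):
    """Return indices of points deviating > threshold stddevs from the trailing baseline."""
    anomalies = []
    for i in range(baseline, len(series)):
        window = series[i - baseline:i]
        mu, sigma = mean(window), stdev(window)
        # Skip flat windows (sigma == 0) to avoid division by zero.
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A latency series with one obvious spike at the end.
latencies = [100, 102, 99, 101, 100, 98, 103, 100, 99, 101, 450]
print(zscore_anomalies(latencies))  # -> [10]
```

This is deliberately naive (no seasonality, no trend handling), but running it continuously over windowed metrics from the streaming layer already catches the gross failures faster than waiting for a human to notice a dashboard.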

Tags: Docker, DevOps, Kafka, Real-time Data, ELK, Spark
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
