
How Qunar Migrated to StarRocks: Architecture, Performance Gains & Best Practices

This article details Qunar's transition to StarRocks as a unified OLAP engine, covering the business background, engine evaluation, architecture redesign, observability, high‑availability strategies, query‑performance optimizations, real‑world application cases, community contributions, and future plans.


Qunar's next‑generation data platform replaces a fragmented stack of Trino, Presto, Druid, Impala, Kudu, Iceberg, ClickHouse and others with StarRocks as a single, high‑performance OLAP engine. The new cluster spans dozens of nodes, serves millions of daily page‑views, and delivers sub‑second latency for external tables and millisecond latency for internal tables.

Business Background

Qunar's data platform supports a variety of products—QBI dashboards, quality‑inspection systems, ad‑hoc SQL analysis, Fun Analysis, offline audience building, and real‑time marketing. Different workloads historically relied on different engines:

Trino/Presto : Hive‑based analytics for dashboards and ad‑hoc queries.

Impala/Kudu : Real‑time ingestion from business‑system events, with Hive for batch and Kudu for low‑latency federation.

Druid : Kafka‑driven real‑time ingestion for dashboards.

ClickHouse : Offline Hive imports for user analysis and offline audience.

The multi‑engine architecture caused compatibility issues, performance bottlenecks, and high operational costs, prompting a redesign.

Engine Selection & Evaluation

StarRocks, ClickHouse, Trino and Kylin were benchmarked across scenario coverage, query performance, and operational difficulty. StarRocks excelled and became the chosen solution.

1.1 Scenario Coverage

StarRocks supports both batch and streaming ingestion, complex joins, and high‑concurrency workloads.

1.2 Write‑Capability Comparison

StarRocks : Real‑time ingestion from Kafka/Flink (Routine Load and Stream Load) and native UPSERT support through the primary‑key table model.

ClickHouse : Optimized for bulk loads (INSERT INTO SELECT) but requires Kafka engine tables + materialized views for streaming.

Trino : Pure query layer; writes depend on the underlying storage (HDFS/S3), so it cannot optimize data layout.

Kylin : Cube‑based jobs with hour‑ or day‑level latency; no real‑time updates.
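Stream Load, mentioned above as StarRocks' streaming‑write path, is an HTTP PUT against the cluster. As a minimal sketch (the host, database, and table names here are hypothetical, and only the request pieces are assembled so no live cluster is needed):

```python
import base64

def build_stream_load_request(fe_host, db, table, rows,
                              user="root", password="", label=None):
    """Assemble URL, headers, and CSV body for a StarRocks Stream Load call.

    Stream Load is invoked as an HTTP PUT; this sketch only builds the
    request pieces so they can be inspected without sending anything.
    """
    url = f"http://{fe_host}:8030/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",      # required by the Stream Load protocol
        "column_separator": ",",
        "format": "csv",
    }
    if label:
        headers["label"] = label       # idempotency label for safe retries
    body = "\n".join(",".join(str(v) for v in row) for row in rows)
    return url, headers, body
```

Passing a unique `label` per batch lets a failed request be retried without creating duplicate rows, which is what makes this path suitable for the exactly‑once sinks discussed later.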

1.3 Operational Cost Considerations

StarRocks reduces the number of services to manage, simplifying deployment and maintenance.

StarRocks Overview

StarRocks is a next‑generation, MPP‑style distributed database designed for OLAP workloads. It consists of a lightweight front‑end (FE) for query parsing and planning, and back‑ends (BE) plus compute nodes (CN) for execution. When data resides locally, BE nodes are used; for object‑store or HDFS data, CN nodes handle computation. The system is self‑contained, horizontally scalable, and includes metadata replication for high availability.

New Data Platform Architecture

StarRocks now serves as the unified compute engine, replacing all previous engines. The redesign simplifies operations, cuts costs, and boosts overall query performance.

Observability – Metric Monitoring

StarRocks metrics were integrated with Qunar's internal monitoring system. Key steps included:

Enriching StarRocks with additional metrics (e.g., HDFS/MetaStore latency).

Deploying a dedicated collector to stream metrics to the central watcher service.

Running resident agents on each node to automatically handle alerts and perform self‑healing actions such as node restarts.

These measures provide real‑time health visibility and rapid fault recovery.
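The resident agent's decision logic can be sketched as a small pure function (thresholds, metric names, and actions here are illustrative assumptions, not Qunar's actual configuration):

```python
def decide_action(metrics, max_hdfs_latency_ms=5000, heartbeat_ok=True):
    """Toy decision logic for a per-node resident agent.

    Escalates from healthy -> alert -> restart depending on whether the
    node heartbeat is alive and whether HDFS latency breaches a threshold.
    """
    if not heartbeat_ok:
        return "restart_node"          # self-healing: node is unresponsive
    if metrics.get("hdfs_latency_ms", 0) > max_hdfs_latency_ms:
        return "alert"                 # degraded but alive: notify watcher
    return "healthy"
```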

High Availability – Cluster Disaster Recovery

Two independent clusters (StarRocks‑1 and StarRocks‑2) were built. A unified query service routes dashboard traffic to StarRocks‑1 and mail‑service traffic to StarRocks‑2, achieving physical isolation. During off‑peak hours, idle CN nodes in StarRocks‑1 are shut down and their resources re‑allocated to StarRocks‑2 to meet night‑time demand. Resource groups further isolate workloads, preventing contention.
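The unified query service's routing described above reduces to a lookup plus a fail‑over check. A minimal sketch, assuming a simple workload‑to‑cluster map and a health dictionary (both hypothetical):

```python
def route(workload, cluster_up):
    """Pick a target cluster for a workload, failing over if it is down."""
    primary = {"dashboard": "starrocks-1", "mail": "starrocks-2"}
    backup = {"starrocks-1": "starrocks-2", "starrocks-2": "starrocks-1"}
    target = primary.get(workload, "starrocks-1")   # default to cluster 1
    if not cluster_up.get(target, False):
        target = backup[target]                     # automatic fail-over
    return target
```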

Query Performance – Optimizations

Performance tuning focused on the SQL planning pipeline:

Deep analysis of the full pipeline: lexical and syntactic analysis, logical‑plan generation, logical‑plan rewriting, cost‑based optimization (CBO), and physical‑plan creation.

Identifying that logical‑plan generation and rewriting have the greatest impact on execution plans.

Implementing constant‑folding rules (e.g., jodatime_format) to pre‑compute expressions.

Applying partition pruning and other rewrite rules to minimize data scans.
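Constant folding of the kind described above can be illustrated on a toy expression tree (the three‑tuple AST shape is an assumption for illustration, not StarRocks' internal representation):

```python
def fold(node):
    """Recursively fold constant subtrees of a tiny expression AST.

    A node is either a literal (int), a column reference (str), or a tuple
    (op, left, right). Subtrees whose operands are all constants are
    pre-computed at "plan time", mirroring rules like folding a
    date-format call over constant arguments.
    """
    if not isinstance(node, tuple):
        return node                          # literal or column reference
    op, left, right = node
    left, right = fold(left), fold(right)    # fold children first
    if isinstance(left, int) and isinstance(right, int):
        return {"+": left + right, "-": left - right, "*": left * right}[op]
    return (op, left, right)                 # still depends on a column
```

For example, `("+", ("*", 2, 3), "price")` folds to `("+", 6, "price")`: the constant subtree is evaluated once during planning instead of per row during execution.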

These changes reduced query P95 latency from 5.7 s (Feb) to 2.4 s (May) for the QBI dashboard.

Application Practice – QBI Dashboard

The QBI dashboard serves the entire company with millions of PV/month. Migration began in March and completed in April, cutting P95 latency from 5.7 s to 2.4 s (roughly a 2.4× speedup).

Challenges & Solutions

Result Consistency : Developed an intelligent comparison toolchain that captures full SQL, replays it, and performs multi‑dimensional result validation (numeric tolerance, NULL equivalence, sorting normalization). Automated root‑cause analysis classifies mismatches.

Dynamic Routing : Implemented smart routing based on consistency checks, enabling seamless dual‑engine switching and automatic fail‑over.

Syntax Compatibility : Extended StarRocks to achieve 99 % compatibility with Trino syntax, covering complex functions, DDL enhancements (CTAS, INSERT INTO), and date‑time functions (dow/week, date_parse, from_iso8601_timestamp).
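The result‑consistency validation above (numeric tolerance, NULL equivalence, sorting normalization) can be sketched as a row‑set comparator; this is an illustrative simplification, not Qunar's actual toolchain:

```python
import math

def normalize(rows):
    """Sort rows so engines that return ORDER-BY-free results in different
    orders can still be compared deterministically (NULLs sort last)."""
    return sorted(rows, key=lambda r: tuple((v is None, v) for v in r))

def rows_match(a, b, tol=1e-6):
    """Compare two result sets with NULL equivalence and float tolerance."""
    if len(a) != len(b):
        return False
    for ra, rb in zip(normalize(a), normalize(b)):
        if len(ra) != len(rb):
            return False
        for va, vb in zip(ra, rb):
            if va is None and vb is None:
                continue                     # NULL == NULL for our purposes
            if isinstance(va, float) or isinstance(vb, float):
                if va is None or vb is None or not math.isclose(va, vb, abs_tol=tol):
                    return False             # numeric drift beyond tolerance
            elif va != vb:
                return False
    return True
```

A dual‑engine harness would replay the captured SQL against both engines, run each result pair through `rows_match`, and feed mismatches to the classification step.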

Application Practice – Fun Analysis Migration

Fun Analysis is a self‑service multidimensional tool that combines offline Hive tables and real‑time Kudu streams. Migration steps included:

Dual‑write to StarRocks and Kudu with consistency checks.

Adding routing metadata to the query layer for engine selection.

Performance testing and syntax validation before cut‑over.

Low‑latency requirements were met by optimizing Flink pipelines: Kafka source → data transform → StarRocks sink (batching via Stream Load). Grouping data by project ensures timely batch thresholds, and metrics collection monitors latency.
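The per‑project batching step in that sink can be sketched as a buffer that flushes on either a row‑count or an age threshold (class name, thresholds, and the injected clock are illustrative assumptions):

```python
import time
from collections import defaultdict

class ProjectBatcher:
    """Group incoming records by project and flush a batch when either a
    row-count or an age threshold is reached, mirroring the batching step
    in front of a Stream Load call."""

    def __init__(self, max_rows=100, max_age_s=5.0, clock=time.monotonic):
        self.max_rows, self.max_age_s, self.clock = max_rows, max_age_s, clock
        self.buffers = defaultdict(list)
        self.first_seen = {}           # when each project's buffer started
        self.flushed = []              # (project, rows) batches ready to load

    def add(self, project, row):
        buf = self.buffers[project]
        if not buf:
            self.first_seen[project] = self.clock()
        buf.append(row)
        self._maybe_flush(project)

    def _maybe_flush(self, project):
        buf = self.buffers[project]
        age = self.clock() - self.first_seen[project]
        if len(buf) >= self.max_rows or age >= self.max_age_s:
            self.flushed.append((project, list(buf)))   # hand off one batch
            buf.clear()
```

Keying the buffers by project is what "grouping data by project" buys: a busy project reaches its row threshold quickly, while a quiet one is still flushed by the age threshold, keeping end‑to‑end latency bounded.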

StarRocks Contributions & Community

During the rollout, Qunar contributed several upstream improvements:

Raised Trino compatibility from 90 % to 99 %.

Enhanced the optimizer with syntax‑aware pruning.

Added custom parameters and metrics for self‑protection.

Submitted multiple pull‑requests to the StarRocks project.

StarRocks is an LF‑hosted open‑source Lakehouse engine (Apache‑2.0), with over 10 k GitHub stars and a global community of thousands of users across finance, retail, travel, gaming, and manufacturing.

Future Plans

Qunar intends to deploy StarRocks on a self‑managed Kubernetes cluster to leverage container orchestration for scalability and resilience. Additionally, materialized‑view layering and synchronous updates will be used to further accelerate real‑time analytics.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Migration, Observability, High Availability, StarRocks, Data Platform, OLAP
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
