How Ctrip Scaled UBT Analytics by Migrating from ClickHouse to StarRocks
Ctrip's User Behavior Tracking (UBT) system, handling 30 TB of daily data, moved from ClickHouse to StarRocks' compute‑storage separated architecture, cutting average query latency from 1.4 seconds to 203 ms, halving storage, reducing nodes from 50 to 40, and boosting write throughput to 3 million rows per second.
Background
UBT (User Behavior Tracking) is Ctrip's core system for collecting and analyzing user events across Android, iOS, and NodeJS clients, generating about 30 TB of new data each day and retaining it for up to a year. Typical queries involve UID/VID‑based log detail lookups and aggregate statistics.
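To make those query shapes concrete, here is a hypothetical sketch of both patterns in SQL; the table and column names (ubt_events, vid, page_id, and so on) are illustrative placeholders, not Ctrip's actual UBT schema.

```sql
-- Hypothetical detail lookup: all events for one visit (VID) in a narrow time window.
SELECT event_time, page_id, action, payload
FROM ubt_events
WHERE vid = '9f2a1c-example-vid'
  AND event_time BETWEEN '2024-05-01 10:00:00' AND '2024-05-01 11:00:00'
ORDER BY event_time;

-- Hypothetical aggregate: hourly event counts per page over one day.
SELECT date_trunc('hour', event_time) AS hr, page_id, count(*) AS events
FROM ubt_events
WHERE event_time >= '2024-05-01 00:00:00'
  AND event_time <  '2024-05-02 00:00:00'
GROUP BY date_trunc('hour', event_time), page_id
ORDER BY hr, events DESC;
```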
Original Architecture and Pain Points
Data entering UBT are first written to Kafka, then consumed via two paths:
goHangout → ClickHouse for real‑time troubleshooting and simple statistics.
Flink → Hive for cold storage and complex analytics.
ClickHouse presented several critical issues:
Write problems: data loss and consumption backlog, especially for large historical back‑fills that triggered heavy partition compaction, exhausting CPU and I/O.
Horizontal scaling: ClickHouse’s monolithic compute‑storage design required costly data migration during scaling, and high hardware requirements plus multi‑replica redundancy inflated storage costs.
Performance bottlenecks: long‑range time‑window queries were slow, failing to meet real‑time analysis needs.
goHangout also suffered from single‑process stability issues and cumbersome configuration.
Migration to StarRocks
The team chose StarRocks for its compute‑storage separation. Earlier StarRocks releases used an integrated architecture, but the newer shared‑data mode decouples stateless compute nodes from remote object storage (OSS/S3), allowing on‑demand scaling while keeping only a single local data copy, with durability delegated to the object store.
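In shared‑data StarRocks, the remote store is registered as a storage volume that databases and tables then reference. Below is a minimal sketch, assuming an S3‑compatible bucket; the volume name, bucket, endpoint, and credentials are placeholders, and the exact property keys vary by version.

```sql
-- Sketch only: register an S3-compatible bucket as the default storage volume
-- for a shared-data cluster. Bucket, endpoint, and credentials are placeholders.
CREATE STORAGE VOLUME ubt_volume
TYPE = S3
LOCATIONS = ('s3://ubt-warehouse/starrocks/')
PROPERTIES (
    "aws.s3.region"     = "cn-example-1",
    "aws.s3.endpoint"   = "https://oss-example.example.com",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
);

-- New databases and tables will use this volume unless told otherwise.
SET ubt_volume AS DEFAULT STORAGE VOLUME;
```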
Storage Design
Partition key switched from business time (ClickHouse) to system time (StarRocks), writing only to the current partition and eliminating fragmentation.
Hourly partitions with 128 buckets improve concurrent writes.
Compression format changed from LZ4 to ZLIB, saving ~30 % storage.
DataCache sized for 7‑day scans, covering >90 % of queries (a DDL sketch follows below).
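Taken together, the storage design might translate into a table definition along these lines; the schema is illustrative rather than Ctrip's actual table, and property names can differ across StarRocks versions.

```sql
-- Sketch of the storage design described above; columns, keys, and property
-- names are illustrative and may differ by StarRocks version.
CREATE TABLE ubt_events (
    sys_time DATETIME NOT NULL,   -- system (ingestion) time used for partitioning
    uid      VARCHAR(64),
    vid      VARCHAR(64),
    page_id  VARCHAR(128),
    payload  STRING
)
DUPLICATE KEY (sys_time, uid)
PARTITION BY date_trunc('hour', sys_time)   -- hourly partitions; only the current hour is written
DISTRIBUTED BY HASH (uid) BUCKETS 128       -- 128 buckets to spread concurrent writes
PROPERTIES (
    "compression"      = "ZLIB",   -- ~30% smaller than LZ4 for this workload
    "datacache.enable" = "true"    -- local cache sized to hold ~7 days of hot data
);
```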
Compaction Optimization
Control the number of data files and the compaction thread count.
Leverage hourly partitions and bucket design to reduce compaction frequency.
Enable MergeCommit during writes to merge small versions into larger ones, cutting file count and I/O.
Monitoring relies on the native_compactions and partitions_metas system tables plus the SHOW PROC compactions command, with Grafana dashboards used to keep the Compaction Score below 100.
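A sketch of the monitoring side, assuming StarRocks 3.x shared‑data system tables; the exact table and column names (for example the compaction‑score field, assumed here as MAX_CS) vary by version.

```sql
-- In-flight compaction tasks on compute nodes (shared-data mode).
SHOW PROC '/compactions';

-- Per-partition compaction pressure; table and column names assumed from
-- StarRocks 3.x (the text above calls the table partitions_metas).
-- Alert when the score approaches 100.
SELECT DB_NAME, TABLE_NAME, PARTITION_NAME, MAX_CS
FROM information_schema.partitions_meta
ORDER BY MAX_CS DESC
LIMIT 20;
```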
Data Migration Strategy
Historical data (~300 TB) was moved with Spark Load (Hive → StarRocks). Compared with Stream Load (GB‑scale) and Broker Load (TB‑scale), Spark Load handles TB‑plus volumes with minimal cluster impact: data is cleaned in Spark, written to HDFS, and then pulled by the StarRocks CN nodes, bypassing the MemoryStore and reducing compaction load.
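A hedged sketch of what one Spark Load job for a single historical day could look like; the label, HDFS path, Spark resource, and column list are placeholders, not Ctrip's actual job.

```sql
-- Sketch only: bulk-load one day of historical Hive/HDFS data via Spark Load.
-- Label, HDFS path, resource name, and columns are placeholders.
LOAD LABEL ubt_db.ubt_hist_2024_05_01
(
    DATA INFILE("hdfs://nameservice1/user/hive/warehouse/ubt.db/events/d=2024-05-01/*")
    INTO TABLE ubt_events
    FORMAT AS "orc"
    (sys_time, uid, vid, page_id, payload)
)
WITH RESOURCE 'spark_etl_resource'
(
    "spark.executor.memory"    = "8g",
    "spark.executor.instances" = "40"
)
PROPERTIES ("timeout" = "14400");
```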
Real‑time Incremental Ingestion
Flink now writes directly to StarRocks with MergeCommit enabled. MergeCommit aggregates many small load requests into a single commit, reducing the number of data versions and I/O operations while improving throughput. Tests show IOPS dropping from ~140 to below 10 and roughly a 10× increase in average write I/O size.
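With the StarRocks Flink connector, the sink can be declared in Flink SQL and Stream Load parameters forwarded through sink.properties.*. The sketch below is an assumption‑laden illustration: the merge‑commit parameter names (enable_merge_commit, merge_commit_interval_ms) may differ by StarRocks and connector version, connection values are placeholders, and the Kafka source table is assumed to be defined elsewhere.

```sql
-- Flink SQL sink sketch using the StarRocks connector.
-- Connection values are placeholders; the two merge-commit properties are
-- assumed names and may differ by StarRocks / connector version.
CREATE TABLE ubt_events_sink (
    sys_time TIMESTAMP(3),
    uid      STRING,
    vid      STRING,
    page_id  STRING,
    payload  STRING
) WITH (
    'connector'     = 'starrocks',
    'jdbc-url'      = 'jdbc:mysql://fe-host:9030',
    'load-url'      = 'fe-host:8030',
    'database-name' = 'ubt_db',
    'table-name'    = 'ubt_events',
    'username'      = 'etl_user',
    'password'      = '***',
    'sink.properties.enable_merge_commit'      = 'true',
    'sink.properties.merge_commit_interval_ms' = '1000'
);

-- kafka_ubt_source is a hypothetical Kafka-backed source table defined elsewhere.
INSERT INTO ubt_events_sink
SELECT sys_time, uid, vid, page_id, payload FROM kafka_ubt_source;
```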
Results and Benefits
After migration:
Average query time fell from 1.4 s to 203 ms (≈1/7).
P95 latency dropped from 8 s to 800 ms (≈1/10).
Storage reduced from 2.6 PB (ClickHouse, 2‑replica) to 1.2 PB (StarRocks, 1‑replica).
Node count decreased from 50 to 40.
Write throughput stabilized at 3 million rows/second.
Future plans include completing the migration of all compute‑storage integrated clusters to a fully separated architecture and extending the lake‑house model for broader query flexibility.
Key Takeaways
Switching to a compute‑storage separated engine like StarRocks can dramatically improve query latency, reduce storage costs, simplify scaling, and enhance write performance for massive event‑tracking workloads.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.