
Migrating Tencent Music's Data Infrastructure from ClickHouse and Druid to StarRocks: Strategy, Implementation, and Best Practices

This article details how Tencent Music’s data‑infrastructure team migrated thousands of ClickHouse and Druid nodes to a StarRocks compute‑storage‑separated lakehouse, achieving 40‑50% cost reduction while maintaining query performance, and shares the technical challenges, solutions, and best‑practice recommendations gathered during the process.

Wukong Talks Architecture

Tencent Music Entertainment Group, a leading online music service in China, operates a unified data platform that supports data ingestion, processing, querying, and governance for products such as QQ Music, Kugou Music, and others.

Background: In 2023, the data‑infrastructure team migrated more than a thousand ClickHouse and Druid cluster nodes to a StarRocks compute‑storage‑separated architecture, cutting costs by 40‑50% while keeping query performance stable.

Architecture Evolution: The previous architecture used ClickHouse for analytical workloads (thousands of nodes, daily data volume in the hundreds of billions of records) and Apache Druid for monitoring and real‑time multi‑dimensional analysis (over 10 PB total). Limitations of ClickHouse (lack of lakehouse focus, tightly coupled compute and storage) and Druid (segment bottlenecks, slow recovery) prompted the search for a new solution.

Lakehouse Selection: After evaluating query performance, operational difficulty, community activity, lakehouse capabilities, and resource isolation, the team chose StarRocks as the compute engine because its product roadmap aligns with Tencent Music’s lakehouse vision.

Migration Strategy: The migration was performed in stages, starting with a low‑impact ClickHouse real‑time subset and using a gray‑release (canary‑style) rollout to allow rapid error correction. The roadmap included:

ClickHouse migration implementation

Druid migration implementation

ClickHouse Migration Challenges & Solutions:

Data persistence difference – ClickHouse writes data to local disks, while StarRocks stores it in Tencent Cloud Object Storage (COS); this required adjustments to write strategies and COS configuration.

SQL compatibility – an AST‑based SQL Rewriter was built to convert ClickHouse‑specific syntax (window clauses, ARRAY JOIN, etc.) to StarRocks syntax. The tool is open‑sourced as StarRocks SQLTransformer.
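To illustrate the kind of conversion involved, here is a deliberately simplified, regex‑based sketch. The real SQLTransformer rewrites at the AST level, which handles nesting and quoting correctly; the function mapping and the UNNEST alias below are illustrative assumptions, not the tool's actual rules.

```python
import re

# Hypothetical, tiny subset of ClickHouse-to-StarRocks function renames.
FUNCTION_RENAMES = {
    "toDate": "to_date",  # ClickHouse toDate(x) -> StarRocks to_date(x)
}

def rewrite_clickhouse_sql(sql: str) -> str:
    """Rewrite a small subset of ClickHouse syntax into StarRocks syntax."""
    out = sql
    for ch_name, sr_name in FUNCTION_RENAMES.items():
        # Replace only whole function names followed by an open parenthesis.
        out = re.sub(rf"\b{ch_name}\s*\(", f"{sr_name}(", out)
    # ClickHouse `ARRAY JOIN arr` roughly maps to a lateral UNNEST join;
    # the alias naming here is illustrative only.
    out = re.sub(
        r"\bARRAY\s+JOIN\s+(\w+)",
        r"CROSS JOIN UNNEST(\1) AS unnested(\1)",
        out,
        flags=re.IGNORECASE,
    )
    return out

print(rewrite_clickhouse_sql("SELECT toDate(ts), tag FROM events ARRAY JOIN tags"))
```

An AST‑based approach avoids the pitfalls visible even here (string literals containing keywords, nested expressions), which is why the production tool parses the query rather than pattern‑matching it.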

Data consistency during migration – a ClickHouse JDBC Catalog was contributed to StarRocks to enable hybrid queries across migrated and unmigrated data.

During the migration, a fusion query gateway automatically routed queries to the appropriate cluster, keeping the process transparent to users.
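A routing decision of this kind can be sketched as follows. The registry, table names, and return values are assumptions for illustration, not Tencent Music's actual gateway: the idea is that a query goes to StarRocks only once every table it touches has migrated, while mixed queries can still be served through the ClickHouse JDBC Catalog mentioned above.

```python
import re
from typing import Set

# Hypothetical registry of tables that have completed migration.
MIGRATED_TABLES: Set[str] = {"play_events", "user_profiles"}

def tables_in(sql: str) -> Set[str]:
    """Crude extraction of table names after FROM/JOIN (illustrative only)."""
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([a-zA-Z_]\w*)", sql, re.IGNORECASE))

def route(sql: str) -> str:
    referenced = tables_in(sql)
    if referenced <= MIGRATED_TABLES:
        return "starrocks"           # fully migrated: serve from StarRocks
    if referenced.isdisjoint(MIGRATED_TABLES):
        return "clickhouse"          # untouched: keep serving from ClickHouse
    return "starrocks-jdbc-catalog"  # mixed: hybrid query via the JDBC catalog

print(route("SELECT count(*) FROM play_events"))
```

A real gateway would use the database's own parser for table extraction; the point is the three‑way routing, which is what makes an incremental, zero‑downtime cutover possible.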

Druid Migration Implementation:

Enable full‑copy dual‑write.

Generate StarRocks table DDL and Routine Load statements from Druid metadata via scripts.

Download Druid segment files, decode to CSV, and ingest into StarRocks using Stream Load.
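The final step above ends with a Stream Load, which in StarRocks is an HTTP PUT against the frontend node. The following sketch builds such a request; host, port, credentials, database, and table names are placeholders, not values from the article.

```python
import base64
import urllib.request

def build_stream_load_request(host: str, db: str, table: str,
                              csv_bytes: bytes, label: str,
                              user: str = "root",
                              password: str = "") -> urllib.request.Request:
    """Build a Stream Load HTTP PUT request for a decoded CSV payload."""
    url = f"http://{host}:8030/api/{db}/{table}/_stream_load"
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {auth}",
        "label": label,             # idempotency label so a retried segment
                                    # load is not ingested twice
        "format": "csv",
        "column_separator": ",",
        "Expect": "100-continue",   # the FE redirects the body to a BE
    }
    return urllib.request.Request(url, data=csv_bytes, headers=headers, method="PUT")

req = build_stream_load_request("starrocks-fe", "music_dw", "druid_play_log",
                                b"1,song_a\n2,song_b\n", label="druid-seg-0001")
print(req.full_url)
```

Setting the label from the Druid segment identifier is what makes the backfill safely retryable: reloading a failed segment with the same label is rejected by StarRocks rather than duplicated.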

Migration Effects:

Performance: ClickHouse P99 latency (~8 s) remained unchanged after migration to StarRocks; Druid real‑time queries stayed around 4 s for large tables.

Cost: Storage cost halved; overall cost fell by 40% for the ClickHouse replacement and by more than 50% for the Druid replacement.

Operations: Cluster scaling time dropped from 1‑2 weeks to 1‑3 days; resource utilization improved from 75% to 85%.

Best Practices:

Distribute data across multiple COS buckets to avoid bandwidth limits.

Leverage StarRocks Storage Volume for flexible bucket allocation.

Optimize write throughput: adjust batchSize, tune write‑backpressure, and increase cluster thread counts.

Monitor and tune compaction scores; increase compaction threads when scores rise.

Configure local disk cache appropriately; avoid mismatched DataCache windows that cause excessive cold‑data reads.

Use materialized views to offload heavy calculations and reduce Kafka/OLAP traffic.
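The batchSize recommendation in the write‑throughput bullet above boils down to a simple pattern: buffer rows and flush once the batch is full, so each load request carries many rows instead of one. A minimal sketch, with hypothetical names and a pluggable flush callback standing in for the actual load call:

```python
from typing import Callable, List

class BatchingWriter:
    """Buffer rows and flush in fixed-size batches (illustrative sketch)."""

    def __init__(self, batch_size: int, flush_fn: Callable[[List[str]], None]):
        self.batch_size = batch_size
        self.flush_fn = flush_fn      # e.g. wraps a Stream Load call
        self.buffer: List[str] = []

    def write(self, row: str) -> None:
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches: List[List[str]] = []
w = BatchingWriter(batch_size=3, flush_fn=batches.append)
for i in range(7):
    w.write(f"row-{i}")
w.flush()  # push the final partial batch
print([len(b) for b in batches])  # [3, 3, 1]
```

A production writer would also flush on a time interval and apply backpressure (blocking writers when the downstream load lags), but the batch‑size knob alone is often the largest single throughput lever.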

Summary & Future Plans: The team completed the first real‑time cluster migration, achieving comparable P95 query performance and a 40% cost reduction. Future work includes migrating additional clusters, expanding materialized‑view‑driven ETL, and further deepening lakehouse integration.

Tags: data migration, performance optimization, StarRocks, ClickHouse, Druid, Lakehouse, cost reduction
Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.
