ClickHouse Real‑Time Analytics at QQ Music: Challenges, Solutions, and Tencent Cloud Best Practices
QQ Music replaced its slow Hive warehouse with a massive ClickHouse cluster, achieving sub‑second to ten‑second query latency on petabyte‑scale data, enabling real‑time analytics for non‑technical users, and following five operational best practices—ZooKeeper planning, idempotent writes, sensible partitions, read‑write separation, and localized joins—while leveraging Tencent Cloud’s managed ClickHouse service.
QQ Music faced challenges with its Hive-based offline data warehouse, including low timeliness (T+1 reports), low usability for product and operations teams, and inefficient processes requiring weeks to deliver analysis results.
To address these issues, QQ Music adopted a ClickHouse cluster with near‑million‑core scale, petabyte‑level storage, tens of trillions of records, and daily ingest of hundreds of billions of rows, enabling sub‑second to ten‑second query latency.
The adoption brought real‑time analytics and improved usability: complex queries run in seconds, and non‑technical users can build reports via Superset, with over half of the tens of thousands of charts created by product, R&D, operations, and finance staff.
Operational experience highlighted five key practices: (1) proper ZooKeeper cluster planning—use SSD‑backed machines once ZooKeeper data exceeds the TB level; (2) idempotent writes—retry a failed batch on the same shard so that duplicates are dropped rather than double‑counted; (3) sensible partition design—keep the partition count under 1,000, typically partitioning by day or month and avoiding fine‑grained ID‑based partitioning; (4) read‑write separation—ingest into temporary clusters and then switch or copy data over, isolating write load from online queries; (5) cross‑table query localization—co‑locate related IDs on the same shard so that costly GLOBAL IN/JOIN operations can be replaced with local ones.
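The idempotent‑write practice relies on block‑level deduplication: a replicated table remembers a hash of each inserted block, so retrying the exact same batch on the same shard is a no‑op. A minimal Python sketch of that mechanism (the `Shard` class and its fields are illustrative, not a ClickHouse API):

```python
import hashlib

class Shard:
    """Toy model of ClickHouse's replicated-table insert deduplication:
    a block is identified by a hash of its contents, and re-inserting an
    identical block on the same shard is dropped as a duplicate."""
    def __init__(self):
        self.rows = []
        self.seen_block_hashes = set()  # analogous to the dedup log kept in ZooKeeper

    def insert_block(self, rows):
        # Hash the whole batch; a retry of the same batch produces the same hash.
        block_hash = hashlib.sha256(repr(sorted(rows)).encode()).hexdigest()
        if block_hash in self.seen_block_hashes:
            return False  # duplicate retry, silently dropped
        self.seen_block_hashes.add(block_hash)
        self.rows.extend(rows)
        return True

shard = Shard()
batch = [(1, "song_a"), (2, "song_b")]
assert shard.insert_block(batch) is True   # first write lands
assert shard.insert_block(batch) is False  # retry is deduplicated
assert len(shard.rows) == 2                # no double-counting
```

This is why the article says to retry on the *same* shard: deduplication state lives per shard, so replaying the batch elsewhere would create duplicates.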
The article explains ClickHouse features such as materialized views (which act as triggers that update target tables on detail‑table changes) and the AggregatingMergeTree engine, which pre‑aggregates data during merges to speed up queries, illustrated with user‑online duration and BitMap examples for retention and ad‑targeting scenarios.
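The BitMap retention idea can be shown without ClickHouse at all: each day's active users form a bitmap, and day‑1 retention is the cardinality of the intersection of two days' bitmaps divided by day 0's cardinality. A toy Python version in the spirit of ClickHouse's `groupBitmap`/`bitmapAnd` functions (user IDs are made up for illustration):

```python
# Each day's active users as a "bitmap" of user IDs (modeled here as a set).
day0_active = {101, 102, 103, 104}
day1_active = {102, 104, 105}

# Day-1 retention: users active on day 0 who came back on day 1.
retained = day0_active & day1_active          # bitmapAnd
retention_rate = len(retained) / len(day0_active)

assert retained == {102, 104}
assert retention_rate == 0.5
```

In production the same intersection runs over compressed bitmaps per day (or per ad‑targeting segment), which is what makes retention and audience‑overlap queries cheap.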
ClickHouse’s core principles are outlined: insert‑only, fast SELECT, primary/secondary indexes, non‑blocking INSERT/SELECT, background merges, non‑unique primary key, and the LSM‑like MergeTree storage format that builds ordered parts and merges them in the background.
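The LSM‑like MergeTree write path described above can be sketched in a few lines: every INSERT creates a new sorted "part", and a background merge combines several parts into one larger sorted part via k‑way merge. The function names here are illustrative:

```python
import heapq

# Toy model of MergeTree storage: a list of sorted parts on disk.
parts = []

def insert(rows):
    """Each INSERT becomes its own ordered part; nothing is updated in place."""
    parts.append(sorted(rows))

def background_merge():
    """Background merge: k-way merge of all sorted parts into one."""
    merged = list(heapq.merge(*parts))
    parts.clear()
    parts.append(merged)

insert([(3, "c"), (1, "a")])
insert([(2, "b"), (4, "d")])
assert len(parts) == 2  # two small parts after two inserts

background_merge()
assert len(parts) == 1  # merged into a single ordered part
assert parts[0] == [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
```

This also shows why INSERT and SELECT do not block each other: reads scan the existing ordered parts while new inserts simply append fresh parts for a later merge.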
Performance tuning recommendations cover hardware (high‑frequency CPU, ample memory, SSD/NVMe for large data, RAID‑5/10/50), configuration parameters (max threads, memory limits, external sort/group‑by thresholds, background pool size), and usage tips (avoid full scans, use partition pruning, skip SELECT *, employ materialized views for dashboards, prefer IN over JOIN, batch writes, tune merge threads, set quotas, and manage ZooKeeper with SSD and auto‑cleanup).
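The "batch writes" tip follows directly from the part‑per‑insert model: many tiny inserts create many parts and heavy merge pressure, so clients should buffer rows and flush one large insert. A minimal client‑side sketch, assuming a flush callback and thresholds of our own invention (not a ClickHouse API):

```python
import time

class BatchingWriter:
    """Buffer rows client-side and flush one large insert instead of many
    small ones, keeping the number of MergeTree parts (and background
    merge load) low. flush_fn, max_rows, and max_age_s are illustrative."""
    def __init__(self, flush_fn, max_rows=10000, max_age_s=1.0):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.first_row_at = None

    def write(self, row):
        if not self.buffer:
            self.first_row_at = time.monotonic()
        self.buffer.append(row)
        # Flush on size or age, whichever comes first.
        if (len(self.buffer) >= self.max_rows or
                time.monotonic() - self.first_row_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one big INSERT per flush
            self.buffer = []

batches = []
writer = BatchingWriter(batches.append, max_rows=3)
for i in range(7):
    writer.write(i)
writer.flush()  # drain the tail on shutdown
assert batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Real deployments typically batch tens of thousands of rows per insert; the size/age trade‑off is a latency‑versus‑merge‑pressure knob.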
Tencent Cloud’s ClickHouse service, launched in April, offers minute‑level provisioning, second‑level monitoring, and flexible configuration; upcoming features include elastic scaling, data self‑balancing, a performance‑enhanced version addressing ZooKeeper bottlenecks, and COS integration.
The Q&A section clarifies pros and cons, fault‑tolerance via replication, file‑descriptor limits, COS usage for cold data, restart optimizations for large tables, high‑concurrency write tuning, data‑loss safeguards, ZooKeeper optimizations, materialized view vs. MergeTree performance, machine sizing for billions of rows, deep pagination, wide‑table suitability, MySQL migration considerations, Docker production use, timeout handling, and resource‑quota enforcement.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.