
How ClickHouse Powers a 700 B‑Row Real‑Time Data Platform at Ctrip

This article details how Ctrip's senior engineering manager leveraged ClickHouse to build a high‑availability, sub‑second response data platform handling nearly 700 billion rows, describing the motivations, architecture, data synchronization processes, performance gains, challenges, and practical recommendations for large‑scale analytics.

dbaplus Community

Why ClickHouse was chosen

Ctrip needed a storage engine that could handle ad hoc queries with non‑fixed conditions, rapidly growing data volumes (≈700 billion rows, 1.8 TB after compression), diverse business scenarios, and sub‑second response times. Traditional relational databases and Elasticsearch could not meet these requirements efficiently.

Key characteristics of ClickHouse

High compression ratio reduces storage cost.

Very fast bulk‑write speed (50‑200 MB/s per node) suitable for massive daily updates.

SQL‑compatible syntax (MySQL‑like) with column‑oriented storage and sparse indexes that fully utilize CPU and memory.

No left‑hand‑rule requirement for joins, but joins must be written with the larger table on the left (ClickHouse's default hash join builds its in‑memory table from the right‑hand side, so that side should be the smaller one) and may need intermediate temporary tables.

Limitations: no true ACID transactions, limited high‑concurrency handling, and join complexity when more than two tables are involved.

Data platform architecture

Data is extracted from Hive and loaded into ClickHouse via ETL jobs (≈3 000 jobs per day). Over 80 % of the analytical data resides in ClickHouse, providing a real‑time analytics layer for hotel services.

[Figure: Architecture diagram]

Full data synchronization workflow

Truncate temporary table A_temp and import the latest Hive data into it.

Rename the current table A to A_temp_temp.

Rename A_temp to A (making the new data live).

Rename A_temp_temp back to A_temp for the next cycle.
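The four steps above can be sketched as an ordered list of statements. This is a minimal sketch under stated assumptions, not Ctrip's job code: the helper name `full_sync_plan` is hypothetical, and the Hive import is left as a `--` placeholder line because the article does not describe how the data is loaded.

```python
from typing import List


def full_sync_plan(table: str) -> List[str]:
    """Return the ordered steps for one full-sync cycle of `table`.

    Lines starting with "--" are placeholders, not statements to execute.
    """
    temp = f"{table}_temp"         # staging table that receives fresh data
    parked = f"{table}_temp_temp"  # short-lived name for the outgoing table
    return [
        f"TRUNCATE TABLE {temp}",
        f"-- import the latest Hive data into {temp} (mechanism not specified)",
        f"RENAME TABLE {table} TO {parked}",  # take the old data out of service
        f"RENAME TABLE {temp} TO {table}",    # new data goes live
        f"RENAME TABLE {parked} TO {temp}",   # old table becomes next cycle's staging
    ]


for step in full_sync_plan("A"):
    print(step)
```

Because the live table is only renamed, never rewritten in place, readers see either the previous full copy or the new one, apart from the brief window between the two renames.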

Incremental data synchronization workflow

Truncate A_temp and import the most recent three months of Hive data.

Copy the rows older than three months from A into A_temp, so that A_temp now holds the complete dataset.

Rename A to A_temp_temp.

Rename A_temp to A.

Rename A_temp_temp to A_temp.
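The incremental variant differs only in how A_temp is filled. The sketch below is illustrative: the function names, the date column `d`, and the choice to align the three‑month cutoff to the first day of a month are all assumptions, since the article does not say how the boundary is computed.

```python
from datetime import date
from typing import List


def months_ago(day: date, months: int) -> date:
    """First day of the month that lies `months` months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def incremental_sync_plan(table: str, date_column: str, today: date) -> List[str]:
    """Ordered steps for one incremental cycle; "--" lines are placeholders."""
    temp = f"{table}_temp"
    parked = f"{table}_temp_temp"
    cutoff = months_ago(today, 3).isoformat()
    return [
        f"TRUNCATE TABLE {temp}",
        f"-- import Hive rows with {date_column} >= '{cutoff}' into {temp}",
        # Carry the untouched history over from the live table.
        f"INSERT INTO {temp} SELECT * FROM {table} WHERE {date_column} < '{cutoff}'",
        f"RENAME TABLE {table} TO {parked}",
        f"RENAME TABLE {temp} TO {table}",
        f"RENAME TABLE {parked} TO {temp}",
    ]


for step in incremental_sync_plan("A", "d", date(2024, 2, 15)):
    print(step)
```

Using the same cutoff for both the Hive import and the history copy keeps the two ranges from overlapping or leaving a gap.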

Cache‑based protection mechanisms

Two‑layer caching protects the ClickHouse cluster:

Active cache: After ETL jobs finish, a job sets a cache flag. Subsequent queries first check the cache (pre‑populated with hot data) before hitting ClickHouse.

Passive cache: User‑driven queries gradually populate the cache.

Active caching stores two or three copies of the data (Hive → MySQL → ClickHouse) and uses a monitoring job to ensure consistency before serving queries.
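The lookup order of the two layers might look like the following sketch. It is an in‑memory illustration only: the class name `TwoLayerCache`, its methods, and the decision to clear the passive layer after each ETL refresh are assumptions, and the article does not say which cache store Ctrip uses.

```python
from typing import Callable, Dict


class TwoLayerCache:
    """Active (pre-populated) layer first, passive (on-demand) layer second."""

    def __init__(self, query_clickhouse: Callable[[str], object]) -> None:
        self._query = query_clickhouse         # fallback that hits the cluster
        self._active: Dict[str, object] = {}   # hot data loaded after ETL
        self._passive: Dict[str, object] = {}  # results cached as users query
        self._active_ready = False             # the "cache flag" set after ETL

    def on_etl_finished(self, hot_results: Dict[str, object]) -> None:
        """Post-ETL job: load hot data, drop stale entries, set the flag."""
        self._active = dict(hot_results)
        self._passive.clear()  # assumption: old results are stale after a refresh
        self._active_ready = True

    def get(self, key: str) -> object:
        if self._active_ready and key in self._active:
            return self._active[key]
        if key in self._passive:
            return self._passive[key]
        result = self._query(key)    # only now does ClickHouse see the query
        self._passive[key] = result  # passive layer fills up gradually
        return result
```

With this ordering, repeated queries reach ClickHouse at most once per ETL cycle, which is what shields the cluster from bursts of identical requests.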

Cluster design

Instead of ClickHouse’s built‑in distributed mode (which requires ZooKeeper and can suffer node‑failure cascades), a “virtual cluster” approach is used:

Each virtual cluster consists of at least two machines located in different data‑centers for high availability.

Data is isolated per virtual cluster, allowing independent scaling, writes, and maintenance.

SSD storage is preferred to minimise restart times and I/O latency.

CPU and memory usage are continuously monitored; high CPU (>60 %) often indicates slow queries that need investigation.
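A small sketch of how a virtual cluster could be validated and queried. The `Node` type, the host names, and the routing policy (send the query to the healthy node with the lowest CPU) are assumptions made for illustration; the article states only the two‑machine, multi‑data‑center rule and the 60 % CPU signal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    host: str
    datacenter: str
    healthy: bool = True
    cpu_percent: float = 0.0


def is_valid_virtual_cluster(nodes: List[Node]) -> bool:
    """At least two machines, spread over at least two data centers."""
    return len(nodes) >= 2 and len({n.datacenter for n in nodes}) >= 2


def pick_node(nodes: List[Node]) -> Node:
    """Route to the least-loaded healthy node (hypothetical policy)."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy node in this virtual cluster")
    return min(candidates, key=lambda n: n.cpu_percent)


def needs_investigation(node: Node, threshold: float = 60.0) -> bool:
    """CPU above the threshold often points at a slow query."""
    return node.cpu_percent > threshold


cluster = [
    Node("ch-a1", "dc1", cpu_percent=72.0),
    Node("ch-a2", "dc2", cpu_percent=35.0),
]
print(pick_node(cluster).host, needs_investigation(cluster[0]))
```

Because each virtual cluster holds its own data, a check like this can run per cluster without any coordination service.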

[Figure: Virtual‑cluster diagram]

Performance impact

Before ClickHouse, tens of thousands of queries exceeded 15 seconds. After migration:

99 % of PC queries respond within 3 seconds.

Mobile queries taking longer than 2 seconds dropped from hundreds per day to only a few.

[Figure: Query latency before/after]

Practical lessons and recommendations

Choose partition keys carefully; fine‑grained partitions can create millions of small files and exhaust disk space.

Pre‑sort data on the partition key before loading to reduce the number of background merges.

Place the larger table on the left side of a join; monitor table sizes over time and swap the join order if the smaller table outgrows the other.

Adopt a distributed ClickHouse deployment only when a single node can no longer hold the data or sustain the CPU load.

Continuously monitor CPU and memory spikes; a sudden rise often signals a slow query that needs optimisation.

Prefer SSDs for storage to accelerate restarts and data loading.

Minimise redundant textual columns to reduce I/O overhead.

ClickHouse excels in large‑volume, controllable‑QPS analytics and log‑type workloads.
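To make the partition‑key advice concrete, the arithmetic below compares how many partitions different choices produce. The date range, the second key, and its cardinality of 1,000 are made‑up numbers for illustration; the point is only that every non‑empty partition keeps at least one data part on disk, so the count multiplies quickly.

```python
from datetime import date


def partition_count(start: date, end: date, granularity: str,
                    extra_key_cardinality: int = 1) -> int:
    """Upper bound on partitions covering [start, end], inclusive.

    `extra_key_cardinality` models a second column in the partition key.
    """
    if granularity == "day":
        periods = (end - start).days + 1
    elif granularity == "month":
        periods = (end.year - start.year) * 12 + (end.month - start.month) + 1
    else:
        raise ValueError(f"unsupported granularity: {granularity}")
    return periods * extra_key_cardinality


start, end = date(2021, 1, 1), date(2023, 12, 31)
print(partition_count(start, end, "month"))      # monthly partitions
print(partition_count(start, end, "day"))        # daily partitions
print(partition_count(start, end, "day", 1000))  # daily plus a second key
```

Three years of monthly partitions stay in the dozens, while daily partitions combined with a high‑cardinality second key pass a million, which is the small‑file explosion the first recommendation warns about.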
