Cold‑Hot Storage Architecture and Practices in Shopee ClickHouse
Shopee ClickHouse implements a cold‑hot storage architecture by mounting JuiceFS‑backed S3 as a remote volume. Table‑level storage policies and TTL‑driven background moves shift data from SSD to cold storage, yielding cost‑effective scalability and improved disk utilization, with performance validated by benchmarks despite operational challenges such as Redis memory pressure and small‑file handling.
Shopee ClickHouse is a high‑availability distributed analytical database derived from the open‑source ClickHouse engine. This article describes the cold‑hot storage architecture built on JuiceFS and S3, the implementation details, performance benchmarks, operational issues, and practical solutions.
Overall Cluster Architecture
ClickHouse provides a column‑store OLAP engine with vectorized execution; Shopee ClickHouse builds on it through in‑house secondary development and architectural evolution. The cluster consists of an SLB for request distribution, a Proxy layer for query routing, a ClickHouse DB cluster (both SSD‑based compute clusters and SATA‑based storage clusters), and a remote object store (Shopee STO S3) used for cold data.
Cold‑Hot Storage Solution
The cold‑hot design mounts remote object storage via the JuiceFS client, exposing it as a local path. Table‑level storage policies allow different teams to define their own hot‑cold thresholds. New tables specify the cold‑storage policy at creation, and existing tables are altered to use the policy. Data is moved between SSD and S3 based on TTL expressions.
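The server‑side configuration behind such a policy can be sketched as follows. This is an illustrative fragment, not the production config: the disk names, JuiceFS mount path, and layout are assumptions; only the volume names v_ssd and v_cold and the policy name hcs_ck mirror those used in the table definition below.

```xml
<storage_configuration>
    <disks>
        <!-- local SSD disk; ClickHouse's default data path -->
        <default></default>
        <!-- JuiceFS-mounted S3 exposed as a local path (mount point is an assumption) -->
        <jfs>
            <path>/mnt/jfs/</path>
        </jfs>
    </disks>
    <policies>
        <hcs_ck>
            <volumes>
                <!-- hot tier: parts land here first -->
                <v_ssd>
                    <disk>default</disk>
                </v_ssd>
                <!-- cold tier: TTL moves relocate parts here -->
                <v_cold>
                    <disk>jfs</disk>
                </v_cold>
            </volumes>
        </hcs_ck>
    </policies>
</storage_configuration>
```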
Typical storage policy configuration (simplified):
storage_policy = 'hcs_ck'

Example CREATE TABLE with TTL and storage policy:
CREATE TABLE db_name.hcs_table_name (
`log_time` DateTime64(3),
`log_level` String,
...
`create_time` DateTime DEFAULT now()
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/db_name.hcs_table_name', '{replica}')
PARTITION BY toYYYYMMDD(log_time)
ORDER BY (ugi, ip)
TTL toDateTime(log_time) TO VOLUME 'v_ssd',
toDateTime(log_time) + toIntervalDay(7) TO VOLUME 'v_cold',
toDateTime(log_time) + toIntervalDay(14) DELETE
SETTINGS index_granularity = 16384,
storage_policy = 'hcs_ck',
parts_to_throw_insert = 1600;Benchmark Results
Two storage back‑ends (S3 and Ozone) were benchmarked behind JuiceFS using the ClickHouse Star Schema Benchmark (derived from TPC‑H, scale factor 1000). Insert performance favored S3, while query latency varied significantly across storage configurations. Sample query latencies (seconds) are shown in the table below:
| Query No. | JuiceFS (S3+Ozone) | ⅓ JuiceFS + ⅔ SSD | SSD only |
|-----------|--------------------|-------------------|----------|
| Q1.1      | 8.884              | 8.966             | 1.417    |
| Q1.2      | 0.921              | 0.998             | 0.313    |
| Q2.1      | 68.148             | 36.273            | 5.450    |
| Q4.3      | 33.667             | 2.813             | 2.357    |
Based on overall cost, stability, functionality, performance and scalability, S3 was chosen as the cold storage medium.
Implementation Details
Data parts are moved by a dedicated background thread pool (background_move_pool), distinct from the general background pool. The mover selects parts whose TTL indicates they should be relocated, clones them onto the remote volume, and then detaches the originals. The move is transparent to queries because the active part always resides on exactly one volume, either SSD or S3.
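One way to observe this behavior from the outside is to check which disk each active part currently resides on. This is a hedged example: the database and table names come from the earlier DDL, and system.parts is the standard ClickHouse system table.

```sql
-- Which disk holds each active part of the tiered table?
SELECT
    partition,
    name,
    disk_name,      -- e.g. 'default' (SSD) or the JuiceFS-backed cold disk
    bytes_on_disk
FROM system.parts
WHERE database = 'db_name'
  AND table = 'hcs_table_name'
  AND active
ORDER BY partition;
```

Parts older than the seven‑day TTL should report the cold disk here while remaining queryable through the same table.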
Key code snippets:
std::optional<BackgroundProcessingPool> background_move_pool; // thread pool dedicated to background moves

MergeTreePartsMover::selectPartsForMove  // selects parts whose TTL marks them for relocation
MergeTreeData::moveParts                 // clones the selected parts onto the target volume
void MergeTreeData::swapActivePart(MergeTreeData::DataPartPtr part_copy) { ... } // swaps the clone in as the active part

Operational Issues and Solutions
Redis Memory Growth
JuiceFS stores its metadata in Redis, so a large number of small files can exhaust Redis memory and cause mount failures. Monitoring the clickhouse_merge metric and alerting when merge frequency drops below 0.5 per three hours helps detect the problem early.
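A lightweight way to approximate the merge‑frequency signal is to poll the number of currently executing merges. This is an illustrative query, not the exact pipeline behind the clickhouse_merge metric; system.metrics and its Merge entry are standard ClickHouse.

```sql
-- Number of background merges executing right now; sample this periodically
-- and alert if it stays near zero over the monitoring window.
SELECT value FROM system.metrics WHERE metric = 'Merge';
```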
Handling Excess Small Files
When many tiny parts have already landed on S3, stop further moves, locate the offending tables via error logs or system.parts, and either halt moves with SYSTEM STOP MOVES [db.]table or move parts back to SSD with ALTER TABLE table MOVE PARTITION ... TO VOLUME 'ssd_volume'.
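A sketch of that workflow in SQL, where the size threshold, partition ID, and the assumption that 'default' is the SSD disk are illustrative:

```sql
-- 1. Find tables with many tiny active parts already on cold storage
SELECT database, table, count() AS small_parts
FROM system.parts
WHERE active
  AND disk_name != 'default'     -- assumption: 'default' is the SSD disk
  AND bytes_on_disk < 1048576    -- assumption: parts under 1 MiB count as "small"
GROUP BY database, table
ORDER BY small_parts DESC;

-- 2. Stop further TTL-driven moves for an offending table
SYSTEM STOP MOVES db_name.hcs_table_name;

-- 3. Move an affected partition back to the SSD volume (partition ID is illustrative)
ALTER TABLE db_name.hcs_table_name MOVE PARTITION '20210910' TO VOLUME 'v_ssd';
```

Once the small parts are back on SSD and merged, moves can be resumed with SYSTEM START MOVES.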
JuiceFS S3 Read/Write Failures
Errors such as “send request to S3 host name certificate expired” are diagnosed via JuiceFS logs (e.g., grep 'juicefs' /var/log/messages) and resolved with the S3 team. Monitoring the juicefs_object_request_errors metric provides early alerts.
ClickHouse Server Startup Failures
Inconsistent TTL definitions between local .sql files and ZooKeeper can prevent server start. Ensure TTL changes are applied to both ZooKeeper and local metadata before restarting.
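To compare the two definitions before restarting, the local schema and the ZooKeeper copy can be inspected side by side. A hedged example: the ZooKeeper path follows the pattern from the earlier CREATE TABLE, with the {layer}-{shard} macros expanded for one illustrative shard.

```sql
-- Table definition as ClickHouse sees it locally (from the .sql metadata file)
SHOW CREATE TABLE db_name.hcs_table_name;

-- Replicated metadata (including TTL) stored in ZooKeeper;
-- substitute your own layer/shard values in the path
SELECT value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/0-1/db_name.hcs_table_name'
  AND name = 'metadata';
```

If the TTL expressions differ between the two outputs, align them before restarting the server.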
Suspicious Broken Parts
During move operations, ClickHouse creates a marker file (DELETE_ON_DESTROY_MARKER_PATH). If such parts accumulate beyond the suspicious broken parts threshold, the server refuses to start. Remedies include deleting the problematic table’s metadata and recreating it, or creating a force_restore_data flag under /flags and restarting.
Monitoring Metrics
JuiceFS: juicefs_object_request_errors
Redis: Memory usage
ClickHouse: clickhouse_merge metric
Benefits
After deploying cold‑hot tiering, average disk usage dropped from ~85% to ~75%, enabling two additional data services and storing >90 TB of compressed data on S3. The architecture also improves scalability and reduces storage costs.
Conclusion
The JuiceFS + S3 cold‑hot tiering solution provides flexible, cost‑effective storage for ClickHouse workloads. Ongoing work includes extending JuiceFS to HDFS and further separating storage and compute.
Shopee Tech Team
How do we innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team explores cutting‑edge technology concepts and applications with you.