Hot and Cold Data Separation in Big Data Systems
This article explains what hot and cold data are, why separating them reduces cost, and how heterogeneous and homogeneous architectures implement the separation. It covers Elasticsearch, HBase, AWS S3, and the cloud-based UltraWarm feature, illustrated with case studies of a network-behavior analysis system and an e-commerce order system.
In any software system, however complex the business logic, work on data ultimately reduces to create, read, update, and delete (CRUD) operations. The value of a record declines over time until the record is finally deleted. That value is measured by how often the record is queried or updated, and the pattern differs from one business scenario to another.
Based on usage frequency, data is classified into three stages: Hot, Warm, and Cold. Hot data is frequently accessed with response time requirements under 10 ms; Cold data is rarely updated, accessed occasionally, and can tolerate response times of 1–10 seconds. Warm data sits between the two and is usually merged into Hot or Cold to simplify the system.
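The classification above can be sketched as a small function. This is a minimal illustration only: the article gives latency targets but no numeric frequency cut-offs, so the two thresholds and the name `classify_tier` are hypothetical.

```python
from typing import Optional

# Hypothetical thresholds: the article does not state numeric cut-offs.
HOT_MIN_READS_PER_DAY = 100.0
COLD_MAX_READS_PER_DAY = 1.0


def classify_tier(reads_per_day: float, merge_warm_into: Optional[str] = None) -> str:
    """Return "hot", "warm", or "cold" for a record's access frequency.

    If merge_warm_into is "hot" or "cold", warm data is folded into that
    tier, mirroring the simplification described in the text.
    """
    if reads_per_day >= HOT_MIN_READS_PER_DAY:
        return "hot"
    if reads_per_day <= COLD_MAX_READS_PER_DAY:
        return "cold"
    return merge_warm_into or "warm"
```

Calling `classify_tier(10, merge_warm_into="cold")` folds the warm band into cold, which is the two-tier setup the rest of the article assumes.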
The primary purpose of distinguishing hot and cold data is cost control: hot data demands high‑performance hardware (CPU, memory, SSD), while storing large volumes of low‑frequency data on such hardware is prohibitively expensive. If data volume is small or cost is irrelevant, a single monolithic system (e.g., MySQL) may suffice.
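There are two common approaches to hot-cold separation. A "heterogeneous hot-cold system" uses two distinct storage systems with different characteristics, one for hot data and one for cold data. A "homogeneous hot-cold system" keeps both tiers inside a single system and places them on different classes of nodes or storage.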
Heterogeneous Hot‑Cold Systems
Key considerations include selecting appropriate hot and cold data stores, defining the time‑based split line, handling data migration, and managing cross‑system queries. Hot stores prioritize read/write performance (e.g., MySQL, Elasticsearch), while cold stores focus on low‑cost storage (e.g., HDFS, AWS S3) with a suitable query engine.
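One way to handle cross-system queries is to cut the requested time range at the split line and send each part to the matching store. The sketch below assumes a purely time-based split and half-open ranges; `split_query_range` and its parameters are illustrative names, not part of any library.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

Range = Tuple[datetime, datetime]  # half-open interval [start, end)


def split_query_range(
    start: datetime, end: datetime, now: datetime, hot_window: timedelta
) -> Dict[str, Optional[Range]]:
    """Split [start, end) at the hot-cold split line (now - hot_window).

    Returns the sub-range each store must answer, or None when a store
    is not involved in the query at all.
    """
    if start >= end:
        raise ValueError("start must be earlier than end")
    split_line = now - hot_window
    cold = (start, min(end, split_line)) if start < split_line else None
    hot = (max(start, split_line), end) if end > split_line else None
    return {"hot": hot, "cold": cold}
```

A caller would run the hot sub-range against the hot store, the cold sub-range against the cold store's query engine, and merge the two result sets. Queries that fall entirely on one side of the split line touch only one system.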
Case Study 1: Network Behavior Data Analysis System
The system retains six months of network logs, and 90% of queries target the most recent month. It keeps the latest 35 days as hot data in Elasticsearch for fast aggregation; older data is stored as Parquet files in AWS S3 and queried through AWS Athena. Spark Streaming writes incoming data to Elasticsearch and also backs it up to S3, and a nightly Spark job moves the previous day's backup into cold storage.
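The nightly housekeeping for such a system can be sketched as a planning function that decides which daily partitions to touch. Several details here are assumptions rather than facts from the case study: the bucket name, the `netlog-YYYY.MM.DD` index naming, the `dt=` partition layout, treating "six months" as 180 days, and dropping one expired Elasticsearch index per night.

```python
from datetime import date, timedelta
from typing import Dict

HOT_DAYS = 35          # hot window stated in the case study
RETENTION_DAYS = 180   # "six months", approximated as 180 days (assumption)

COLD_PREFIX = "s3://example-bucket/cold"  # hypothetical location


def nightly_plan(today: date) -> Dict[str, str]:
    """Plan one nightly run: which daily partitions to move or expire.

    The hot store keeps the HOT_DAYS most recent complete days, so the
    day that just fell out of the window is HOT_DAYS + 1 days back.
    """
    yesterday = today - timedelta(days=1)
    expired_hot = today - timedelta(days=HOT_DAYS + 1)
    expired_cold = today - timedelta(days=RETENTION_DAYS + 1)
    return {
        # Yesterday's S3 backup becomes a cold Parquet partition.
        "promote_to_cold": f"{COLD_PREFIX}/dt={yesterday.isoformat()}/",
        # The daily index that aged out of the hot window.
        "drop_hot_index": f"netlog-{expired_hot:%Y.%m.%d}",
        # The cold partition that aged out of overall retention.
        "delete_cold_partition": f"{COLD_PREFIX}/dt={expired_cold.isoformat()}/",
    }
```

Keeping the plan as pure date arithmetic, separate from the Spark and S3 calls that would execute it, makes the off-by-one behavior at the window edges easy to test.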
Case Study 2: E‑Commerce Transaction Order System
Orders are stored in MySQL (InnoDB) for the first 90 days, providing transactional guarantees and low‑latency queries. After that, data is migrated to HBase as a wide table, with Elasticsearch indexing for search. Order changes are captured via MySQL binlog, streamed through Kafka, processed by Spark Streaming, and written to both HBase and Elasticsearch. Daily jobs shift the hot‑cold split forward and purge old hot partitions.
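Because the change stream feeds two stores, replayed or duplicated events should not corrupt either one. The sketch below shows one possible way to make the dual write idempotent, using plain dictionaries as stand-ins for the HBase wide table and the Elasticsearch index. The `ChangeEvent` shape and the `version` field (for example a binlog position) are hypothetical; the article does not describe the event format.

```python
from dataclasses import dataclass
from typing import Any, Dict

Store = Dict[str, Dict[str, Any]]  # order_id -> wide row / document


@dataclass(frozen=True)
class ChangeEvent:
    order_id: str
    version: int             # hypothetical increasing position per order
    op: str                  # "insert", "update", or "delete"
    columns: Dict[str, Any]  # changed columns only


def apply_event(store: Store, ev: ChangeEvent) -> None:
    """Merge one change into a store; stale or duplicate events are ignored."""
    row = store.get(ev.order_id)
    if row is not None and ev.version <= row["_version"]:
        return  # already applied, or older than what we hold
    if ev.op == "delete":
        store[ev.order_id] = {"_version": ev.version, "_deleted": True}
        return
    # Start from the existing row unless it is missing or was deleted.
    merged = {} if row is None or row.get("_deleted") else dict(row)
    merged.update(ev.columns)
    merged["_version"] = ev.version
    store[ev.order_id] = merged


def dual_write(wide_table: Store, search_index: Store, ev: ChangeEvent) -> None:
    """Apply the same event to both stand-in stores."""
    apply_event(wide_table, ev)
    apply_event(search_index, ev)
```

With the version check in place, replaying a batch after a failed job leaves both stores unchanged, so the pipeline can retry without extra bookkeeping.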
Homogeneous Hot‑Cold Systems
To avoid the complexity of maintaining two separate stacks, some frameworks build hot-cold capabilities into a single system. Starting with Elasticsearch 5.0, nodes can be tagged with an attribute marking them as Hot, Warm, or Cold, so that indices can be allocated to either high-performance or low-cost nodes. Index Lifecycle Management (ILM), available from Elasticsearch 6.6, automates moving indices between these tiers according to policies.
Cloud providers also offer integrated solutions. On Amazon Elasticsearch Service (now Amazon OpenSearch Service), the UltraWarm feature keeps infrequently accessed indices on S3-backed storage while hot indices stay on provisioned nodes, and indices are migrated between the two tiers through API calls.
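As an illustration, an ILM policy body can be assembled as a plain Python dictionary and serialized to JSON for the `PUT _ilm/policy/<name>` endpoint. The phase timings, the policy contents, and the node attribute name `box_type` are assumptions chosen for this sketch. The attribute must match whatever custom attribute the cluster's nodes were actually started with.

```python
import json

# Illustrative policy: roll over daily, relocate to warm nodes after
# 35 days, delete after 180 days. All values are assumptions.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_age": "1d"}},
            },
            "warm": {
                "min_age": "35d",
                "actions": {
                    # Require nodes tagged with the custom attribute
                    # box_type=warm (attribute name is an assumption).
                    "allocate": {"require": {"box_type": "warm"}},
                },
            },
            "delete": {
                "min_age": "180d",
                "actions": {"delete": {}},
            },
        }
    }
}

# JSON body to send with: PUT _ilm/policy/<policy-name>
request_body = json.dumps(ilm_policy, indent=2)
```

The policy only takes effect for indices that reference it in their settings, typically through an index template, so the policy and the node tagging have to be rolled out together.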
Conclusion
The article surveys various hot‑cold separation strategies for big‑data workloads, noting that while homogeneous solutions simplify access patterns, heterogeneous cold stores (HDFS, S3) remain essential for long‑term, low‑cost archival and analytics.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
