Hot and Cold Data Separation in Big Data Systems

This article explains what hot and cold data are, why separating them reduces cost, and how heterogeneous and homogeneous architectures implement the separation. It covers Elasticsearch, HBase, AWS S3, and the cloud‑based UltraWarm feature, with case studies from a network‑behavior analysis system and an e‑commerce order system.

Big Data Technology & Architecture

For any software system, however complex the business, data handling ultimately comes down to CRUD operations. The value of data diminishes over time until the data is eventually deleted. Here, value is measured by how often the data is queried or updated, and that frequency varies across business scenarios.

Based on usage frequency, data is classified into three stages: Hot, Warm, and Cold. Hot data is frequently accessed with response time requirements under 10 ms; Cold data is rarely updated, accessed occasionally, and can tolerate response times of 1–10 seconds. Warm data sits between the two and is usually merged into Hot or Cold to simplify the system.
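As a rough illustration, the three stages can be expressed as a function of access frequency. The thresholds below (`hot_min` and `cold_max`) and the helper names are assumptions invented for this sketch, not values from the article; a real system would tune them per workload.

```python
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def classify(accesses_per_day: float,
             hot_min: float = 100.0,   # assumed threshold, illustrative only
             cold_max: float = 1.0     # assumed threshold, illustrative only
             ) -> Tier:
    """Assign a tier from how often a record is read or updated per day."""
    if accesses_per_day >= hot_min:
        return Tier.HOT
    if accesses_per_day <= cold_max:
        return Tier.COLD
    return Tier.WARM


def simplify(tier: Tier, merge_warm_into: Tier = Tier.COLD) -> Tier:
    """Collapse WARM into HOT or COLD, as a two-tier system would."""
    return merge_warm_into if tier is Tier.WARM else tier
```

Whether Warm is folded into Hot or into Cold is a design choice: merging into Hot favors latency, merging into Cold favors cost.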

The primary purpose of distinguishing hot and cold data is cost control: hot data demands high‑performance hardware (CPU, memory, SSD), while storing large volumes of low‑frequency data on such hardware is prohibitively expensive. If data volume is small or cost is irrelevant, a single monolithic system (e.g., MySQL) may suffice.

Hot‑cold separation is commonly implemented in one of two ways. A “heterogeneous hot‑cold system” uses two distinct systems with different storage characteristics, one for hot data and one for cold data. A “homogeneous hot‑cold system” handles both tiers inside a single system.

Heterogeneous Hot‑Cold Systems

Key considerations include selecting appropriate hot and cold data stores, defining the time‑based split line, handling data migration, and managing cross‑system queries. Hot stores prioritize read/write performance (e.g., MySQL, Elasticsearch), while cold stores focus on low‑cost storage (e.g., HDFS, AWS S3) with a suitable query engine.
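The time‑based split line and cross‑system queries can be sketched together: a query over a time range is cut at the split line, and each part is sent to the store that holds it. The function name `split_query` and the half‑open interval convention are assumptions made for this sketch.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

Range = Tuple[datetime, datetime]  # half-open interval [start, end)


def split_query(start: datetime, end: datetime, now: datetime,
                hot_days: int) -> Tuple[Optional[Range], Optional[Range]]:
    """Split [start, end) at the hot/cold boundary.

    Returns (cold_range, hot_range); either part is None when the
    query does not touch that store.
    """
    if start >= end:
        raise ValueError("start must be before end")
    split = now - timedelta(days=hot_days)  # data older than this is cold
    cold = (start, min(end, split)) if start < split else None
    hot = (max(start, split), end) if end > split else None
    return cold, hot
```

A caller would run the cold part against the cold store's query engine, run the hot part against the hot store, and merge the two result sets.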

Case Study 1: Network Behavior Data Analysis System

Six months of network logs are retained; 90 % of queries target the most recent month. The system keeps 35 days of hot data in Elasticsearch for fast aggregation, while older data is stored as Parquet files in AWS S3 and queried via AWS Athena. Spark Streaming writes incoming data to Elasticsearch and backs it up to S3; a nightly Spark job moves the previous day's backup to cold storage.
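The bookkeeping behind this setup might look like the sketch below. It assumes one Elasticsearch index per day with a hypothetical `netlog-YYYY.MM.DD` naming scheme and approximates six months as 180 days; neither detail comes from the article.

```python
from datetime import date, timedelta
from typing import Dict

HOT_DAYS = 35      # hot window from the case study
TOTAL_DAYS = 180   # "six months", approximated as 180 days (assumption)


def index_name(day: date) -> str:
    # Hypothetical daily index naming scheme.
    return f"netlog-{day:%Y.%m.%d}"


def nightly_plan(today: date) -> Dict[str, object]:
    """Describe what a nightly job would do when run on `today`."""
    return {
        # The previous day's S3 backup is the one moved to cold storage.
        "archive_day": today - timedelta(days=1),
        # The daily index that just left the hot window can be dropped.
        "drop_index": index_name(today - timedelta(days=HOT_DAYS)),
        # Cold data older than the overall retention period expires.
        "expire_cold_before": today - timedelta(days=TOTAL_DAYS),
    }


def backend_for(day: date, today: date) -> str:
    """Pick the store that serves queries for one day of data."""
    age = (today - day).days
    if age < 0 or age > TOTAL_DAYS:
        raise ValueError("day is outside the retention window")
    return "elasticsearch" if age < HOT_DAYS else "athena"
```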

Case Study 2: E‑Commerce Transaction Order System

Orders are stored in MySQL (InnoDB) for the first 90 days, providing transactional guarantees and low‑latency queries. After that, data is migrated to HBase as a wide table, with Elasticsearch indexing for search. Order changes are captured via MySQL binlog, streamed through Kafka, processed by Spark Streaming, and written to both HBase and Elasticsearch. Daily jobs shift the hot‑cold split forward and purge old hot partitions.
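The dual write at the end of that pipeline can be sketched as an idempotent upsert keyed by order ID. The event shape, the `SEARCH_FIELDS` list, and the in‑memory dictionaries standing in for HBase and Elasticsearch are all assumptions for illustration; the Kafka and Spark Streaming plumbing is omitted.

```python
from typing import Any, Dict

# In-memory stand-ins for the two stores (assumption for this sketch).
hbase_rows: Dict[str, Dict[str, Any]] = {}   # row key -> wide-table columns
es_docs: Dict[str, Dict[str, Any]] = {}      # doc id  -> searchable fields

SEARCH_FIELDS = ("user_id", "status", "created_at")  # hypothetical


def apply_change(event: Dict[str, Any]) -> None:
    """Apply one binlog-style change event to both stores.

    `event` uses a made-up shape:
    {"op": "insert" | "update" | "delete", "order_id": str, "row": {...}}
    """
    key = event["order_id"]
    if event["op"] == "delete":
        hbase_rows.pop(key, None)
        es_docs.pop(key, None)
        return
    # Upserts are idempotent, so replaying the same message is harmless.
    row = {**hbase_rows.get(key, {}), **event["row"]}
    hbase_rows[key] = row
    # Only the fields needed for search are indexed.
    es_docs[key] = {f: row[f] for f in SEARCH_FIELDS if f in row}
```

Keeping the wide table as the full record and indexing only a subset of fields mirrors the division of labor described above: one store holds the data, the other makes it searchable.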

Homogeneous Hot‑Cold Systems

To avoid the complexity of maintaining two separate stacks, some frameworks build hot‑cold capabilities into a single system. Since Elasticsearch 5.0, nodes can be tagged with custom attributes (for example hot, warm, or cold), and shard allocation filtering then places each index on high‑performance or low‑cost nodes accordingly. Index Lifecycle Management (ILM), available from Elasticsearch 6.6, automates the transition between tiers according to policies.
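A minimal sketch of what such an ILM policy body could look like, built here as a Python dictionary. The attribute name `box_type` is a custom node attribute chosen for this example, and the 35‑day and 180‑day ages are borrowed from the first case study purely for illustration.

```python
import json


def build_ilm_policy(warm_after: str = "35d", delete_after: str = "180d") -> dict:
    """Return an ILM policy body with hot, warm, and delete phases."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index once the current one is a day old.
                        "rollover": {"max_age": "1d"}
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        # Move shards to nodes started with node.attr.box_type=warm.
                        "allocate": {"require": {"box_type": "warm"}}
                    },
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    # The JSON body would be sent with: PUT _ilm/policy/<policy-name>
    print(json.dumps(build_ilm_policy(), indent=2))
```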

Cloud providers also offer integrated solutions. The UltraWarm feature of Amazon Elasticsearch Service (since renamed Amazon OpenSearch Service) keeps hot indices on provisioned nodes and stores less frequently accessed indices in S3‑backed storage; indices are migrated between the two tiers through API calls.
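For orientation, the migration calls are plain REST requests against the domain endpoint. The helper below only builds the method and path; the function name is hypothetical, and the path layout should be checked against the current service documentation before use.

```python
from typing import Tuple


def ultrawarm_request(index: str, target: str) -> Tuple[str, str]:
    """Build the (HTTP method, path) pair for an UltraWarm migration call."""
    if target not in ("warm", "hot"):
        raise ValueError("target must be 'warm' or 'hot'")
    if not index or "/" in index:
        raise ValueError("invalid index name")
    return "POST", f"/_ultrawarm/migration/{index}/_{target}"
```

Sending the request requires signed credentials for the domain, which is left out of this sketch.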

Conclusion

The article surveys various hot‑cold separation strategies for big‑data workloads, noting that while homogeneous solutions simplify access patterns, heterogeneous cold stores (HDFS, S3) remain essential for long‑term, low‑cost archival and analytics.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
