How Hadoop’s Tiered Storage Optimizes Data Based on Temperature

This article explains Hadoop’s tiered storage concept, describing how data is classified by temperature—hot, warm, cold, frozen—and automatically moved across disk and archive layers to optimize cost and performance, with examples from Hadoop versions and eBay’s large‑scale deployment.

MaGe Linux Operations

Apr 7, 2015

1. Hadoop and Its Promise

Commercial off‑the‑shelf hardware can be assembled into a Hadoop cluster that provides massive storage and compute capacity. Data is split into many parts, stored on individual machines, and the processing logic runs on the same machines, embodying the slogan “take compute to data”.

2. Data Temperature

Datasets start as “hot” when they are heavily accessed, then cool to “warm”, “cold”, and finally “frozen” as usage declines. Typical thresholds are:

Data age – Access frequency – Temperature

Under 7 days – ~20 accesses per day – HOT

7 days to 1 month – ~5 accesses per week – WARM

1 to 3 months – ~5 accesses per month – COLD

3 months to 3 years – ~2 accesses per year – FROZEN
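The age thresholds above can be sketched as a small classifier. This is only an illustration: the function name is hypothetical, a month is approximated as 30 days, and data older than three years is treated as FROZEN, none of which the article specifies. A real classifier would presumably also look at access frequency, not just age.

```python
from datetime import timedelta

# Upper age bounds taken from the thresholds listed above; a month is
# approximated as 30 days (an assumption made for this sketch).
_AGE_BOUNDS = [
    (timedelta(days=7), "HOT"),
    (timedelta(days=30), "WARM"),
    (timedelta(days=90), "COLD"),
]


def classify_by_age(age: timedelta) -> str:
    """Return the temperature label for a dataset of the given age."""
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    for upper_bound, label in _AGE_BOUNDS:
        if age < upper_bound:
            return label
    # Anything 3 months or older is treated as FROZEN, including data
    # past the 3-year upper bound (assumption, not from the article).
    return "FROZEN"
```

For example, `classify_by_age(timedelta(days=45))` falls in the 1 to 3 month band and is labeled COLD.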

3. HDFS Tiered Storage

Since Hadoop 2.3, HDFS has supported tiered storage. Each DataNode stores its regular blocks in local directories configured through dfs.datanode.data.dir. Additional layers, such as ARCHIVE, are identified by a value of the StorageType enum, and a directory is assigned to a layer by prefixing its entry in that configuration with a tag such as [ARCHIVE]. Administrators can define multiple layers.
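To make the prefix convention concrete, the sketch below assembles a value for the dfs.datanode.data.dir property, tagging each directory with its storage type. The helper name and the mount points are hypothetical; only the `[TYPE]` prefix and the comma-separated layout come from HDFS itself, where an untagged directory is treated as DISK.

```python
def data_dir_value(dirs_by_type):
    """Build a comma-separated dfs.datanode.data.dir value.

    dirs_by_type maps a storage type name (e.g. "DISK", "ARCHIVE")
    to a list of local directory URIs on the DataNode.
    """
    entries = []
    for storage_type, uris in dirs_by_type.items():
        for uri in uris:
            # Each entry is tagged with its layer, e.g. [ARCHIVE]file:///...
            entries.append(f"[{storage_type}]{uri}")
    return ",".join(entries)


# Hypothetical mount points, chosen only for illustration.
value = data_dir_value({
    "DISK": ["file:///data/1/dn", "file:///data/2/dn"],
    "ARCHIVE": ["file:///archive/1/dn"],
})
print(value)
```

The resulting string is what would go into the property's value in hdfs-site.xml on that DataNode.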

4. Mapping Data Temperature to Storage Layers

Hot data keeps all replicas on the high-performance DISK layer. Warm data keeps most replicas on DISK and one in ARCHIVE. Cold data keeps at least one replica on DISK and the rest in ARCHIVE. Frozen data has all replicas in the ARCHIVE layer, whose nodes have minimal compute capability.
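The mapping can be written down as a function from temperature and replication factor to replica counts per layer. This is a minimal sketch of the rules as described here, not HDFS code; the function name is hypothetical and the default replication factor of 3 is the usual HDFS default rather than something the article states.

```python
def replica_placement(temperature: str, replication: int = 3) -> dict:
    """Return how many replicas belong on DISK and in ARCHIVE."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if temperature == "HOT":
        archive = 0                      # everything stays on DISK
    elif temperature == "WARM":
        # One replica in ARCHIVE, unless that would leave none on DISK.
        archive = 1 if replication > 1 else 0
    elif temperature == "COLD":
        archive = replication - 1        # keep exactly one replica on DISK
    elif temperature == "FROZEN":
        archive = replication            # everything moves to ARCHIVE
    else:
        raise ValueError(f"unknown temperature: {temperature}")
    return {"DISK": replication - archive, "ARCHIVE": archive}
```

With three replicas this yields 3/0 for hot, 2/1 for warm, 1/2 for cold, and 0/3 for frozen (DISK/ARCHIVE).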

5. Cross‑Layer Data Flow

When data is first added, it resides in the default DISK layer. Based on its temperature, one or more replicas are moved to the ARCHIVE layer by a “mover”. The mover accepts an HDFS path, replica count, and target layer, then schedules block copies from source nodes to destination nodes.
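The scheduling step can be illustrated with a deliberately simplified planner. It is not the actual mover: the data structures and names are hypothetical, and it ignores bandwidth, rack placement, and free space. It only shows the core idea of pairing a DISK replica of each block with an ARCHIVE node that does not yet hold that block.

```python
def plan_moves(blocks, archive_nodes, archive_replicas):
    """Plan moves so each block ends up with `archive_replicas` ARCHIVE copies.

    blocks: dict mapping block id -> list of (node, layer) tuples
    archive_nodes: list of node names in the ARCHIVE layer
    Returns a list of (block_id, source_node, destination_node) tuples.
    """
    moves = []
    for block_id, replicas in blocks.items():
        holders = {node for node, _ in replicas}
        in_archive = sum(1 for _, layer in replicas if layer == "ARCHIVE")
        sources = [node for node, layer in replicas if layer == "DISK"]
        # A destination must not already hold a replica of this block.
        targets = [node for node in archive_nodes if node not in holders]
        needed = max(0, archive_replicas - in_archive)
        for source, target in list(zip(sources, targets))[:needed]:
            moves.append((block_id, source, target))
    return moves


# Hypothetical cluster state: blk_2 already has one ARCHIVE replica.
blocks = {
    "blk_1": [("d1", "DISK"), ("d2", "DISK"), ("d3", "DISK")],
    "blk_2": [("d1", "DISK"), ("a1", "ARCHIVE"), ("d2", "DISK")],
}
print(plan_moves(blocks, ["a1", "a2"], 1))
```

Asking for one ARCHIVE replica per block schedules a single move for `blk_1` and nothing for `blk_2`, which already satisfies the target.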

6. Changes in Hadoop 2.6

Hadoop 2.6 introduces storage policies that can be attached to directories to label them as HOT, WARM, COLD, or FROZEN. Policies define how many replicas belong in each layer. Administrators can modify a directory’s policy and trigger the mover to enforce the new placement.
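As a rough sketch of the workflow, the helper below builds the two command lines an administrator might run: one to attach a policy to a directory and one to start the mover on it. It only constructs argument lists and does not execute anything. The helper name and example path are hypothetical. The `hdfs storagepolicies -setStoragePolicy` form is the one documented for later Hadoop 2.x releases, while Hadoop 2.6 itself exposed the setting through `hdfs dfsadmin`, so the exact subcommand should be checked against the release in use. The policy names a cluster accepts can be listed with `hdfs storagepolicies -listPolicies`.

```python
def storage_policy_commands(path: str, policy: str) -> list:
    """Return two argv lists: set the policy, then run the mover."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return [
        # Attach the storage policy to the directory.
        ["hdfs", "storagepolicies", "-setStoragePolicy",
         "-path", path, "-policy", policy.upper()],
        # Ask the mover to migrate replicas so they match the policy.
        ["hdfs", "mover", "-p", path],
    ]


# Hypothetical directory, used only for illustration.
commands = storage_policy_commands("/data/logs/2015-01", "cold")
for argv in commands:
    print(" ".join(argv))
```

Keeping the commands as argument lists makes them easy to pass to `subprocess.run` later without shell quoting issues.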

7. Application Usage

From an application’s perspective, the physical location of data is transparent. Even when all replicas of frozen data reside in ARCHIVE, applications can read it like any other HDFS file. However, because ARCHIVE nodes have little compute power, the application runs on DISK-layer nodes and pulls the data from ARCHIVE nodes over the network, which adds traffic. If this becomes costly, the data can be re‑classified as warm or cold and moved back to DISK.

8. eBay’s Tiered Storage Deployment

eBay operates a 40 PB Hadoop cluster and added 10 PB of low‑cost, compute‑limited storage marked as the ARCHIVE layer. New machines provide 220 TB each, and directories are labeled warm, cold, or frozen. Based on temperature, replicas are migrated to ARCHIVE, achieving a four‑fold cost reduction per GB compared to the DISK layer.
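A back-of-the-envelope calculation shows what the four-fold price difference means per temperature. The sketch normalizes DISK cost to 1.0 per GB per replica, assumes a replication factor of 3 (the HDFS default, not a figure from the article), and uses placements of 3/0, 2/1, 1/2, and 0/3 replicas on DISK/ARCHIVE for hot, warm, cold, and frozen data.

```python
DISK_COST = 1.0                 # relative cost per GB on the DISK layer
ARCHIVE_COST = DISK_COST / 4    # four-fold cheaper per GB, as reported


def relative_cost(disk_replicas: int, archive_replicas: int) -> float:
    """Relative cost of storing 1 GB of logical data across both layers."""
    return disk_replicas * DISK_COST + archive_replicas * ARCHIVE_COST


# (DISK replicas, ARCHIVE replicas) per temperature, replication 3 assumed.
placements = {
    "HOT": (3, 0),
    "WARM": (2, 1),
    "COLD": (1, 2),
    "FROZEN": (0, 3),
}
costs = {label: relative_cost(*p) for label, p in placements.items()}
print(costs)
```

Under these assumptions, frozen data costs a quarter of what hot data costs per logical GB, and even warm data saves a quarter of the DISK-only cost.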

9. Summary

Storage without compute is cheaper than storage with compute. By classifying data temperature, organizations can ensure that high‑performance storage is used efficiently while moving less‑used replicas to low‑cost archive storage. HDFS provides the necessary tiered‑storage features and mover tools, and large‑scale users such as eBay have successfully applied them for data archiving.
