Cutting Hadoop Storage Costs: Replication, Compression, Tiering & Erasure Coding

This article shares practical strategies used in a multi‑petabyte Hadoop environment to slash storage expenses, covering reduced replication, selective compression formats, tiered storage policies, and erasure coding, while weighing trade‑offs in reliability, performance, and operational complexity.

dbaplus Community

Apr 25, 2019

1. Reduce Replication Factor

HDFS stores three replicas of each block by default for high availability, which triples raw storage consumption. To cut cost, the team lowered the replica count for non‑critical temporary files to two, accepting a modest reduction in availability. A script periodically scans the tmp Hive database and resets the replication factor to two, which cuts the raw storage consumed by that data by roughly one‑third and saves millions of RMB.
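The team's script is not shown in the source, so the following is only a minimal sketch of the idea. It builds `hdfs dfs -setrep` commands for a list of paths and estimates the raw bytes freed. The warehouse path `/user/hive/warehouse/tmp.db` and the function names are assumptions for illustration, and the commands are only printed unless `dry_run` is turned off.

```python
import subprocess
from typing import List

TARGET_REPLICATION = 2  # replica count the article uses for temporary data


def build_setrep_command(path: str, replication: int = TARGET_REPLICATION) -> List[str]:
    """Return the `hdfs dfs -setrep` argument list for one path."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    # Without -w the command returns without waiting for the change to complete.
    return ["hdfs", "dfs", "-setrep", str(replication), path]


def reset_replication(paths: List[str], dry_run: bool = True) -> List[List[str]]:
    """Build one command per path and, unless dry_run is set, execute each."""
    commands = [build_setrep_command(p) for p in paths]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


def raw_bytes_saved(logical_bytes: int, old: int = 3, new: int = TARGET_REPLICATION) -> int:
    """Raw storage freed when the replica count drops from `old` to `new`."""
    return logical_bytes * (old - new)


if __name__ == "__main__":
    # Hypothetical location of the tmp Hive database; adjust to the real warehouse path.
    for cmd in reset_replication(["/user/hive/warehouse/tmp.db"]):
        print(" ".join(cmd))
```

When the path is a directory, `-setrep` applies to every file beneath it, so pointing the command at the database directory covers all of its tables. Files written later still get the cluster default, which is why the job has to run periodically.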

2. Compression

Beyond deleting data, compression is a straightforward way to shrink storage. The most common Hadoop compression formats differ in processing speed and splittability (whether a single file can be processed by multiple mappers in parallel). The table below shows the formats:

A benchmark comparison (an image in the original) indicates that bzip2 compresses and decompresses too slowly to be practical. Among the remaining formats, gzip offers the highest compression ratio at moderate speed, while lzo and snappy give up some ratio for much higher speed. Because lzo files become splittable once they are indexed, lzo is preferred for many workloads.
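The benchmark itself is not reproduced here, but the kind of measurement behind it can be sketched with Python's standard library. This example covers only gzip and bzip2, because snappy and lzo need third-party packages; the sample log line is made up, and the numbers it prints depend on the data and the machine.

```python
import bz2
import gzip
import time
from typing import Callable, Dict, Tuple


def measure(compress: Callable[[bytes], bytes], data: bytes) -> Tuple[float, float]:
    """Return (compressed size / original size, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data), elapsed


def compare_codecs(data: bytes) -> Dict[str, Tuple[float, float]]:
    """Run every codec over the same input so the results are comparable."""
    codecs = {"gzip": gzip.compress, "bzip2": bz2.compress}
    return {name: measure(fn, data) for name, fn in codecs.items()}


if __name__ == "__main__":
    # Synthetic, highly repetitive log data; real logs compress less well.
    sample = b"2019-04-25 12:00:00 INFO request served in 12 ms\n" * 20_000
    for name, (ratio, seconds) in compare_codecs(sample).items():
        print(f"{name}: ratio={ratio:.4f} time={seconds:.3f}s")
```

A lower ratio means a smaller file. Running the same harness on a representative slice of real data is a more reliable guide than any published table.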

Raw logs are ingested with snappy to balance space and speed.

Cleaned logs are archived periodically, and their snappy files are converted to gzip for extra space savings.

Structured Hive tables are stored as parquet+gzip, leveraging Parquet’s performance while using gzip for compression.

This approach achieves a practical trade‑off between storage consumption and query performance.
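For the parquet+gzip tables, one possible way to express the choice is in the table definition. The sketch below generates a Hive `CREATE TABLE` statement that sets the `parquet.compression` table property to `GZIP`. The table name, column names, and the helper function are hypothetical, and the source does not say how the team actually configured its tables.

```python
from typing import Dict


def parquet_gzip_ddl(table: str, columns: Dict[str, str]) -> str:
    """Build a Hive DDL statement for a gzip-compressed Parquet table."""
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        "STORED AS PARQUET\n"
        "TBLPROPERTIES ('parquet.compression'='GZIP')"
    )


if __name__ == "__main__":
    # Hypothetical table and columns, used only to show the generated statement.
    print(parquet_gzip_ddl("logs.page_views", {"user_id": "BIGINT", "url": "STRING"}))
```

Parquet compresses each column chunk separately, so the files stay splittable even though gzip on its own is not.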

3. Hot/Cold Tiered Storage

HDFS supports heterogeneous storage types (ARCHIVE, DISK, SSD, RAM_DISK) and storage policies that map data temperature to these types. Typical policies include:

Hot: frequently accessed data, all replicas on DISK.

Warm: mixed access, one replica on DISK, others on ARCHIVE.

Cold: rarely accessed, all replicas on ARCHIVE.

All_SSD / One_SSD: SSD‑based policies (usually excluded for cost reasons).
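Mapping data temperature to these policies could look like the sketch below, which picks a policy from the number of days since a path was last accessed and builds the matching `hdfs storagepolicies -setStoragePolicy` command. The 30‑day and 180‑day thresholds and the example path are assumptions; the source does not state how the team defines warm or cold.

```python
from typing import List


def policy_for_age(days_since_access: int, warm_after: int = 30, cold_after: int = 180) -> str:
    """Choose an HDFS storage policy name from data age (thresholds are assumed)."""
    if days_since_access < 0:
        raise ValueError("days_since_access must be non-negative")
    if days_since_access >= cold_after:
        return "COLD"
    if days_since_access >= warm_after:
        return "WARM"
    return "HOT"


def build_set_policy_command(path: str, policy: str) -> List[str]:
    """Return the argument list that assigns `policy` to `path`."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy", "-path", path, "-policy", policy]


if __name__ == "__main__":
    # Hypothetical partition directory that has not been read for 200 days.
    path = "/warehouse/logs/dt=2018-10-01"
    print(" ".join(build_set_policy_command(path, policy_for_age(200))))
```

Setting a policy only affects where new blocks are placed. Existing blocks stay put until the HDFS mover tool (`hdfs mover -p <path>`) migrates them to the storage types the policy asks for.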

In the production cluster, a custom "frozen" tier was added for the coldest data, which is stored with HDFS erasure coding (RAID‑like parity protection). Erasure coding needs less raw storage than replication for comparable fault tolerance, but reconstructing lost blocks costs extra CPU and network traffic, so it is best suited to data that is old and seldom accessed.
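The storage saving from erasure coding follows from simple arithmetic. The sketch below compares the raw storage multiplier of replication with that of a Reed-Solomon layout. The RS(6,3) and RS(10,4) layouts are common examples chosen for illustration; the source does not say which scheme the frozen tier uses.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under plain replication."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return float(replicas)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte for a (data, parity) erasure-coded layout."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("need at least one data unit and non-negative parity")
    return (data_units + parity_units) / data_units


if __name__ == "__main__":
    three_way = replication_overhead(3)
    rs_6_3 = erasure_coding_overhead(6, 3)  # assumed layout, not from the source
    print(f"3x replication stores {three_way:.2f} raw bytes per logical byte")
    print(f"RS(6,3) stores {rs_6_3:.2f} raw bytes per logical byte")
    print(f"raw storage saved vs 3x replication: {1 - rs_6_3 / three_way:.0%}")
```

Under these assumptions RS(6,3) halves raw storage relative to three replicas while tolerating the loss of any three units in a stripe, compared with two lost replicas under 3x replication.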

The cost breakdown shows that ARCHIVE machines cost about one‑third as much as DISK machines, which makes tiered storage financially attractive.
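To see what that ratio means for each policy, the sketch below computes the cost of storing a block's replicas relative to keeping them all on DISK. It assumes the one‑third figure applies per unit of stored capacity, which the source does not spell out, and the function name is illustrative.

```python
def relative_cost(disk_replicas: int, archive_replicas: int, archive_ratio: float = 1 / 3) -> float:
    """Cost of a replica placement relative to putting every replica on DISK.

    archive_ratio is the assumed cost of ARCHIVE capacity as a fraction of DISK capacity.
    """
    total = disk_replicas + archive_replicas
    if total < 1 or disk_replicas < 0 or archive_replicas < 0:
        raise ValueError("need at least one replica and non-negative counts")
    return (disk_replicas + archive_replicas * archive_ratio) / total


if __name__ == "__main__":
    placements = {"Hot": (3, 0), "Warm": (1, 2), "Cold": (0, 3)}
    for name, (disk, archive) in placements.items():
        print(f"{name}: {relative_cost(disk, archive):.2f}x the all-DISK cost")
```

With three replicas, Warm data comes out at roughly five‑ninths of the Hot cost and Cold data at one‑third, so the saving grows with the share of data that can be demoted.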

4. Large‑Capacity Storage Nodes

With 10‑GbE networks becoming ubiquitous, the team experimented with consolidating DISK and ARCHIVE roles onto a single class of large‑capacity storage nodes, eliminating the need for separate hardware tiers. Network I/O is no longer the bottleneck, and the approach simplifies capacity planning for massive cold‑data archives.

Future work includes evaluating HDFS Federation to mitigate NameNode metadata bottlenecks as petabyte‑scale datasets continue to grow.
