Cutting Hadoop Storage Costs: Replication, Compression, Tiering & Erasure Coding
This article shares practical strategies used in a multi‑petabyte Hadoop environment to slash storage expenses, covering reduced replication, selective compression formats, tiered storage policies, and erasure coding, while weighing trade‑offs in reliability, performance, and operational complexity.
1. Reduce Replication Factor
HDFS stores three replicas of each block by default for high availability, which triples raw storage consumption. To cut cost, the team lowered the replica count for non-critical temporary files to two, accepting a modest reduction in availability and durability. A script periodically scans the tmp Hive database and resets the replication factor to two, which cuts the storage consumed by those files by roughly one-third and saves millions of RMB.
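The periodic reset described above could be sketched roughly as follows. This is a minimal illustration, not the team's actual script: the warehouse path `/user/hive/warehouse/tmp.db` and the helper names are assumptions, and the only real command used is `hdfs dfs -setrep`. The helper defaults to a dry run so the command can be inspected before it touches the cluster.

```python
import subprocess
from typing import List

# Assumed location of the tmp Hive database; adjust to the real warehouse path.
TMP_DB_PATH = "/user/hive/warehouse/tmp.db"
TARGET_REPLICAS = 2


def build_setrep_command(path: str, replicas: int = TARGET_REPLICAS) -> List[str]:
    """Build the `hdfs dfs -setrep` command for a path (applies to the whole tree)."""
    if replicas < 1:
        raise ValueError("replication factor must be at least 1")
    return ["hdfs", "dfs", "-setrep", str(replicas), path]


def fraction_saved(old_replicas: int, new_replicas: int) -> float:
    """Fraction of raw storage freed when lowering the replication factor."""
    return 1 - new_replicas / old_replicas


def reset_replication(path: str = TMP_DB_PATH, dry_run: bool = True) -> List[str]:
    """Return the command; run it only when dry_run is False."""
    cmd = build_setrep_command(path)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    print(" ".join(reset_replication()))
    print(f"storage freed for affected files: {fraction_saved(3, 2):.1%}")
```

Scheduling such a script from cron or a workflow scheduler would keep newly created temporary tables from drifting back to three replicas.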
2. Compression
Beyond deleting data, compression is a straightforward way to shrink storage. The most common Hadoop compression formats differ in native processing speed and splittability (whether a single file can be processed by multiple mappers). The table below shows the formats:
A benchmark comparison (image) indicates that bzip2 has unacceptably slow compression/decompression speeds, while gzip offers a high compression ratio with moderate speed. lzo and snappy provide a good balance of speed and ratio, and because lzo files are splittable once indexed, lzo is preferred for many workloads.
Raw logs are ingested with snappy to balance space and speed.
Cleaned logs are archived periodically, with snappy files converted to gzip for extra space savings.
Structured Hive tables are stored as Parquet with gzip compression, keeping Parquet's columnar read performance while gzip shrinks the files.
This approach achieves a practical trade‑off between storage consumption and query performance.
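A quick way to repeat this kind of ratio-versus-speed comparison on your own data is a small harness like the one below. It is only a sketch: it covers gzip and bzip2 because those ship with the Python standard library, while snappy and lzo would need third-party bindings and are left out. The synthetic log lines are made up for illustration and will not behave exactly like production logs.

```python
import bz2
import gzip
import time
from typing import Callable, Dict, Tuple


def make_sample_log(lines: int = 20000) -> bytes:
    """Generate synthetic, repetitive log lines (illustrative data only)."""
    rows = [
        f"2024-01-01T00:00:{i % 60:02d} INFO request_id={i} "
        f"status=200 path=/api/v1/items/{i % 97}\n"
        for i in range(lines)
    ]
    return "".join(rows).encode("utf-8")


def measure(compress: Callable[[bytes], bytes], data: bytes) -> Tuple[float, float]:
    """Return (compression ratio, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed


def compare(data: bytes) -> Dict[str, Tuple[float, float]]:
    """Measure each standard-library codec on the same input."""
    codecs: Dict[str, Callable[[bytes], bytes]] = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
    }
    return {name: measure(fn, data) for name, fn in codecs.items()}


if __name__ == "__main__":
    for name, (ratio, seconds) in compare(make_sample_log()).items():
        print(f"{name:6s} ratio={ratio:6.2f} time={seconds * 1000:8.1f} ms")
```

Running it against a sample of real raw logs, rather than the synthetic input, gives a more reliable basis for choosing the ingest and archive codecs.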
3. Hot/Cold Tiered Storage
HDFS supports heterogeneous storage types (ARCHIVE, DISK, SSD, RAM_DISK) and storage policies that map data temperature to these types. Typical policies include:
Hot: frequently accessed data, all replicas on DISK.
Warm: mixed access, one replica on DISK, others on ARCHIVE.
Cold: rarely accessed, all replicas on ARCHIVE.
All_SSD / One_SSD: SSD-based policies (usually excluded for cost reasons).
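In the production cluster, a custom "frozen" tier was added for the coldest data, stored using HDFS erasure coding (RAID-like protection). Erasure coding stores parity blocks instead of full copies, so it needs far less raw space than replication. The price is extra CPU and network work when data is written or when lost blocks must be reconstructed, which is why it is best suited to data that is old and seldom accessed.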
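To make the storage saving concrete, the sketch below compares raw-space overhead for replication and for a Reed-Solomon layout. The article does not say which scheme the frozen tier uses, so the 6 data + 3 parity layout (the `RS-6-3-1024k` policy that ships with Hadoop 3) is an assumption, and the path passed to `hdfs ec -setPolicy` is hypothetical.

```python
from typing import List


def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under N-way replication."""
    return float(replicas)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under a (data, parity) layout."""
    return (data_units + parity_units) / data_units


def build_ec_command(path: str, policy: str = "RS-6-3-1024k") -> List[str]:
    """Build the Hadoop 3 command that assigns an erasure coding policy to a directory."""
    return ["hdfs", "ec", "-setPolicy", "-path", path, "-policy", policy]


if __name__ == "__main__":
    rep = replication_overhead(3)
    ec = erasure_coding_overhead(6, 3)
    print(f"3x replication: {rep:.1f}x raw space")
    print(f"RS(6,3):        {ec:.1f}x raw space")
    print(f"saving vs 3x:   {1 - ec / rep:.0%}")
    # Hypothetical directory name for the frozen tier.
    print(" ".join(build_ec_command("/data/frozen")))
```

Under these assumptions the frozen tier would need half the raw space of triple replication while still tolerating the loss of up to three blocks per stripe.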
The cost breakdown shows that ARCHIVE machines cost about one‑third of DISK machines, making tiered storage financially attractive.
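Assigning data to the Hot, Warm, and Cold policies can be automated along the following lines. This is a sketch under stated assumptions: the 30-day and 180-day thresholds and the example path are hypothetical, not the team's real cut-offs. The commands are the standard `hdfs storagepolicies -setStoragePolicy` and `hdfs mover`; the mover is needed because setting a policy alone does not relocate existing blocks.

```python
from typing import List


def choose_policy(days_since_access: int, warm_after: int = 30, cold_after: int = 180) -> str:
    """Map data age to a built-in HDFS storage policy name (thresholds are assumptions)."""
    if days_since_access < 0:
        raise ValueError("days_since_access must be non-negative")
    if days_since_access >= cold_after:
        return "COLD"
    if days_since_access >= warm_after:
        return "WARM"
    return "HOT"


def build_policy_commands(path: str, policy: str) -> List[List[str]]:
    """Commands to set the policy, then migrate existing blocks to match it."""
    return [
        ["hdfs", "storagepolicies", "-setStoragePolicy", "-path", path, "-policy", policy],
        ["hdfs", "mover", "-p", path],
    ]


if __name__ == "__main__":
    # Hypothetical partition directory last read 200 days ago.
    path = "/warehouse/logs/dt=2023-01-01"
    for cmd in build_policy_commands(path, choose_policy(200)):
        print(" ".join(cmd))
```

Keeping the age-to-policy rule in one small function makes the thresholds easy to tune as access patterns and hardware prices change.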
4. Large‑Capacity Storage Nodes
With 10‑GbE networks becoming ubiquitous, the team experimented with consolidating DISK and ARCHIVE roles onto a single class of large‑capacity storage nodes, eliminating the need for separate hardware tiers. Network I/O is no longer the bottleneck, and the approach simplifies capacity planning for massive cold‑data archives.
Future work includes evaluating HDFS Federation to mitigate NameNode metadata bottlenecks as petabyte‑scale datasets continue to grow.
dbaplus Community
