
Design and Implementation of Transparent Compression for Hadoop Using ZFS

The article presents a comprehensive solution for reducing Hadoop cluster storage consumption by applying ZFS‑based transparent compression and data‑governance techniques, detailing the technical background, design choices, implementation steps, performance optimizations, and observed storage savings.

Beike Product & Technology

The Big Data Architecture team at the company is responsible for the storage, compute, and real‑time streaming platforms, aiming to provide efficient OLAP engines and stable, open big‑data components.

As business volume grows, the Hadoop cluster faces increasing data ingestion, putting pressure on both compute and storage resources, especially storage. The default three‑replica policy ensures data safety but inflates storage needs to over 10 PB for a 3.2 PB dataset, exceeding the available 7 PB.

Two main approaches are considered: (1) data governance—classifying data as hot (≤ 3 months) or cold (> 3 months), then compressing cold data, periodically deleting it, or migrating it to cheaper storage; (2) replacing the three‑replica strategy with Erasure Coding (EC) in Hadoop 3.0, which cuts the effective storage overhead from 3x to roughly 1.5x while preserving fault tolerance.
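The "1.5‑replica" figure comes from simple striping arithmetic. A minimal sketch, assuming the RS(6,3) layout that Hadoop 3's default EC policy uses (the policy choice is an assumption here; the article only cites the 1.5x result):

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under a striped erasure-coding layout."""
    return (data_units + parity_units) / data_units

LOGICAL_PB = 3.2  # logical dataset size cited in the article

replication_raw = LOGICAL_PB * 3              # default 3-replica policy: 3.0x overhead
ec_raw = LOGICAL_PB * storage_overhead(6, 3)  # RS(6,3): 9 raw units per 6 data units = 1.5x

print(f"3-replica: {replication_raw:.1f} PB raw")  # 9.6 PB
print(f"RS(6,3) EC: {ec_raw:.1f} PB raw")          # 4.8 PB
```

This matches the article's numbers: three replicas of 3.2 PB approach the "over 10 PB" figure once non-EC overheads are counted, while EC would roughly halve the raw footprint.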

Given that Hadoop 3.0 is not yet released, the team prioritizes data‑governance‑driven compression. Two compression schemes are evaluated: client‑side compression (cheap to implement, but it requires application changes and offers limited control) and backend transparent compression (leveraging HDFS storage policies and Linux/ZFS features, invisible to applications). The latter is chosen.

Transparent compression relies on HDFS heterogeneous storage and ZFS filesystem compression. HDFS heterogeneous storage places hot data on SSD, warm data on disks, and cold data on archival media. ZFS provides pool management, self‑healing, variable‑block sizes, and built‑in compression (gz, lz4, etc.).
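Concretely, the two pieces meet at the DataNode: a compressed ZFS dataset is mounted and tagged as an ARCHIVE storage directory so HDFS storage policies can steer cold replicas onto it. A minimal sketch—pool name, devices, and mount path are hypothetical, not from the article:

```shell
# Create a ZFS pool over JBOD disks and a gzip-compressed dataset
# to back the DataNode's ARCHIVE storage directory.
zpool create tank sdb sdc sdd sde            # pool name and devices are examples
zfs create -o compression=gzip tank/hdfs     # gzip: best ratio in the article's tests
zfs set mountpoint=/data/archive tank/hdfs

# In hdfs-site.xml, tag the directory as ARCHIVE so storage policies
# (e.g. COLD) place replicas on the compressed volume:
#   <property>
#     <name>dfs.datanode.data.dir</name>
#     <value>[DISK]/data/disk1,[ARCHIVE]/data/archive</value>
#   </property>
```

Compression then happens entirely inside ZFS as blocks are written, which is why HDFS clients and applications see no change.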

The implementation workflow includes identifying cold data, marking it, and using HDFS mover with a filtered list to relocate data to ZFS‑compressed volumes, avoiding full‑cluster scans. Optimizations such as Hive‑based partition selection, incremental mover runs, and ZFS ARC cache tuning (set to 10 GB) further improve performance.
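The "filtered list" step above can be sketched as follows: derive cold-partition paths from Hive-style `dt=` partitions and write the local file that `hdfs mover -f` consumes, so the mover touches only those paths instead of scanning the whole cluster. Table path, partition names, and the file name are hypothetical; the ~3‑month cutoff follows the article:

```python
from datetime import date, timedelta

def cold_partitions(table_path: str, partitions: list[str],
                    today: date, hot_days: int = 90) -> list[str]:
    """Return HDFS paths for partitions older than the hot-data window
    (the article classifies data older than ~3 months as cold)."""
    cutoff = today - timedelta(days=hot_days)
    cold = []
    for p in partitions:  # Hive-style partition names like 'dt=2023-01-15'
        dt = date.fromisoformat(p.split("=", 1)[1])
        if dt < cutoff:
            cold.append(f"{table_path}/{p}")
    return cold

# Hypothetical table and partitions; in production the list would come from
# the Hive metastore rather than being hard-coded.
paths = cold_partitions(
    "/warehouse/ods.db/events",
    ["dt=2023-01-15", "dt=2023-05-20", "dt=2023-06-01"],
    today=date(2023, 6, 10),
)

# Write the list for `hdfs mover -f cold_paths.txt`.
with open("cold_paths.txt", "w") as f:
    f.write("\n".join(paths) + "\n")

print(paths)  # only dt=2023-01-15 falls outside the 90-day hot window
```

Running the mover incrementally over such lists is what keeps relocation cheap as new partitions turn cold.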

Performance tests show that ZFS gzip compression offers the best balance between compression ratio and read/write speed for cold data, while LZ4 delivers higher throughput at a lower compression ratio. Disabling atime updates and configuring DataNode disks as JBOD rather than RAID also contribute to speed gains.
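The tuning knobs above map to a handful of ZFS settings. A sketch, with a hypothetical dataset name and the 10 GB ARC cap from the article:

```shell
# Dataset name is an example; the knobs are the ones discussed above.
zfs set atime=off tank/hdfs          # skip an access-time write on every read
zfs set compression=gzip tank/hdfs   # best ratio/speed trade-off in the tests
# (compression=lz4 trades some ratio for higher throughput)

# Cap the ZFS ARC at 10 GB so the OS page cache and DataNode heap
# keep enough RAM (10 * 1024^3 bytes):
echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max

# Verify the achieved ratio on the compressed dataset:
zfs get compressratio tank/hdfs
```

Capping the ARC matters because HDFS reads are largely sequential and already cached by the DataNode's workload; a large ARC would compete with it for memory.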

After deploying transparent compression, storage usage dropped from an estimated 9.3 PB to about 6 PB (≈ 85 % of capacity), saving roughly 3 PB and the equivalent of 40 servers. Users experience no impact because compression is fully transparent.

Future work includes exploring Intel QAT acceleration for faster compression/decompression, combining EC coding with ZFS compression, and implementing intelligent data‑warm‑up based on access patterns.

Big Data, Storage Optimization, Data Governance, Hadoop, ZFS, Transparent Compression
Written by

Beike Product & Technology

As Beike's official product and technology account, we are committed to building a platform for sharing Beike's product and technology insights, targeting internet/O2O developers and product professionals. We share high-quality original articles, tech salon events, and recruitment information weekly. Welcome to follow us.
