Deduplication and Compression Techniques in Primary Storage: Differences from Backup Scenarios
This article examines how deduplication and compression, proven in backup environments, are adapted to primary storage systems (particularly HDD arrays), analyzing differences in I/O size, access patterns, performance requirements, and resource allocation, as well as the implementation approaches of major vendors such as NetApp and EMC.
Deduplication and compression are the most effective techniques for saving storage space and are commonly applied in primary storage, backup software, and flash storage. This article focuses on implementing deduplication in primary storage environments, especially HDD array systems.
1. Differences between primary storage and backup scenarios
While deduplication succeeded in backup contexts and naturally migrated to primary storage, the I/O models differ significantly. Primary storage (mechanical HDD arrays) aims to reduce disk count and cost while maintaining performance.
Key differences include:
I/O size: Backup I/O is typically large (megabytes), whereas primary storage I/O (e.g., Virtual Desktop Infrastructure) is usually 8–32 KB.
I/O pattern: Backups are mostly sequential reads/writes, whereas primary storage sees a high proportion of random reads/writes; in VDI workloads, for example, roughly 30% of I/O is random and about 90% consists of overwrite operations.
Performance requirements: Backups prioritize bandwidth with modest latency constraints, while primary storage demands high IOPS and low latency, making any added deduplication processing a potential latency penalty.
Feature positioning: Deduplication is mandatory in backup systems but optional in primary storage, leading to far fewer CPU and memory resources allocated to it.
2. Implementation differences
Because of the above gaps, primary storage deduplication differs from backup deduplication in several ways:
Timing: Most vendors run deduplication as a post-process to avoid impacting production performance, unlike the inline (online) deduplication common in backup systems.
Chunking: Primary storage deals with small, scattered I/O and non‑contiguous logical block addresses, making variable‑length chunking inefficient; fixed‑size (e.g., 4 KB) chunking is preferred.
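To make fixed-size chunking concrete, the sketch below splits a write buffer into 4 KB chunks and fingerprints each one. The chunk size matches the common primary-storage block size; SHA-256 is used here purely for illustration, since real arrays may use cheaper fingerprint functions backed by a later byte-level verification pass:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed 4 KB chunks, aligned with typical block size


def fixed_chunks(buf: bytes):
    """Yield (offset, chunk) pairs; the final chunk may be short."""
    for off in range(0, len(buf), CHUNK_SIZE):
        yield off, buf[off:off + CHUNK_SIZE]


def fingerprint(chunk: bytes) -> str:
    # Strong hash for illustration; production systems may trade this
    # for a faster fingerprint plus byte-by-byte verification.
    return hashlib.sha256(chunk).hexdigest()


data = b"A" * 8192 + b"B" * 4096          # three 4 KB blocks, two identical
prints = [fingerprint(c) for _, c in fixed_chunks(data)]
```

Because chunk boundaries are fixed at block offsets, identical blocks always yield identical fingerprints regardless of where they sit in the address space, which is what makes this approach robust against the scattered, non-contiguous writes of primary workloads.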
Duplicate detection: Backup systems often use sampling because of strong data locality; primary storage, with weaker locality, generally avoids sampling or uses minimal sampling.
3. Example implementations from major vendors
NetApp’s FAS series (and EMC’s VNX/VNX2) illustrate typical post‑process, fixed‑length deduplication:
Data is chunked in real time (4 KB chunks), fingerprints are stored in a change‑log, and data is written to disk; optional online compression may be applied before writing.
During scheduled idle windows, the system sorts fingerprints, builds a fingerprint database, and performs duplicate detection.
Identical fingerprints trigger byte‑by‑byte comparison; exact matches lead to pointer updates, reference‑count adjustments, and space reclamation.
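The post-process cycle described above (sort the change-log, detect matching fingerprints, verify byte-by-byte, then update pointers and reference counts) can be sketched as follows. The dictionaries standing in for block storage, the pointer map, and the change-log are illustrative simplifications, not any vendor's actual on-disk structures:

```python
import hashlib
from collections import defaultdict

# Stand-in structures: physical block store, and the change-log of
# (fingerprint, address) pairs accumulated at write time.
blocks = {}
change_log = []

# Pointer map (logical address -> physical address) and reference counts.
pointers = {}
refcount = defaultdict(int)


def write_block(addr: int, data: bytes):
    """Write path: store the block and log its fingerprint for later."""
    blocks[addr] = data
    change_log.append((hashlib.sha256(data).hexdigest(), addr))


def post_process_dedup():
    """Scheduled idle-time pass: sort, verify candidates, share blocks."""
    change_log.sort()  # identical fingerprints become adjacent
    first_seen = {}
    for fp, addr in change_log:
        keeper = first_seen.get(fp)
        # Byte-by-byte comparison guards against fingerprint collisions.
        if keeper is not None and blocks[keeper] == blocks[addr]:
            pointers[addr] = keeper      # redirect to the shared copy
            refcount[keeper] += 1
            del blocks[addr]             # reclaim the duplicate's space
        else:
            first_seen[fp] = addr
            pointers[addr] = addr
            refcount[addr] += 1
    change_log.clear()
```

After writing two identical blocks and one distinct block, a single `post_process_dedup()` pass leaves one shared physical copy with a reference count of 2, with the duplicate's space reclaimed.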
NetApp supports both online and post‑process compression, with the latter re‑compressing data missed by the online stage.
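One way to picture this two-stage compression: the write path attempts a fast, low-effort compression and stores the result only if it actually shrinks the block, while an idle-time pass revisits blocks stored uncompressed and retries with a stronger setting. The zlib levels and keep-if-smaller policy below are illustrative assumptions, not NetApp's actual algorithm:

```python
import zlib


def inline_compress(block: bytes):
    """Fast, low-effort attempt on the write path."""
    out = zlib.compress(block, level=1)
    if len(out) < len(block):
        return out, True
    return block, False  # store raw; the post-process pass may retry


def post_process_compress(block: bytes):
    """Idle-time retry with a stronger (slower) setting."""
    out = zlib.compress(block, level=9)
    if len(out) < len(block):
        return out, True
    return block, False


block = b"virtual desktop log entry " * 160   # repetitive ~4 KB payload
packed, stored_compressed = inline_compress(block)
```

Splitting the effort this way keeps write latency low (only cheap compression on the hot path) while still recovering space from data the inline stage skipped or compressed poorly.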
EMC’s VNX initially offered file‑level deduplication/compression for NAS and block‑level compression for SAN (post‑process only); VNX2 later added true block‑level deduplication, still as a post‑process operation.
Overall, adapting deduplication to primary storage requires careful consideration of I/O characteristics, performance impact, and resource constraints.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.