Deduplication and Compression Techniques in Primary Storage: Differences from Backup Scenarios
This article examines how deduplication and compression, proven in backup environments, are adapted to primary storage systems (particularly HDD arrays), analyzing differences in I/O size, access patterns, performance requirements, and resource allocation, as well as the implementation approaches of major vendors such as NetApp and EMC.
Deduplication and compression are the most effective techniques for saving storage space and are commonly applied in primary storage, backup software, and flash storage. This article focuses on implementing deduplication in primary storage environments, especially HDD array systems.
1. Differences between primary storage and backup scenarios
While deduplication succeeded in backup contexts and naturally migrated to primary storage, the I/O models differ significantly. Primary storage (mechanical HDD arrays) aims to reduce disk count and cost while maintaining performance.
Key differences include:
I/O size: Backup I/O is typically large (megabytes), whereas primary storage I/O (e.g., Virtual Desktop Infrastructure) is usually 8–32 KB.
I/O pattern: Backups are mostly sequential reads/writes, whereas primary storage sees a high proportion of random reads/writes; in VDI workloads, for example, roughly 30% of I/O is random and about 90% consists of overwrite operations.
Performance requirements: Backups prioritize bandwidth with modest latency constraints, while primary storage demands high IOPS and low latency, making any added deduplication processing a potential latency penalty.
Feature positioning: Deduplication is mandatory in backup systems but optional in primary storage, leading to far fewer CPU and memory resources allocated to it.
2. Implementation differences
Because of the above gaps, primary storage deduplication differs from backup deduplication in several ways:
Timing: Most vendors run deduplication as a post-process to avoid impacting production performance, unlike the inline (online) deduplication common in backup systems.
Chunking: Primary storage deals with small, scattered I/O and non‑contiguous logical block addresses, making variable‑length chunking inefficient; fixed‑size (e.g., 4 KB) chunking is preferred.
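To make fixed-size chunking concrete, the sketch below splits a write buffer into 4 KB chunks and fingerprints each one. The chunk size matches the common primary-storage block size; SHA-256 is used here purely for illustration, since real arrays may use cheaper fingerprint functions backed by a later byte-level verification pass:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed 4 KB chunks, aligned with typical block size


def fixed_chunks(buf: bytes):
    """Yield (offset, chunk) pairs; the final chunk may be short."""
    for off in range(0, len(buf), CHUNK_SIZE):
        yield off, buf[off:off + CHUNK_SIZE]


def fingerprint(chunk: bytes) -> str:
    # Strong hash for illustration; production systems may trade this
    # for a faster fingerprint plus byte-by-byte verification.
    return hashlib.sha256(chunk).hexdigest()


data = b"A" * 8192 + b"B" * 4096          # three 4 KB blocks, two identical
prints = [fingerprint(c) for _, c in fixed_chunks(data)]
```

Because chunk boundaries are fixed at block offsets, identical blocks always yield identical fingerprints regardless of where they sit in the address space, which is what makes this approach robust against the scattered, non-contiguous writes of primary workloads.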
Duplicate detection: Backup systems often use sampling because of strong data locality; primary storage, with weaker locality, generally avoids sampling or uses minimal sampling.
3. Example implementations from major vendors
NetApp’s FAS series (and EMC’s VNX/VNX2) illustrate typical post‑process, fixed‑length deduplication:
Data is chunked in real time (4 KB chunks), fingerprints are stored in a change‑log, and data is written to disk; optional online compression may be applied before writing.
During scheduled idle windows, the system sorts fingerprints, builds a fingerprint database, and performs duplicate detection.
Identical fingerprints trigger byte‑by‑byte comparison; exact matches lead to pointer updates, reference‑count adjustments, and space reclamation.
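The post-process cycle described above (sort the change-log, detect matching fingerprints, verify byte-by-byte, then update pointers and reference counts) can be sketched as follows. The dictionaries standing in for block storage, the pointer map, and the change-log are illustrative simplifications, not any vendor's actual on-disk structures:

```python
import hashlib
from collections import defaultdict

# Stand-in structures: physical block store, and the change-log of
# (fingerprint, address) pairs accumulated at write time.
blocks = {}
change_log = []

# Pointer map (logical address -> physical address) and reference counts.
pointers = {}
refcount = defaultdict(int)


def write_block(addr: int, data: bytes):
    """Write path: store the block and log its fingerprint for later."""
    blocks[addr] = data
    change_log.append((hashlib.sha256(data).hexdigest(), addr))


def post_process_dedup():
    """Scheduled idle-time pass: sort, verify candidates, share blocks."""
    change_log.sort()  # identical fingerprints become adjacent
    first_seen = {}
    for fp, addr in change_log:
        keeper = first_seen.get(fp)
        # Byte-by-byte comparison guards against fingerprint collisions.
        if keeper is not None and blocks[keeper] == blocks[addr]:
            pointers[addr] = keeper      # redirect to the shared copy
            refcount[keeper] += 1
            del blocks[addr]             # reclaim the duplicate's space
        else:
            first_seen[fp] = addr
            pointers[addr] = addr
            refcount[addr] += 1
    change_log.clear()
```

After writing two identical blocks and one distinct block, a single `post_process_dedup()` pass leaves one shared physical copy with a reference count of 2, with the duplicate's space reclaimed.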
NetApp supports both online and post‑process compression, with the latter re‑compressing data missed by the online stage.
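One way to picture this two-stage compression: the write path attempts a fast, low-effort compression and stores the result only if it actually shrinks the block, while an idle-time pass revisits blocks stored uncompressed and retries with a stronger setting. The zlib levels and keep-if-smaller policy below are illustrative assumptions, not NetApp's actual algorithm:

```python
import zlib


def inline_compress(block: bytes):
    """Fast, low-effort attempt on the write path."""
    out = zlib.compress(block, level=1)
    if len(out) < len(block):
        return out, True
    return block, False  # store raw; the post-process pass may retry


def post_process_compress(block: bytes):
    """Idle-time retry with a stronger (slower) setting."""
    out = zlib.compress(block, level=9)
    if len(out) < len(block):
        return out, True
    return block, False


block = b"virtual desktop log entry " * 160   # repetitive ~4 KB payload
packed, stored_compressed = inline_compress(block)
```

Splitting the effort this way keeps write latency low (only cheap compression on the hot path) while still recovering space from data the inline stage skipped or compressed poorly.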
EMC’s VNX initially offered file‑level deduplication/compression for NAS and block‑level compression for SAN (post‑process only); VNX2 later added true block‑level deduplication, still as a post‑process operation.
Overall, adapting deduplication to primary storage requires careful consideration of I/O characteristics, performance impact, and resource constraints.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.