
Unlocking Massive Data Deduplication: PBBA Appliances vs Backup Software

Backup environments generate abundant duplicate data, making deduplication essential. This article examines how purpose-built backup appliances (PBBA) and leading backup software implement variable-length, global deduplication, compares scale-out and scale-up architectures, and discusses performance trade-offs and CPU bottlenecks.

Architects' Tech Alliance

Feb 12, 2016

Deduplication in Purpose‑Built Backup Appliances

In backup scenarios, users typically run daily incremental backups and a weekly full backup, creating a large amount of duplicate data that is ideal for deduplication. Dedicated backup appliances (Purpose‑Built Backup Appliances, PBBA) and Virtual Tape Library (VTL) products now include deduplication as a core feature.

EMC Data Domain leads the PBBA market, followed by HP's B-series products; together they hold most of the market share. HP's B-series and FalconStor's VTL solutions implement inline, variable-length, global deduplication. Data streams from the front-end backup software are split into variable-size chunks; a SHA-1 fingerprint is computed for each chunk and looked up in a fingerprint database to detect duplicates.
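The chunk-and-fingerprint pipeline described above can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the rolling polynomial hash, the window size, the boundary mask, and the minimum and maximum chunk sizes are all assumptions chosen for readability, and the names `chunk_stream` and `deduplicate` are hypothetical. An in-memory dict stands in for the fingerprint database.

```python
import hashlib

# All constants below are illustrative assumptions, not vendor settings.
WINDOW = 48              # bytes covered by the rolling hash
MASK = (1 << 12) - 1     # cut when the low 12 bits are zero (~4 KiB spacing)
MIN_CHUNK = 1024         # never cut before this many bytes
MAX_CHUNK = 16384        # always cut at this many bytes
BASE = 257
MOD = (1 << 61) - 1


def chunk_stream(data: bytes):
    """Yield variable-size chunks whose boundaries depend on content."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]


def deduplicate(data: bytes, index: dict):
    """Store unseen chunks in `index`; return (total_chunks, new_chunks)."""
    total = new = 0
    for chunk in chunk_stream(data):
        fp = hashlib.sha1(chunk).digest()  # SHA-1 fingerprint, as in the text
        total += 1
        if fp not in index:
            index[fp] = chunk
            new += 1
    return total, new
```

Because boundaries are derived from the bytes themselves, inserting a few bytes near the start of a stream shifts only the nearby chunks; later boundaries fall on the same content again, so most fingerprints still match. Fixed-size chunking would misalign every chunk after the insertion point.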

Because backup appliances store massive amounts of data with relatively small chunk sizes (a few kilobytes), managing a fingerprint for every chunk would create an enormous index and degrade lookup performance. To address this, HP’s B‑series groups consecutive chunks into larger blocks (typically 1 MB) and selects representative fingerprints—such as the maximum or minimum hash value—to represent the whole block. This sampling dramatically reduces metadata size while preserving high deduplication ratios (10:1 to 20:1 or higher) and enables backup speeds exceeding 50 TB per hour.
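The sampling idea can be sketched as a sparse index that keeps one representative fingerprint per segment. This is an assumption-laden illustration rather than HP's actual design: the function names are hypothetical, the minimum hash is used as the representative, and the per-segment fingerprint set is held in memory here, whereas an appliance would presumably keep it on disk and hold only the representatives in RAM.

```python
import hashlib

SEGMENT_BYTES = 1 << 20  # roughly 1 MB per segment, as in the text


def build_segments(chunks):
    """Group consecutive chunks into segments of about SEGMENT_BYTES."""
    segment, size = [], 0
    for chunk in chunks:
        segment.append(chunk)
        size += len(chunk)
        if size >= SEGMENT_BYTES:
            yield segment
            segment, size = [], 0
    if segment:
        yield segment


def ingest(chunks, sparse_index, stats):
    """Deduplicate chunks using one index entry per segment."""
    for segment in build_segments(chunks):
        fps = [hashlib.sha1(c).digest() for c in segment]
        rep = min(fps)  # representative fingerprint: the minimum hash
        # A hit on the representative pulls in the fingerprints of the
        # similar segment seen earlier; chunks are compared against those.
        known = sparse_index.get(rep, frozenset())
        stats["new_chunks"] += sum(1 for fp in fps if fp not in known)
        stats["total_chunks"] += len(fps)
        sparse_index[rep] = known | frozenset(fps)
```

As a rough sizing example under assumed numbers (100 TiB stored, 4 KiB chunks, 20-byte SHA-1 fingerprints): a full index needs about 26.8 billion entries, or roughly 537 GB of fingerprints, while one representative per 1 MiB segment needs about 105 million entries, or roughly 2.1 GB. The cost is that a duplicate chunk is missed whenever its segment's representative does not match an earlier one.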

HP’s B‑series supports horizontal scaling (Scale‑Out) with global deduplication across nodes, whereas EMC Data Domain relies on vertical scaling (Scale‑Up) by adding disks to a single node, despite claiming global deduplication. In practice, Scale‑Out better matches PBBA requirements because, after deduplication and compression, the backend disk bandwidth is rarely the bottleneck; CPU resources for chunking, fingerprinting, and compression become the limiting factor.
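The source does not say how data is distributed between nodes in a scale-out cluster. One common way to keep deduplication global is to route each segment by its representative fingerprint, so that identical segments always reach the same node and each node contributes its own CPU to chunking, fingerprinting, and compression. The sketch below assumes that approach; `route_segment` is a hypothetical helper, and simple modulo placement is used only for brevity.

```python
import hashlib


def route_segment(representative_fp: bytes, node_count: int) -> int:
    """Pick a node for a segment from its representative fingerprint.

    Assumption: modulo placement. Changing node_count remaps most segments,
    so a real cluster would more likely use consistent hashing.
    """
    return int.from_bytes(representative_fp[:8], "big") % node_count


def place_segments(segments, node_count):
    """Return {node: [segment, ...]} for a list of byte-string segments."""
    placement = {node: [] for node in range(node_count)}
    for segment in segments:
        fp = hashlib.sha1(segment).digest()
        placement[route_segment(fp, node_count)].append(segment)
    return placement
```

With this placement, a segment that reappears in a later backup lands on the node that already holds its fingerprints, so no cross-node index lookup is needed to detect the duplicate.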

Deduplication in Backup Software

Backup software also performs deduplication after data is packaged by the client. Leading products—Veritas NetBackup, Dell EMC NetWorker, and CommVault Simpana—implement deduplication within the software stack. This article uses CommVault Simpana as an example.

In Simpana, incoming data is split into fixed-size chunks, and each chunk is hashed with SHA-256 to generate a fingerprint. Fingerprint calculation can occur on the client or on the backup server. Fingerprints are stored in a dedicated Deduplication Database (DDB) and used to identify duplicate chunks. Only unique chunks are compressed and written to the backend storage.
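The flow above (fixed-size split, SHA-256 fingerprint, DDB lookup, compress and store unique chunks) can be sketched in a few lines. This is an illustration under stated assumptions, not CommVault's code: the `DedupStore` class is hypothetical, a dict stands in for the DDB, a list stands in for backend storage, zlib stands in for whatever compressor the product uses, and the 128 KiB block size is an assumed value.

```python
import hashlib
import zlib

BLOCK_SIZE = 128 * 1024  # assumed fixed block size


class DedupStore:
    """Toy model of fixed-block deduplication with a fingerprint database."""

    def __init__(self):
        self.ddb = {}      # fingerprint -> position in backend (stand-in DDB)
        self.backend = []  # compressed unique blocks (stand-in storage)

    def write(self, data: bytes):
        """Store data and return its recipe: the ordered fingerprints."""
        recipe = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            fp = hashlib.sha256(block).digest()
            if fp not in self.ddb:
                # Unique block: compress it and record where it was written.
                self.backend.append(zlib.compress(block))
                self.ddb[fp] = len(self.backend) - 1
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from a recipe."""
        return b"".join(
            zlib.decompress(self.backend[self.ddb[fp]]) for fp in recipe
        )
```

Every block written triggers one DDB lookup, which is why the latency of that database bounds ingest throughput.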

The performance of the DDB is critical; for larger deployments, Simpana recommends placing the DDB on solid‑state drives or a dedicated server to avoid bottlenecks.

Compared to PBBA deduplication, software‑based deduplication offers source‑side reduction, which is advantageous when network bandwidth is limited and a dedicated backup server is available. PBBA‑based deduplication, by contrast, performs reduction at the storage side, minimizing impact on production storage and leveraging ample backup bandwidth.
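The bandwidth argument can be made concrete with a little arithmetic. The helper below is a hypothetical back-of-the-envelope model: it assumes source-side deduplication sends one fingerprint per block plus only the blocks the server lacks, while storage-side deduplication sends every block. Compression and protocol overhead are ignored.

```python
def bytes_over_network(total_blocks, unique_blocks, block_size,
                       fingerprint_size=32):
    """Return (source_side_bytes, storage_side_bytes) for one backup job.

    fingerprint_size defaults to 32 bytes, the length of a SHA-256 digest.
    """
    source_side = total_blocks * fingerprint_size + unique_blocks * block_size
    storage_side = total_blocks * block_size
    return source_side, storage_side


# Example with assumed numbers: 1,000 blocks of 128 KiB, 10% of them new.
source_side, storage_side = bytes_over_network(1000, 100, 128 * 1024)
print(source_side, storage_side)  # 13139200 131072000
```

In this example the source-side job moves about a tenth of the data across the network, at the price of hashing every block on the client.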

