
Understanding Data Deduplication and Compression in Backup Software

This article explains the core features of backup software, focusing on data deduplication and compression techniques, including source‑side, target‑side, and media‑side deduplication, parallel deduplication architectures, replication methods, and hardware snapshot integration, illustrated with SimPana and AnyBackup examples.

Architects' Tech Alliance

Jul 23, 2016


Earlier articles introduced the system architecture and network topology of backup software. This piece focuses on a key feature: data deduplication and compression. Deduplicating data before sending it over a WAN to a remote backup center significantly reduces bandwidth usage, lowers line rental costs, and saves storage space on the backup media.

Backup deduplication methods are diverse and are generally classified into source‑side deduplication, target‑side deduplication, and media‑side deduplication.

Source‑side deduplication is provided by the backup software and performed on the backup client; it effectively saves network bandwidth but may impact production workloads.

Target‑side deduplication is also provided by the backup software but occurs on the media server. Its advantage is that each client can share deduplication fingerprints, achieving global deduplication.

Media‑side deduplication is typically offered by the backup medium itself, such as a VTL or a storage system with built‑in deduplication. Physical tape libraries, because of their linear read/write nature, can only perform compression; when the backup software detects a tape library or VTL, it may disable its own deduplication. Media‑side deduplication can be inline or post‑process. Inline deduplication removes duplicates before data is written, so it needs less disk space than post‑process deduplication, which first lands the full data and deduplicates it afterwards.

Deduplication works by dividing data into fixed‑size blocks, computing a hash for each block, and comparing the hash against a deduplication hash database. If the hash already exists, the block is not stored again; only an index or pointer is recorded. If the hash is new, the block is physically stored and indexed, ensuring that identical blocks are stored only once physically while logical views remain complete.
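The block, hash, and lookup cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `DedupStore` class, the 4‑byte block size, and the use of SHA‑256 are assumptions chosen to keep the example small.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, chosen only for readability (assumption)


class DedupStore:
    """Hypothetical in-memory store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block bytes (physical storage)
        self.recipes = {}  # backup name -> ordered fingerprints (logical view)

    def backup(self, name, data):
        """Store a backup and return how many new blocks were written."""
        recipe = []
        new_blocks = 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.blocks:
                # Unknown fingerprint: store the block physically.
                self.blocks[fp] = block
                new_blocks += 1
            # Known or not, the logical view only records the fingerprint.
            recipe.append(fp)
        self.recipes[name] = recipe
        return new_blocks

    def restore(self, name):
        """Rebuild the logical view by following the fingerprint pointers."""
        return b"".join(self.blocks[fp] for fp in self.recipes[name])


store = DedupStore()
first = store.backup("monday", b"AAAABBBBAAAACCCC")   # 3 unique blocks
second = store.backup("tuesday", b"AAAABBBBDDDDCCCC")  # only DDDD is new
```

After both backups the store holds four physical blocks even though eight logical blocks were backed up, and `store.restore("monday")` returns the original bytes.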

In this article we analyze the deduplication, compression, and replication features of SimPana and AnyBackup backup software to help readers better understand these technologies.

SimPana Deduplication Principle

SimPana uses a two‑level fingerprint database architecture (SSDB and DDB) and supports both source‑side and target‑side deduplication, though it is generally recommended to enable only one for performance reasons. When a backup job starts, if the client has source‑side deduplication enabled, the client first compresses the data and then splits it into fixed‑size blocks, computing a SHA‑512 hash for each block.

If a hash is not found in the local source‑side database (SSDB), the client sends the hash to the media server (MA), which checks its DDB. If the hash is absent there as well, the client transmits the data block and its hash index to the MA for storage, and both SSDB and DDB are updated.

If the hash already exists in SSDB, the block is recognized as a duplicate and only an index is recorded, so no data is transferred to the media.
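The two‑level lookup can be sketched as follows. The `Client` and `MediaAgent` classes are hypothetical stand‑ins, not SimPana APIs, and compression is omitted. One case is an assumption on my part: when a hash is missing from SSDB but present in DDB, the sketch skips the transfer and simply records the hash locally.

```python
import hashlib


class MediaAgent:
    """Hypothetical stand-in for the MA; holds the shared DDB."""

    def __init__(self):
        self.ddb = {}  # fingerprint -> stored block

    def has(self, fp):
        return fp in self.ddb

    def store(self, fp, block):
        self.ddb[fp] = block


class Client:
    """Hypothetical backup client with a local fingerprint cache (SSDB)."""

    def __init__(self, ma):
        self.ma = ma
        self.ssdb = set()
        self.bytes_sent = 0  # block payload actually sent to the MA

    def backup(self, data, block_size=4):
        index = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            fp = hashlib.sha512(block).hexdigest()
            if fp not in self.ssdb:
                # Level 1 miss: ask the MA, sending only the hash.
                if not self.ma.has(fp):
                    # Level 2 miss: the block is new, so ship it.
                    self.ma.store(fp, block)
                    self.bytes_sent += len(block)
                # Assumption: a DDB hit is cached locally as well.
                self.ssdb.add(fp)
            index.append(fp)  # duplicates cost only an index entry
        return index


ma = MediaAgent()
client_a = Client(ma)
client_b = Client(ma)
client_a.backup(b"AAAABBBB")  # both blocks are new
client_b.backup(b"BBBBCCCC")  # BBBB is already in the shared DDB
client_a.backup(b"AAAABBBB")  # everything is found in SSDB
```

Because both clients talk to the same `MediaAgent`, `client_b` sends only the `CCCC` block, which is the global deduplication effect the next paragraph describes.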

Global deduplication relies on the same two‑level fingerprint architecture. SSDB resides on the client, while hash values are synchronized to the MA’s DDB. By sharing DDB, all clients achieve global deduplication, eliminating duplicate data across hundreds of clients, even across different storage systems and policies.

Parallel Media‑Server Deduplication

Because a single MA has limited performance and deduplication capacity, a group of MA servers and storage media can be clustered together, sharing the DDB and synchronizing fingerprints across the cluster. This forms a parallel deduplication domain that delivers higher throughput and concurrency, with built‑in load balancing.

As more MA nodes are added to the parallel deduplication grid, deduplication capacity, throughput, and concurrency increase linearly. The grid also provides automatic failover; if one MA node fails, the remaining nodes continue operation and take over its workload.
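As a rough illustration of load balancing and failover over a shared fingerprint database, consider the sketch below. The `Grid` class and its round‑robin dispatch are assumptions made for the example; the article does not say how SimPana actually distributes work among MA nodes.

```python
import itertools


class Grid:
    """Hypothetical dedup grid: several nodes, one shared fingerprint set."""

    def __init__(self, node_names):
        self.shared_ddb = set()
        self.alive = {name: True for name in node_names}
        self.handled = {name: 0 for name in node_names}
        self._cycle = itertools.cycle(node_names)

    def fail(self, name):
        """Mark a node as failed; later requests skip it."""
        self.alive[name] = False

    def _next_node(self):
        # Round-robin over the nodes, skipping any that have failed.
        for _ in range(len(self.alive)):
            name = next(self._cycle)
            if self.alive[name]:
                return name
        raise RuntimeError("no media agent available")

    def lookup_or_add(self, fp):
        """Return (node that served the request, whether fp was a duplicate)."""
        node = self._next_node()
        self.handled[node] += 1
        is_dup = fp in self.shared_ddb
        self.shared_ddb.add(fp)
        return node, is_dup


grid = Grid(["ma1", "ma2", "ma3"])
r1 = grid.lookup_or_add("f1")  # served by ma1, new fingerprint
r2 = grid.lookup_or_add("f2")  # served by ma2, new fingerprint
grid.fail("ma3")
r3 = grid.lookup_or_add("f1")  # ma3 is skipped; ma1 sees the duplicate
```

Since the fingerprint set is shared, a surviving node can still recognize blocks that were first seen through another node.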

Data Replication Feature

Recovery uses the same hash index pointers to reconstruct the logical view, so no re‑assembly of deduplicated data is required, resulting in fast restore times.

SimPana also supports remote copy for disaster recovery. Non‑deduplicated copies are called Auxiliary Copy, while deduplicated copies are called Dash Copy, which is more widely used. The following focuses on the Dash Copy technology.

Under control of the backup management server, a copy task reads the hash values of the data to be copied from the source MA and sends them to the target MA. The target MA compares these hashes against its DDB; only new blocks are transferred, while existing blocks are represented by index records. The Vault Tracker feature enables restoration from the source side.
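The hash‑first exchange can be sketched as a single function. This is a simplified model under stated assumptions: `replicate`, the dictionary‑based DDBs, and SHA‑256 fingerprints are illustrative choices, and the real Dash Copy protocol is not described in enough detail here to reproduce.

```python
import hashlib


def fingerprint(block):
    return hashlib.sha256(block).hexdigest()


def replicate(source_blocks, recipe, target_ddb):
    """Copy one backup to the target, shipping only blocks it lacks.

    source_blocks: fingerprint -> block on the source MA
    recipe:        ordered fingerprints that make up the backup
    target_ddb:    fingerprint -> block on the target MA (updated in place)
    Returns the recipe recorded at the target and the payload bytes sent.
    """
    # Step 1: only fingerprints cross the link; the target reports misses.
    missing = [fp for fp in dict.fromkeys(recipe) if fp not in target_ddb]
    # Step 2: the source ships just the blocks the target does not have.
    for fp in missing:
        target_ddb[fp] = source_blocks[fp]
    bytes_sent = sum(len(source_blocks[fp]) for fp in missing)
    # Existing blocks are represented by index records only.
    return list(recipe), bytes_sent


blocks = [b"AAAA", b"BBBB", b"CCCC"]
source_blocks = {fingerprint(b): b for b in blocks}
recipe = [fingerprint(b) for b in (b"AAAA", b"BBBB", b"AAAA", b"CCCC")]
target_ddb = {fingerprint(b"AAAA"): b"AAAA"}  # target already holds AAAA

target_recipe, bytes_sent = replicate(source_blocks, recipe, target_ddb)
restored = b"".join(target_ddb[fp] for fp in target_recipe)
```

In this run only `BBBB` and `CCCC` are transferred, yet the target can rebuild the full 16‑byte backup from its own DDB.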

Hardware Snapshot IntelliSnap

Hardware snapshots provide the foundation for server‑free backup and enable storage‑application integration. Many storage vendors now partner with backup software vendors to offer snapshot‑based backups for critical applications such as DB2, MySQL, Oracle, SQL Server, and SAP with minimal impact.

Deployment involves installing the iDataAgent (iDA) on the business server to identify applications, LUNs, and volumes, and then issuing snapshot commands. Snapshots are indexed and mounted on the media server (MA) for backup. Combined with backup policies, this enables local deduplication, cross‑MA replication for off‑site disaster recovery, and long‑term retention.

AnyBackup Deduplication Principle

AnyBackup also uses a backup management server and media server architecture, but requires the backup and media servers to be co‑located and does not support independent deployment. It supports mainstream and domestic operating systems and databases such as Zhongbiao Kylin, RedFlag, Gbase, and DMDB.

AnyBackup’s deduplication works similarly to SimPana: the client first chunks data and computes hash fingerprints locally. If a fingerprint is not found locally, the client performs a second chunking and sends the hash to the backup server for lookup. If the hash exists, the data is deemed redundant and not stored on the media; otherwise, both data and fingerprint are transmitted and stored, updating both client‑side and server‑side fingerprint databases. AnyBackup employs two‑stage chunking and variable‑size blocks to achieve high deduplication ratios.
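To show what variable‑size blocks mean in practice, here is a minimal content‑defined chunking sketch. It is not AnyBackup's algorithm: the windowed byte sum is a toy stand‑in for a real rolling hash, the parameters are arbitrary, and the second chunking stage is not modeled. The idea is that cut points depend on the data itself, so boundaries can realign after an insertion instead of shifting every block as fixed‑size chunking does.

```python
import hashlib
import random


def cdc_chunks(data, window=4, mask=0x0F, min_size=8, max_size=64):
    """Split data where a hash of the last `window` bytes matches a pattern.

    Requires window <= min_size so the window never reaches before the
    start of the current chunk.
    """
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        # Toy hash: sum of the trailing window (a real system would use
        # a proper rolling hash).
        h = sum(data[i - window + 1:i + 1])
        if (h & mask) == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter
    return chunks


def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(2000))
shifted = b"X" + data  # one byte inserted at the front

chunks = cdc_chunks(data)
shared = fingerprints(chunks) & fingerprints(cdc_chunks(shifted))
print(f"{len(shared)} of {len(chunks)} chunks unchanged after the insert")
```

The `min_size` and `max_size` limits keep chunks from becoming too small or too large, at the cost of occasionally forcing a cut that is not content‑defined.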

AnyBackup Remote Replication Technology

Currently AnyBackup does not support standalone media servers, so replication is limited to backup‑domain‑to‑backup‑domain copying. Within a single media server, global deduplication is available because the backup management and media server are deployed together.

AnyBackup Virtual Machine Instant Recovery

Instant recovery is a highlight of VM backup, allowing a virtual machine to be launched directly from the backup medium via its configuration file without first restoring the VM files to production storage, dramatically reducing failover time.

The process works as follows: a snapshot of the VM is taken and the snapshot file is backed up with AnyBackup. When recovery is needed, the backup server parses the virtual disk file and presents it to ESXi over NFS so that the VM can start immediately. This instant‑recovery approach is also supported by Veeam, eBackup, AceSure, and other backup solutions, and it offers a significant advantage in VM backup scenarios.
