How DPRD Improves Data Placement Over CRUSH in Distributed Storage
This article introduces the DPRD hierarchical data placement strategy, explains its roots in the CRUSH algorithm, highlights CRUSH's migration inefficiencies, and details how DPRD achieves near‑minimal replica movement and balanced distribution during both expansion and shrinkage of a distributed storage system.
Introduction
The paper proposes a hierarchical data placement strategy called DPRD, applied to distributed storage systems such as Zeppelin, derived from the CRUSH algorithm and optimized for expansion and shrinkage scenarios.
CRUSH Overview
CRUSH is the data distribution algorithm used by Ceph. It treats physical devices as buckets (root, rack, cabinet, disk) forming a layered, fault‑domain‑aware hierarchy.
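The layered hierarchy can be pictured as a small tree of buckets. The sketch below is a minimal toy model, not Ceph's actual data structures (real CRUSH maps carry bucket types, item ids, and per-bucket placement algorithms); the bucket names are invented for illustration.

```python
from dataclasses import dataclass, field

# Toy model of a CRUSH-style hierarchy: interior buckets (root, racks)
# contain children; leaves (disks) carry the actual storage weight.
@dataclass
class Bucket:
    name: str
    weight: float = 0.0                    # leaf capacity; interior weight is derived
    children: list["Bucket"] = field(default_factory=list)

    def total_weight(self) -> float:
        """An interior bucket's weight is the sum of its children's weights."""
        if not self.children:
            return self.weight
        return sum(c.total_weight() for c in self.children)

root = Bucket("root", children=[
    Bucket("rack1", children=[Bucket("disk11", 1.0), Bucket("disk12", 1.0)]),
    Bucket("rack2", children=[Bucket("disk21", 2.0)]),
])
```

Because replica selection descends this tree one level at a time, placing each replica under a different rack gives fault-domain separation for free.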
CRUSH Migration Strategy
CRUSH migration involves receiving map updates, recomputing PG‑OSD mappings, and determining migration directions based on differences between old and new maps (e.g., moving a replica from OSD 3 to OSD 4).
Problems with CRUSH
Two main issues are identified: (1) the default straw bucket can cause 2–3× the theoretical migration rate, and (2) the algorithm does not guarantee balanced PG distribution, especially when each node holds few PGs.
DPRD Goals
Goals (under a hierarchical topology with fault‑domain separation):
1. The replica movement rate should approach the theoretical minimum.
2. Replica distribution should be as uniform as possible.
Service phases:
1. Initial replica placement.
2. Redistribution during expansion.
3. Redistribution during shrinkage.
DPRD Strategy
In DPRD, a bucket's Factor is defined as partitions / weight, the number of partitions per unit weight of that bucket. The Base Factor equals 1 / weight, the change in Factor caused by adding or removing a single partition. The Average Factor is the sum of partitions divided by the sum of weights across sibling buckets, and serves as the balance reference during scaling.
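These three quantities can be written directly from the definitions. The function names below are mine, not Zeppelin's:

```python
def factor(partitions: int, weight: float) -> float:
    """Partitions per unit weight of one bucket."""
    return partitions / weight

def base_factor(weight: float) -> float:
    """How much one partition shifts this bucket's Factor."""
    return 1 / weight

def average_factor(buckets) -> float:
    """buckets: iterable of (partitions, weight) pairs for sibling buckets."""
    total_p = sum(p for p, _ in buckets)
    total_w = sum(w for _, w in buckets)
    return total_p / total_w
```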
DPRD's select‑N step differs from CRUSH's: at each level it computes every child bucket's Factor and chooses the N buckets with the smallest Factors, so each partition lands on the least‑loaded buckets at that level.
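A minimal sketch of that select-N step, assuming a toy child-bucket representation (the dict fields and rack names are invented):

```python
import heapq

def select_n(children, n):
    """Pick the n child buckets with the smallest Factor (partitions / weight)."""
    return heapq.nsmallest(n, children, key=lambda c: c["partitions"] / c["weight"])

racks = [
    {"name": "rack1", "partitions": 10, "weight": 2.0},  # Factor 5.0
    {"name": "rack2", "partitions": 3,  "weight": 1.0},  # Factor 3.0
    {"name": "rack3", "partitions": 8,  "weight": 4.0},  # Factor 2.0
]
print([r["name"] for r in select_n(racks, 2)])  # → ['rack3', 'rack2']
```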
Selection Process
Step 1: From the overloaded rack (e.g., rack 5), choose the host whose Factor deviates most positively from the average (e.g., host 52). Step 2: Recursively descend to select a child bucket until a node bucket is reached. Step 3: Within that node bucket, pick a partition not already present in the under‑loaded rack (e.g., rack 2) and move it there. If no such partition exists, repeat Step 1 with the next most deviating host.
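The three steps above can be sketched as follows, under a toy model where each host carries a list of partition ids; the structures and names are illustrative, not Zeppelin's actual types.

```python
def pick_move(overloaded_rack, underloaded_rack, avg_factor):
    # Step 1: visit hosts in order of how far their Factor exceeds the average.
    hosts = sorted(overloaded_rack["hosts"],
                   key=lambda h: len(h["partitions"]) / h["weight"] - avg_factor,
                   reverse=True)
    # Any partition already present in the target rack is ineligible there.
    present = set().union(*(set(h["partitions"]) for h in underloaded_rack["hosts"]))
    for host in hosts:                     # fall through to the next host if needed
        for p in host["partitions"]:       # Steps 2-3: find a movable partition
            if p not in present:
                return host["name"], p
    return None                            # no legal move out of this rack
```

For example, if host52 holds partitions {1, 3, 4} and the under-loaded rack already holds {1, 3}, only partition 4 is eligible to move.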
Balancing and Unbalance Definition
A bucket is considered unbalanced when its Factor deviates from the Average Factor, in either direction, by more than its Base Factor.
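The definition reduces to a one-line predicate (function name is mine):

```python
def is_unbalanced(partitions: int, weight: float, avg_factor: float) -> bool:
    """True when |Factor - Average Factor| exceeds the Base Factor (1 / weight)."""
    return abs(partitions / weight - avg_factor) > 1 / weight
```

Using the Base Factor as the tolerance means a bucket is "balanced" as long as moving a single partition could not bring it closer to the average.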
Overall Balancing Process
DPRD balances from the root downwards: after equalizing the rack layer, it proceeds to the host layer, and continues recursively until the entire tree is balanced.
Implementation for Expansion
Adding a bucket temporarily unbalances the tree. DPRD then selects the most overloaded host, recursively chooses a child bucket, and moves a partition from the overloaded side to the under‑loaded side, repeating until balance is restored.
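A simplified sketch of that rebalancing loop at one level of the tree: move one partition at a time from the heaviest child to the lightest until the heaviest no longer exceeds the Average Factor by its Base Factor. The dict layout is assumed for illustration.

```python
def balance(children):
    """children: list of dicts with a "partitions" list and a "weight"."""
    def fac(c):
        return len(c["partitions"]) / c["weight"]
    avg = (sum(len(c["partitions"]) for c in children)
           / sum(c["weight"] for c in children))
    while True:
        hi = max(children, key=fac)        # most overloaded child
        lo = min(children, key=fac)        # least loaded child
        if not hi["partitions"] or fac(hi) - avg <= 1 / hi["weight"]:
            break                          # within Base Factor of average: done
        lo["partitions"].append(hi["partitions"].pop())
```

Note the stop condition reuses the unbalance definition, so the loop terminates as soon as no child deviates by more than one partition's worth of weight.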
Implementation for Shrinkage
Rather than removing all replicas hosted by a departing bucket at once (which would cause excessive movement), DPRD removes only the replicas that must go and, at the choose_n layer, selects a new bucket that does not yet hold the partition to receive a replacement, following a four‑step procedure: record the affected partition, delete the replica, select a target bucket, and re‑initialize the partition there.
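The four steps can be sketched on the same toy bucket model used above (record, delete, select, re-initialize); all structures and names here are illustrative, not the actual Zeppelin implementation.

```python
def shrink(buckets, removed_name):
    """Retire one bucket, re-placing each of its replicas on a surviving bucket."""
    removed = next(b for b in buckets if b["name"] == removed_name)
    survivors = [b for b in buckets if b["name"] != removed_name]
    for p in list(removed["partitions"]):      # 1. record the affected partitions
        removed["partitions"].remove(p)        # 2. delete the retiring replica
        # 3. select the lightest surviving bucket that lacks this partition
        candidates = [b for b in survivors if p not in b["partitions"]]
        target = min(candidates, key=lambda b: len(b["partitions"]) / b["weight"])
        target["partitions"].append(p)         # 4. re-initialize the replica there
    return survivors
```

Only the replicas on the departing bucket move; replicas elsewhere stay put, which is what keeps the movement rate near the theoretical minimum.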
Conclusion
The DPRD strategy maintains fault‑domain separation while achieving near‑uniform data distribution and minimizing replica movement during both scaling up and down.
References
GitHub pull request (https://github.com/Qihoo360/zeppelin-client/pull/21)
CRUSH paper (https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf)
Zeppelin source code (https://github.com/Qihoo360/zeppelin-client)
Discussion on distributed storage data placement methods (http://catkang.github.io/2017/12/17/data-placement.html)
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.