How DPRD Improves Data Placement Over CRUSH in Distributed Storage
This article introduces the DPRD hierarchical data placement strategy, explains its roots in the CRUSH algorithm, highlights CRUSH's migration inefficiencies, and details how DPRD achieves near‑minimal replica movement and balanced distribution during both expansion and shrinkage of a distributed storage system.
Introduction
The paper proposes a hierarchical data placement strategy called DPRD, applied to distributed storage systems such as Zeppelin, derived from the CRUSH algorithm and optimized for expansion and shrinkage scenarios.
CRUSH Overview
CRUSH is the data distribution algorithm used by Ceph. It treats physical devices as buckets (root, rack, cabinet, disk) forming a layered, fault‑domain‑aware hierarchy.
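The layered hierarchy can be pictured as a small tree of buckets. The sketch below is a minimal toy model, not Ceph's actual data structures (real CRUSH maps carry bucket types, item ids, and per-bucket placement algorithms); the bucket names are invented for illustration.

```python
from dataclasses import dataclass, field

# Toy model of a CRUSH-style hierarchy: interior buckets (root, racks)
# contain children; leaves (disks) carry the actual storage weight.
@dataclass
class Bucket:
    name: str
    weight: float = 0.0                    # leaf capacity; interior weight is derived
    children: list["Bucket"] = field(default_factory=list)

    def total_weight(self) -> float:
        """An interior bucket's weight is the sum of its children's weights."""
        if not self.children:
            return self.weight
        return sum(c.total_weight() for c in self.children)

root = Bucket("root", children=[
    Bucket("rack1", children=[Bucket("disk11", 1.0), Bucket("disk12", 1.0)]),
    Bucket("rack2", children=[Bucket("disk21", 2.0)]),
])
```

Because replica selection descends this tree one level at a time, placing each replica under a different rack gives fault-domain separation for free.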
CRUSH Migration Strategy
CRUSH migration involves receiving map updates, recomputing PG‑OSD mappings, and determining migration directions based on differences between old and new maps (e.g., moving a replica from OSD 3 to OSD 4).
Problems with CRUSH
Two main issues are identified: (1) the default straw bucket can cause 2–3× the theoretical migration rate, and (2) the algorithm does not guarantee balanced PG distribution, especially when each node holds few PGs.
DPRD Goals
Goals (under a hierarchical topology with fault‑domain separation):
1. The replica movement rate should approach the theoretical minimum.
2. Replica distribution should be as uniform as possible.
Service phases:
1. Initial replica placement.
2. Redistribution during expansion.
3. Redistribution during shrinkage.
DPRD Strategy
In DPRD, a bucket's Factor is defined as partitions / weight, the number of partitions per unit weight of that bucket. The Base Factor equals 1 / weight, the change in Factor caused by adding or removing a single partition. The Average Factor is the sum of partitions divided by the sum of weights across sibling buckets, and serves as the balance reference during scaling.
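These three quantities can be written directly from the definitions. The function names below are mine, not Zeppelin's:

```python
def factor(partitions: int, weight: float) -> float:
    """Partitions per unit weight of one bucket."""
    return partitions / weight

def base_factor(weight: float) -> float:
    """How much one partition shifts this bucket's Factor."""
    return 1 / weight

def average_factor(buckets) -> float:
    """buckets: iterable of (partitions, weight) pairs for sibling buckets."""
    total_p = sum(p for p, _ in buckets)
    total_w = sum(w for _, w in buckets)
    return total_p / total_w
```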
DPRD's select‑N step differs from CRUSH's: at each level it computes every child bucket's Factor and chooses the N buckets with the smallest Factors, so each partition lands on the least‑loaded buckets at that level.
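A minimal sketch of that select-N step, assuming a toy child-bucket representation (the dict fields and rack names are invented):

```python
import heapq

def select_n(children, n):
    """Pick the n child buckets with the smallest Factor (partitions / weight)."""
    return heapq.nsmallest(n, children, key=lambda c: c["partitions"] / c["weight"])

racks = [
    {"name": "rack1", "partitions": 10, "weight": 2.0},  # Factor 5.0
    {"name": "rack2", "partitions": 3,  "weight": 1.0},  # Factor 3.0
    {"name": "rack3", "partitions": 8,  "weight": 4.0},  # Factor 2.0
]
print([r["name"] for r in select_n(racks, 2)])  # → ['rack3', 'rack2']
```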
Selection Process
Step 1: From the overloaded rack (e.g., rack 5), choose the host whose Factor deviates most positively from the average (e.g., host 52). Step 2: Recursively descend to select a child bucket until a node bucket is reached. Step 3: Within that node bucket, pick a partition not already present in the under‑loaded rack (e.g., rack 2) and move it there. If no such partition exists, repeat Step 1 with the next most deviating host.
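The three steps above can be sketched as follows, under a toy model where each host carries a list of partition ids; the structures and names are illustrative, not Zeppelin's actual types.

```python
def pick_move(overloaded_rack, underloaded_rack, avg_factor):
    # Step 1: visit hosts in order of how far their Factor exceeds the average.
    hosts = sorted(overloaded_rack["hosts"],
                   key=lambda h: len(h["partitions"]) / h["weight"] - avg_factor,
                   reverse=True)
    # Any partition already present in the target rack is ineligible there.
    present = set().union(*(set(h["partitions"]) for h in underloaded_rack["hosts"]))
    for host in hosts:                     # fall through to the next host if needed
        for p in host["partitions"]:       # Steps 2-3: find a movable partition
            if p not in present:
                return host["name"], p
    return None                            # no legal move out of this rack
```

For example, if host52 holds partitions {1, 3, 4} and the under-loaded rack already holds {1, 3}, only partition 4 is eligible to move.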
Balancing and Unbalance Definition
A bucket is considered unbalanced when its Factor deviates from the Average Factor, in either direction, by more than its Base Factor.
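The definition reduces to a one-line predicate (function name is mine):

```python
def is_unbalanced(partitions: int, weight: float, avg_factor: float) -> bool:
    """True when |Factor - Average Factor| exceeds the Base Factor (1 / weight)."""
    return abs(partitions / weight - avg_factor) > 1 / weight
```

Using the Base Factor as the tolerance means a bucket is "balanced" as long as moving a single partition could not bring it closer to the average.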
Overall Balancing Process
DPRD balances from the root downwards: after equalizing the rack layer, it proceeds to the host layer, and continues recursively until the entire tree is balanced.
Implementation for Expansion
Adding a bucket temporarily unbalances the tree. DPRD then selects the most overloaded host, recursively chooses a child bucket, and moves a partition from the overloaded side to the under‑loaded side, repeating until balance is restored.
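A simplified sketch of that rebalancing loop at one level of the tree: move one partition at a time from the heaviest child to the lightest until the heaviest no longer exceeds the Average Factor by its Base Factor. The dict layout is assumed for illustration.

```python
def balance(children):
    """children: list of dicts with a "partitions" list and a "weight"."""
    def fac(c):
        return len(c["partitions"]) / c["weight"]
    avg = (sum(len(c["partitions"]) for c in children)
           / sum(c["weight"] for c in children))
    while True:
        hi = max(children, key=fac)        # most overloaded child
        lo = min(children, key=fac)        # least loaded child
        if not hi["partitions"] or fac(hi) - avg <= 1 / hi["weight"]:
            break                          # within Base Factor of average: done
        lo["partitions"].append(hi["partitions"].pop())
```

Note the stop condition reuses the unbalance definition, so the loop terminates as soon as no child deviates by more than one partition's worth of weight.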
Implementation for Shrinkage
Rather than removing all replicas hosted by a departing bucket at once (which would cause excessive movement), DPRD removes only the replicas that must go and, at the choose_n layer, selects a new bucket that does not yet hold the partition to receive a replacement, following a four‑step procedure: record the affected partition, delete the replica, select a target bucket, and re‑initialize the partition there.
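The four steps can be sketched on the same toy bucket model used above (record, delete, select, re-initialize); all structures and names here are illustrative, not the actual Zeppelin implementation.

```python
def shrink(buckets, removed_name):
    """Retire one bucket, re-placing each of its replicas on a surviving bucket."""
    removed = next(b for b in buckets if b["name"] == removed_name)
    survivors = [b for b in buckets if b["name"] != removed_name]
    for p in list(removed["partitions"]):      # 1. record the affected partitions
        removed["partitions"].remove(p)        # 2. delete the retiring replica
        # 3. select the lightest surviving bucket that lacks this partition
        candidates = [b for b in survivors if p not in b["partitions"]]
        target = min(candidates, key=lambda b: len(b["partitions"]) / b["weight"])
        target["partitions"].append(p)         # 4. re-initialize the replica there
    return survivors
```

Only the replicas on the departing bucket move; replicas elsewhere stay put, which is what keeps the movement rate near the theoretical minimum.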
Conclusion
The DPRD strategy maintains fault‑domain separation while achieving near‑uniform data distribution and minimizing replica movement during both scaling up and down.
References
GitHub pull request (https://github.com/Qihoo360/zeppelin-client/pull/21)
CRUSH paper (https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf)
Zeppelin source code (https://github.com/Qihoo360/zeppelin-client)
Discussion on distributed storage data placement methods (http://catkang.github.io/2017/12/17/data-placement.html)
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.