How DPU Redefines Data Center Storage for AI and Cloud Workloads
This article analyzes the technical principles, architectural innovations, and real‑world scenarios of Data Processing Units (DPUs), showing how they resolve storage‑CPU mismatches, eliminate excessive east‑west traffic, and accelerate failure recovery, thereby becoming core infrastructure for AI and cloud computing.
Technical Essence of DPU
A Data Processing Unit (DPU) is a standalone compute node that integrates network processing, storage offload, and security isolation. By taking over storage‑protocol handling, data verification, and replica synchronization, the DPU frees the host CPU for AI training, business logic, or other compute‑intensive workloads.
Typical high‑end DPUs such as NVIDIA BlueField‑3 provide:
Multi‑core compute: 16 × ARM Cortex‑A78 cores, capable of running a full OS and storage services (e.g., Ceph OSD, distributed RAID).
Specialised accelerators: hardware engines for encryption/decryption, compression, and erasure‑coding (EC). An EC rebuild of 1 TB that takes hours on a CPU completes in minutes on a DPU with negligible host CPU load (a rough model follows this list).
High‑speed interconnect: 400 Gbps Ethernet or InfiniBand, plus NVMe‑over‑Fabrics (NVMe‑oF) with RDMA, delivering <10 µs latency for remote NVMe access (within 5 % of local performance).
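To put the "hours versus minutes" claim into perspective, here is a back‑of‑envelope sketch comparing rebuild times; the throughput figures are assumptions for illustration only, not measurements of any particular CPU or DPU.

```python
# Rough rebuild-time model. Assumed figures: a loaded host CPU can spare only a
# few hundred MB/s for erasure-coding math, while a DPU's dedicated EC engine
# sustains tens of GB/s without touching host cores.

TB = 10**12  # bytes

def rebuild_time_s(capacity_bytes: float, ec_throughput_gb_s: float) -> float:
    """Seconds to re-encode `capacity_bytes` at `ec_throughput_gb_s` GB/s."""
    return capacity_bytes / (ec_throughput_gb_s * 10**9)

print(f"CPU @ 0.2 GB/s : {rebuild_time_s(TB, 0.2) / 3600:.1f} h")   # ~1.4 h
print(f"DPU @ 20 GB/s  : {rebuild_time_s(TB, 20) / 60:.1f} min")    # ~0.8 min
```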
Three‑Way Architectural Revolution
1. Resource Mismatch – Independent Scaling
In traditional coupled architectures the CPU must be over‑provisioned for storage tasks, leading to <30 % CPU utilisation when I/O is low, while massive NVMe deployments become CPU‑bound. Decoupling compute and storage with DPUs allows:
Compute nodes to focus on business workloads.
DPUs to handle all storage logic.
Scalable storage capacity by simply adding NVMe SSDs to JBOF (just‑a‑bunch‑of‑flash) nodes, without upgrading CPUs.
Test results show CPU utilisation ↑ 40 %, NVMe SSD utilisation > 90 %, and ROI > 2×.
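A minimal sizing sketch of the scaling argument above, using hypothetical per‑SSD figures: in the coupled design every added terabyte also consumes host cores for the storage stack, whereas in the DPU + JBOF design capacity grows by adding SSDs alone.

```python
# Illustrative capacity planning; per-SSD capacity and per-SSD core cost are
# assumed values, not vendor specifications.
import math

SSD_TB = 15.36        # assumed usable capacity per NVMe SSD
CORES_PER_SSD = 4     # assumed host cores consumed by the storage stack per SSD (coupled design)

def coupled_extra_cores(extra_tb: float) -> int:
    """Host cores that must be added alongside the flash in a coupled node."""
    return math.ceil(extra_tb / SSD_TB) * CORES_PER_SSD

def disaggregated_extra_cores(extra_tb: float) -> int:
    """With DPU-attached JBOFs the storage stack runs on the DPU, not on host cores."""
    return 0

for grow_tb in (100, 500, 1000):
    print(f"+{grow_tb} TB -> coupled: {coupled_extra_cores(grow_tb)} host cores, "
          f"DPU/JBOF: {disaggregated_extra_cores(grow_tb)} host cores")
```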
2. Traffic Flood – Near‑Zero East‑West Flow
Conventional three‑replica storage generates 2 GB of replica traffic for every 1 GB of user data, consuming > 60 % of network bandwidth. DPUs perform replica creation locally in DPU memory, eliminating cross‑node traffic. In a 100‑node Ceph cluster, east‑west traffic drops to near zero, and write throughput improves by 17 % for three‑replica mode and 174 % for EC mode.
3. Failure Recovery – Minutes Instead of Hours
Traditional rebuilds rely on the failed node’s CPU, limiting TB‑scale recovery to hours and degrading cluster performance by more than 50 %. DPUs use SR‑IOV to split the affected NVMe SSD into 8–16 virtual functions (VFs); each VF’s data range can be rebuilt by a different DPU, enabling a parallel rebuild. In a 6‑disk RAID‑6 configuration, random read reaches 4.22 M IOPS and a 1 TB rebuild completes in 20 minutes with < 10 % performance impact.
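The gain comes from parallelism: splitting the drive’s address space into VFs lets many DPUs rebuild disjoint ranges at once. A rough model with an assumed per‑path rebuild bandwidth (real rebuilds also hit limits from the surviving drives and the fabric) shows the collapse from hours to minutes:

```python
# Assumed figure: each rebuild path (one DPU handling one VF's range) sustains
# about 0.1 GB/s once EC decode, reads from surviving drives, and writes to the
# spare are accounted for.

def rebuild_minutes(capacity_tb: float, parallel_paths: int, per_path_gb_s: float) -> float:
    aggregate_gb_s = parallel_paths * per_path_gb_s
    return capacity_tb * 1000 / aggregate_gb_s / 60

print(f"1 path  : {rebuild_minutes(1, 1, 0.1):.0f} min (~2.8 h)")
print(f"16 paths: {rebuild_minutes(1, 16, 0.1):.0f} min")
```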
Scenario‑Based DPU Solutions
1. CSAL QLC Acceleration
QLC SSDs are cost‑effective (< 60 % of TLC price per TB) but suffer from poor random‑write performance and a high write‑amplification factor (WAF ≈ 10). The Cloud Storage Acceleration Layer (CSAL) on the DPU implements a two‑tier cache:
Small random writes (e.g., 4 KB) are cached in TLC SSD or DPU DRAM.
When the cache accumulates to 64 KB–128 KB, the DPU flushes the data sequentially to the QLC SSD.
Results: 4 KB random‑write speed ↑ 20×, WAF ↓ from ≈ 10 to 1.2, SSD lifespan ↑ 3×, and overall hardware cost ↓ 40 % for AI checkpoint workloads.
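A minimal sketch of the write‑shaping idea, not the actual CSAL code (class names and the flush threshold are invented for illustration): small writes accumulate in a fast buffer and reach the QLC device only as large sequential chunks.

```python
FLUSH_THRESHOLD = 128 * 1024            # flush once ~128 KB has accumulated (assumption)

class WriteShapingCache:
    """Absorbs small random writes, emits large sequential writes to QLC."""
    def __init__(self, qlc_device):
        self.buffer = bytearray()
        self.qlc = qlc_device

    def write(self, payload: bytes) -> None:
        self.buffer += payload          # land the small write in the TLC/DRAM-backed buffer
        if len(self.buffer) >= FLUSH_THRESHOLD:
            self.qlc.sequential_write(bytes(self.buffer))   # one big write to QLC
            self.buffer.clear()

class FakeQLC:
    """Stand-in device that records the size of every write it receives."""
    def __init__(self):
        self.write_sizes = []
    def sequential_write(self, chunk: bytes) -> None:
        self.write_sizes.append(len(chunk))

qlc = FakeQLC()
cache = WriteShapingCache(qlc)
for _ in range(64):                     # 64 random 4 KB writes from the host
    cache.write(b"\x00" * 4096)
print(qlc.write_sizes)                  # [131072, 131072]: two big writes instead of 64 small ones
```

A production cache would also persist a logical‑to‑physical mapping so reads can be served from the buffer before a flush; the sketch omits that to keep the shaping mechanism visible.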
2. SR‑IOV Storage Virtualisation
Ceph performance scales with the number of OSDs. By applying SR‑IOV, each NVMe SSD can expose 8 VFs, each mapped to a separate OSD. Example deployment:
JBOF server: 8 × Samsung PM1743 NVMe SSDs
BlueField‑3 DPUs: 8 nodes (one per SSD)
Total OSDs: 8 SSDs × 8 VFs = 64 OSDs
Measured random read throughput reaches 32.46 GB/s – a 176 % improvement over a traditional 3‑node x86 solution – while hardware cost drops by 50 %. Adjusting Ceph CRUSH rules to keep replicas and EC blocks on the same DPU eliminates cross‑node traffic, further boosting write performance.
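The OSD arithmetic can be written out as a simple enumeration; the locality note mirrors the CRUSH adjustment just described, although the real rule lives in Ceph’s CRUSH map rather than in Python.

```python
# 8 physical SSDs, each exposing 8 SR-IOV virtual functions; every VF backs one OSD.
PHYSICAL_SSDS = 8
VFS_PER_SSD = 8

osd_to_device = {}                      # osd id -> (ssd index, vf index)
for ssd in range(PHYSICAL_SSDS):
    for vf in range(VFS_PER_SSD):
        osd_to_device[ssd * VFS_PER_SSD + vf] = (ssd, vf)

print(len(osd_to_device), "OSDs")       # 64

# Locality idea behind the CRUSH tweak: if all OSDs chosen for a placement
# group sit behind the same DPU/JBOF, replica and EC traffic never crosses the
# east-west network (traded against coarser failure isolation).
```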
3. Distributed RAID (XiRaid)
Traditional distributed storage protects data at the file or object layer (e.g., three‑replica, EC), incurring complexity and performance loss. XiRaid moves RAID logic to the block layer and offloads it to DPUs, enabling a single‑replica layout with EC‑level reliability. In a 6‑disk RAID‑6 configuration:
Random read: 4.22 M IOPS
Sequential write: 36 GB/s (≈ 60 % faster than classic EC)
Storage utilisation: > 90 % (vs. 33 % for three‑replica)
No changes to the upper‑layer storage software (Ceph, MinIO, HDFS) are required; the DPU presents a standard block device.
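The utilisation comparison is plain arithmetic; the sketch below shows raw scheme efficiency only (metadata and spare capacity ignored), and the usable fraction approaches the quoted > 90 % as the stripe widens.

```python
def replica_efficiency(copies: int) -> float:
    """Usable fraction of raw capacity under n-way replication."""
    return 1 / copies

def raid6_efficiency(disks_per_stripe: int) -> float:
    """Usable fraction under RAID-6: two parity disks per stripe."""
    return (disks_per_stripe - 2) / disks_per_stripe

print(f"3-replica        : {replica_efficiency(3):.0%}")     # 33%
print(f"RAID-6, 6 disks  : {raid6_efficiency(6):.0%}")       # 67%
print(f"RAID-6, 24 disks : {raid6_efficiency(24):.0%}")      # 92%
```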
Future Outlook
DPUs are evolving toward an “intelligent offload” era:
AI‑driven scheduling: Predictive caching based on historical I/O patterns, proactive fault detection, and data migration.
Heterogeneous collaboration: Tight integration with GPUs/FPGAs – DPUs pre‑process and stage data, GPUs execute model inference/training.
Edge deployment: Low‑power DPUs will enable storage‑compute disaggregation at the edge for industrial IoT, autonomous driving, and other latency‑sensitive workloads.
These trends transform storage from a passive support component into an active, programmable resource that underpins AI, cloud, and edge computing at scale.