Why DPDK Memory Allocation Slows Down and How to Fix It
This article investigates the performance degradation caused by DPDK memory fragmentation, explains the underlying heap and element management, presents detailed profiling and test results, and proposes two practical solutions that dramatically reduce allocation latency and improve overall system throughput.
Background
DPDK (Data Plane Development Kit) provides high‑performance packet processing and uses a custom memory management interface based on huge pages. SPDK (Storage Performance Development Kit) builds on DPDK for storage workloads. Common DPDK memory APIs such as rte_malloc, rte_realloc and rte_free are heavily used in the I/O path.
Problem Observation
When allocating many 4 KB objects in a single thread, the average allocation time grows linearly with the number of allocations, contrary to the expectation that each 4 KB allocation should take roughly the same time.
<code>uint64_t entry_time, elapsed_us;
size_t size = 4096;
unsigned align = 4096;

for (int j = 0; j < 10; j++) {
	entry_time = rte_get_timer_cycles();
	for (int i = 0; i < 2000; i++) {
		/* allocations are deliberately never freed so that
		 * fragmentation accumulates across iterations */
		rte_malloc(NULL, size, align);
	}
	elapsed_us = (rte_get_timer_cycles() - entry_time) * 1000000 /
		     rte_get_timer_hz();
	printf("total alloc time %lu us, avg time %lu us\n",
	       elapsed_us, elapsed_us / 2000);
}
</code>The table and flame‑graph show that the function find_suitable_element dominates the cost, traversing a growing list of elements in free_head[2].
Analysis
DPDK maintains a per‑NUMA‑node heap. Each contiguous memory block is represented by a malloc_elem object and inserted into one of 13 freelists (free_head[0] … free_head[12]) based on size ranges (e.g., 0‑256 B, 256‑1024 B, 1024‑4096 B, …). Allocation proceeds by:
1. Selecting the heap according to the current CPU.
2. Choosing the freelist that matches the requested size.
3. Scanning the list to find an element that satisfies size and alignment, then splitting it.
When a large element is split to satisfy a 4 KB request with a 4 KB alignment, the remaining tail (often ~2.4 KB) is inserted back into free_head[2]. Repeated allocations therefore increase the number of small elements in that list, making each subsequent scan longer.
Verification
Changing the alignment from 4096 bytes to 64 bytes eliminates the tail insertion, and the average allocation time stabilises around 0.7‑0.8 µs without linear growth.
Solutions
Solution 1: Extend struct malloc_heap with an array free_head_max_size that records the maximum element size in each freelist. During allocation, if the requested size exceeds this maximum, the list can be skipped entirely.
<code>struct malloc_heap {
	rte_spinlock_t lock;
	LIST_HEAD(, malloc_elem) free_head[RTE_HEAP_NUM_FREELISTS];
	/* new: largest element currently present in each freelist */
	size_t free_head_max_size[RTE_HEAP_NUM_FREELISTS];
	...
};
</code>Benchmarks show allocation latency stabilises around 0.5 µs, a >30× improvement over the original path.
Solution 2: Change the freelist selection rule to start the search in free_head[3] (size range 4 KB‑16 KB) instead of free_head[2]. This reduces the number of small elements examined and yields similar performance gains, with the added benefit of better handling boundary sizes such as 16 KB and 64 KB.
Testing & Results
Extensive single‑thread and multi‑thread benchmarks (up to 16 threads) were performed for various allocation sizes (64 B to 524,288 B) and three test patterns (bulk, no‑bulk, realloc). Both solutions eliminated the linear latency increase and provided 15‑50 % speed‑ups for 4 KB, 64 KB and 1 MB allocations, while also reducing lock contention in multi‑process scenarios.
Conclusion
DPDK memory fragmentation is caused by large alignment requirements that generate many small tail elements in free_head[2] . By either tracking maximum element sizes per freelist or by shifting the search to a larger freelist, the fragmentation effect is mitigated, yielding up to 30× latency reduction in the worst case and consistent 15‑50 % improvements in normal workloads. The patch has been submitted to the DPDK community and merged into the mainline in February 2023.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.