Why DPDK Memory Allocation Slows Down and How to Fix It
This article investigates the performance degradation caused by DPDK memory fragmentation, explains the underlying heap and element management, presents detailed profiling and test results, and proposes two practical solutions that dramatically reduce allocation latency and improve overall system throughput.
Background
DPDK (Data Plane Development Kit) provides high‑performance packet processing and uses a custom memory management interface based on huge pages. SPDK (Storage Performance Development Kit) builds on DPDK for storage workloads. Common DPDK memory APIs such as rte_malloc, rte_realloc and rte_free are heavily used in the I/O path.
Problem Observation
When allocating many 4 KB objects in a single thread, the average allocation time grows linearly with the number of allocations, contrary to the expectation that each 4 KB allocation should take roughly the same time.
<code>uint64_t entry_time, elapsed_us;
size_t size = 4096;
unsigned align = 4096;

for (int j = 0; j < 10; j++) {
	entry_time = rte_get_timer_cycles();
	for (int i = 0; i < 2000; i++) {
		/* allocations are deliberately never freed so that
		 * fragmentation accumulates across iterations */
		rte_malloc(NULL, size, align);
	}
	elapsed_us = (rte_get_timer_cycles() - entry_time) * 1000000 /
		     rte_get_timer_hz();
	printf("total alloc time %lu us, avg time %lu us\n",
	       elapsed_us, elapsed_us / 2000);
}
</code>The table and flame‑graph show that the function find_suitable_element dominates the cost, traversing a growing list of elements in free_head[2].
Analysis
DPDK maintains a per‑NUMA‑node heap. Each contiguous memory block is represented by a malloc_elem object and inserted into one of 13 freelists (free_head[0] … free_head[12]) based on size ranges (e.g., 0‑256 B, 256‑1024 B, 1024‑4096 B, …). Allocation proceeds by:
1. Selecting the heap according to the current CPU.
2. Choosing the freelist that matches the requested size.
3. Scanning the list to find an element that satisfies size and alignment, then splitting it.
When a large element is split to satisfy a 4 KB request with a 4 KB alignment, the remaining tail (often ~2.4 KB) is inserted back into free_head[2]. Repeated allocations therefore increase the number of small elements in that list, making each subsequent scan longer.
Verification
Changing the alignment from 4096 bytes to 64 bytes eliminates the tail insertion, and the average allocation time stabilises around 0.7‑0.8 µs without linear growth.
Solutions
Solution 1: Extend struct malloc_heap with an array free_head_max_size that records the maximum element size in each freelist. During allocation, if the requested size exceeds this maximum, the list can be skipped entirely.
<code>struct malloc_heap {
	rte_spinlock_t lock;
	LIST_HEAD(, malloc_elem) free_head[RTE_HEAP_NUM_FREELISTS];
	/* new: largest element currently present in each freelist */
	size_t free_head_max_size[RTE_HEAP_NUM_FREELISTS];
	...
};
</code>Benchmarks show allocation latency stabilises around 0.5 µs, a >30× improvement over the original path.
Solution 2: Change the freelist selection rule to start the search in free_head[3] (size range 4 KB‑16 KB) instead of free_head[2]. This reduces the number of small elements examined and yields similar performance gains, with the added benefit of better handling boundary sizes such as 16 KB and 64 KB.
Testing & Results
Extensive single‑thread and multi‑thread benchmarks (up to 16 threads) were performed for various allocation sizes (64 B to 524,288 B) and three test patterns (bulk, no‑bulk, realloc). Both solutions eliminated the linear latency increase and provided 15‑50 % speed‑ups for 4 KB, 64 KB and 1 MB allocations, while also reducing lock contention in multi‑process scenarios.
Conclusion
DPDK memory fragmentation is caused by large alignment requirements that generate many small tail elements in free_head[2] . By either tracking maximum element sizes per freelist or by shifting the search to a larger freelist, the fragmentation effect is mitigated, yielding up to 30× latency reduction in the worst case and consistent 15‑50 % improvements in normal workloads. The patch has been submitted to the DPDK community and merged into the mainline in February 2023.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.