Achieving Up to 3× Speedup for dma_map_sg, dma_unmap_sg and DMA Sync on Arm64
This article explains how batching cache‑sync operations in dma_map_sg, dma_unmap_sg and dma_sync_sg on arm64 can make them up to three times faster, describes the kernel patches that implement the batching, and presents benchmark results on Dimensity 9500 and RK3588 platforms.
Problem
On systems without DMA coherence, dma_map_sg(), dma_unmap_sg(), dma_sync_sg_for_device() and dma_sync_sg_for_cpu() perform a cache clean or invalidate for every scatter‑gather (sg) entry. When a list contains thousands of entries (e.g., 10 000), the kernel issues the dc cache‑maintenance instructions for one entry and then waits on a dsb barrier before moving to the next, so the barrier cost is paid once per entry. This shows up as hot spots in flame graphs and as noticeable latency.
Architectural Insight
ARM64 allows the cache‑sync instructions (dc) to be issued for all entries first, followed by a single dsb to wait for completion, as described in the ARM specification. The same batching principle was previously applied to TLBI in the mainline patch “arm64: support batched/deferred tlb shootdown during page reclamation/migration” (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43b3dfdd04553171488cb11d46d21948b6b90e27).
Patch Series
A six‑patch series titled “dma‑mapping: arm64: support batched cache sync” introduces three architecture‑specific callbacks. When the DMA‑mapping core detects arch support, it processes the first n sg entries in batch and finalises the batch with sync_dma_batch_flush().
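The difference between the per-entry path and the batched path can be sketched with a small userspace model. This is not the kernel code from the series: the types and functions prefixed with model_ are hypothetical, counters stand in for the real dc and dsb instructions, and only the role of the final flush step (sync_dma_batch_flush() in the article) is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters standing in for real instructions (assumption: a kernel
 * would issue "dc" by cache line and "dsb" barriers instead). */
struct sync_stats {
    size_t dc_ranges; /* cache-maintenance passes issued, one per entry */
    size_t barriers;  /* dsb-style waits for completion */
};

/* Hypothetical sg entry holding only what the model needs. */
struct model_sg {
    unsigned long addr;
    size_t len;
};

/* Model of cache maintenance over one entry, without waiting. */
static void model_dc_range(struct sync_stats *st, const struct model_sg *sg)
{
    (void)sg; /* address and length are irrelevant to the count */
    st->dc_ranges++;
}

/* Model of a barrier that waits for outstanding maintenance. */
static void model_dsb(struct sync_stats *st)
{
    st->barriers++;
}

/* Unbatched path: every entry pays for its own barrier. */
static void model_sync_sg_unbatched(struct sync_stats *st,
                                    const struct model_sg *sgl, size_t nents)
{
    for (size_t i = 0; i < nents; i++) {
        model_dc_range(st, &sgl[i]);
        model_dsb(st);
    }
}

/* Batched path: queue maintenance for all entries, then wait once.
 * The single barrier plays the role of sync_dma_batch_flush(). */
static void model_sync_sg_batched(struct sync_stats *st,
                                  const struct model_sg *sgl, size_t nents)
{
    for (size_t i = 0; i < nents; i++)
        model_dc_range(st, &sgl[i]);
    if (nents > 0)
        model_dsb(st);
}

/* Core-side dispatch: batch only when the architecture supports it. */
static void model_sync_sg(struct sync_stats *st, const struct model_sg *sgl,
                          size_t nents, bool arch_has_batch)
{
    assert(st != NULL && (sgl != NULL || nents == 0));
    if (arch_has_batch)
        model_sync_sg_batched(st, sgl, nents);
    else
        model_sync_sg_unbatched(st, sgl, nents);
}
```

For a 10 000-entry list, the model counts 10 000 barriers on the unbatched path and one on the batched path, while the amount of cache maintenance is the same in both. The real saving depends on how expensive each dsb is relative to the dc work, which this model does not try to capture.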
Benchmark Results
On a MediaTek Dimensity 9500, dma_map_sg() time decreased by 64.61% and dma_unmap_sg() time decreased by 66.60%.
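To relate these percentages to the "up to 3×" figure in the title, a time reduction r can be converted to a speedup factor S. The conversion below is plain arithmetic on the two Dimensity 9500 numbers, not an additional measurement:

```latex
S = \frac{t_{\text{before}}}{t_{\text{after}}} = \frac{1}{1 - r}

S_{\text{map}}   = \frac{1}{1 - 0.6461} = \frac{1}{0.3539} \approx 2.83

S_{\text{unmap}} = \frac{1}{1 - 0.6660} = \frac{1}{0.3340} \approx 2.99
```

So dma_map_sg() runs roughly 2.8 times faster and dma_unmap_sg() roughly 3 times faster on that platform.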
Running the same patches on an RK3588 Rock5B+ with Linux 6.19‑rc1 showed extensive batch processing of DMA cache syncs.
Build Instructions
Place the patches in the Armbian userpatch directory and compile with:
./compile.sh BOARD=rock-5b-plus BRANCH=edge KERNELSOURCE='https://github.com/torvalds/linux' KERNELBRANCH='branch:master' kernel
Patch Set Location
Full patch series:
https://lore.kernel.org/lkml/CAGsJ_4yKeUHgxRJJHiOcdaVcV1pjeHRjbybvEs5YLm=AJoe-Dw@mail.gmail.com/. Patch 6 (dma‑iommu) enables DMA sync batching for IOVA link/unlink and is marked RFC, inviting testing on compatible hardware.
Related Optimisations
DMA‑buf mmap optimisation (potential 35× speedup), merged upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04c7adb5871ad04c9e3fd645570e21c93f1b2f54
DMA‑buf vmap optimisation (potential 17× speedup), merged upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-new&id=4412bae228651f5ab887ba971a47cc4f4bae234d
The current cache‑sync batching provides a potential 3× speedup and rounds out a fairly complete DMA‑buf optimisation chain.