Achieving Up to 3× Speedup for dma_map_sg, dma_unmap_sg and DMA Sync on Arm64

The article explains how batching cache‑sync operations for dma_map_sg, dma_unmap_sg and dma_sync_sg on arm64 can cut their execution time by up to three‑fold, details the kernel patches introduced, and presents benchmark results on Dimensity 9500 and RK3588 platforms.

Linux Code Review Hub

Problem

On systems whose devices are not DMA‑coherent, dma_map_sg(), dma_unmap_sg(), dma_sync_sg_for_device() and dma_sync_sg_for_cpu() perform cache maintenance (clean or invalidate) for every scatter‑gather (sg) entry. When a list contains thousands of entries (e.g., 10 000), the kernel issues the dc instructions for one entry and then waits on a dsb before moving to the next, so the barrier cost is paid once per entry. This shows up as hot spots in flame graphs and as noticeable latency.

Architectural Insight

ARM64 allows the cache‑maintenance instructions (dc) to be issued for all entries first, followed by a single dsb that waits for all of them to complete, as described in the Arm architecture specification. The same batching principle was previously applied to TLBI in the mainline patch “arm64: support batched/deferred tlb shootdown during page reclamation/migration” (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43b3dfdd04553171488cb11d46d21948b6b90e27).
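The difference between the two orderings can be illustrated with a small userspace model. This is only a sketch: it counts operations instead of executing real dc and dsb instructions, the 64-byte line size is an assumption, buffers are treated as line-aligned, and every name in it is hypothetical rather than taken from the kernel.

```c
#include <stddef.h>

/* Counters standing in for real instructions. Userspace cannot issue the
 * kernel's cache maintenance, so this only models how many operations
 * each strategy would issue. */
struct sync_stats {
    unsigned long dc_ops;   /* per-cache-line maintenance operations */
    unsigned long dsb_ops;  /* completion barriers */
};

#define MODEL_CACHE_LINE 64UL  /* assumed line size, not read from hardware */

/* Model one "dc" per cache line covering len bytes (start assumed aligned). */
static void model_dc_range(struct sync_stats *st, size_t len)
{
    st->dc_ops += (len + MODEL_CACHE_LINE - 1) / MODEL_CACHE_LINE;
}

static void model_dsb(struct sync_stats *st)
{
    st->dsb_ops++;
}

/* Unbatched: every sg entry is followed by its own barrier. */
static struct sync_stats sync_per_entry(const size_t *lens, size_t nents)
{
    struct sync_stats st = {0, 0};

    for (size_t i = 0; i < nents; i++) {
        model_dc_range(&st, lens[i]);
        model_dsb(&st);
    }
    return st;
}

/* Batched: issue all maintenance first, then wait once. */
static struct sync_stats sync_batched(const size_t *lens, size_t nents)
{
    struct sync_stats st = {0, 0};

    for (size_t i = 0; i < nents; i++)
        model_dc_range(&st, lens[i]);
    if (nents > 0)
        model_dsb(&st);
    return st;
}
```

Both functions issue the same number of dc operations for a given list; only the barrier count changes, from one per entry to one per list.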

Patch Series

A six‑patch series titled “dma‑mapping: arm64: support batched cache sync” introduces three architecture‑specific callbacks. When the DMA‑mapping core detects arch support, it processes the first n sg entries in batch and finalises the batch with sync_dma_batch_flush().
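The dispatch described above might look roughly like the following userspace sketch. The struct, the hook names and their signatures are hypothetical stand-ins chosen for illustration; they are not the callbacks defined by the series, and the counters exist only so the behaviour can be observed outside the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an sg entry (not the kernel's struct scatterlist). */
struct demo_sg {
    unsigned long addr;
    size_t len;
};

/* Hypothetical arch hooks; the real callback names and signatures may differ. */
struct demo_arch_ops {
    void (*sync_one)(unsigned long addr, size_t len);  /* maintenance + barrier */
    void (*batch_add)(unsigned long addr, size_t len); /* maintenance only; NULL if unsupported */
    void (*batch_flush)(void);                         /* one barrier; NULL if unsupported */
};

/* Counters so the effect can be observed in userspace. */
static unsigned long demo_maint_calls;
static unsigned long demo_barriers;

static void demo_sync_one(unsigned long addr, size_t len)
{
    (void)addr; (void)len;
    demo_maint_calls++;
    demo_barriers++;
}

static void demo_batch_add(unsigned long addr, size_t len)
{
    (void)addr; (void)len;
    demo_maint_calls++;
}

static void demo_batch_flush(void)
{
    demo_barriers++;
}

/* Core-side loop: use the batch hooks when both exist, otherwise fall
 * back to the per-entry path. */
static void demo_sync_sg(const struct demo_arch_ops *ops,
                         const struct demo_sg *sg, size_t nents)
{
    bool batched = ops->batch_add && ops->batch_flush;

    for (size_t i = 0; i < nents; i++) {
        if (batched)
            ops->batch_add(sg[i].addr, sg[i].len);
        else
            ops->sync_one(sg[i].addr, sg[i].len);
    }
    if (batched && nents > 0)
        ops->batch_flush();
}
```

Keeping the per-entry hook as a fallback means architectures that do not provide the batch hooks behave exactly as before.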

Benchmark Results

On a MediaTek Dimensity 9500, dma_map_sg() time decreased by 64.61% and dma_unmap_sg() time decreased by 66.60%.
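To relate these percentages to the "up to 3×" figure in the title, a time reduction r corresponds to a speedup of 1 / (1 - r). The helper below is an illustrative sketch with a hypothetical name, not part of the patch series.

```c
/* Convert a fractional time reduction (e.g. 0.6461 for 64.61%) into a
 * speedup factor: new_time = old_time * (1 - r), so
 * speedup = old_time / new_time = 1 / (1 - r).
 * Valid for 0 <= reduction < 1. */
static double speedup_from_reduction(double reduction)
{
    return 1.0 / (1.0 - reduction);
}
```

With the numbers above, a 64.61% reduction works out to roughly 2.83× for dma_map_sg(), and a 66.60% reduction to roughly 2.99× for dma_unmap_sg().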

Running the same patches on an RK3588 Rock5B+ with Linux 6.19‑rc1 showed extensive batch processing of DMA cache syncs.

Build Instructions

Place the patches in the Armbian userpatch directory and compile with:

./compile.sh BOARD=rock-5b-plus BRANCH=edge KERNELSOURCE='https://github.com/torvalds/linux' KERNELBRANCH='branch:master' kernel

Patch Set Location

Full patch series:

https://lore.kernel.org/lkml/CAGsJ_4yKeUHgxRJJHiOcdaVcV1pjeHRjbybvEs5YLm=AJoe-Dw@mail.gmail.com/

Patch 6 (dma‑iommu) enables DMA sync batching for IOVA link/unlink and is marked RFC, inviting testing on compatible hardware.

Related Optimisations

DMA‑buf mmap optimisation (potential 35× speedup) merged upstream:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04c7adb5871ad04c9e3fd645570e21c93f1b2f54

DMA‑buf vmap optimisation (potential 17× speedup), queued in the mm‑new branch of the mm tree:

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-new&id=4412bae228651f5ab887ba971a47cc4f4bae234d

The cache‑sync batching described here adds a potential 3× speedup, rounding out a fairly complete DMA‑buf optimisation chain.
