
Achieving Up to 3× Speedup for dma_map_sg, dma_unmap_sg and DMA Sync on Arm64

This article explains how batching cache-sync operations for dma_map_sg, dma_unmap_sg and dma_sync_sg on arm64 can make them up to three times faster, describes the kernel patches that introduce the batching, and presents benchmark results on Dimensity 9500 and RK3588 platforms.

Linux Code Review Hub

Dec 21, 2025

Problem

On systems where devices are not DMA-coherent, dma_map_sg(), dma_unmap_sg(), dma_sync_sg_for_device() and dma_sync_sg_for_cpu() perform a cache clean or invalidate for every scatter-gather (sg) entry. When a list contains thousands of entries (e.g., 10 000), the kernel issues the dc cache-maintenance instructions for one entry, waits on a dsb barrier, and only then moves on to the next entry. The repeated barriers show up as hot spots in flame graphs and add noticeable latency.

Architectural Insight

Arm64 allows the cache-maintenance instructions (dc) to be issued for all entries first, followed by a single dsb that waits for all of them to complete, as described in the Arm architecture specification. The same batching principle was previously applied to TLBI in the mainline patch “arm64: support batched/deferred tlb shootdown during page reclamation/migration”:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43b3dfdd04553171488cb11d46d21948b6b90e27
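
To make the difference concrete, here is a minimal user-space model of the two orderings. It is only a sketch: the helpers model_dc_range() and model_dsb() are hypothetical stand-ins that count operations rather than executing real dc or dsb instructions, the 64-byte cache line is an assumption, and buffer alignment is ignored.

```c
#include <stddef.h>

#define CACHE_LINE 64u /* assumed cache-line size in bytes */

/* Counters standing in for the real instructions (hypothetical model). */
struct sync_stats {
    unsigned long dc_ops;   /* modelled per-cache-line "dc" operations */
    unsigned long barriers; /* modelled "dsb" completion barriers */
};

/* Simplified scatter-gather entry: only the length matters here. */
struct seg {
    size_t len;
};

/* Model cache maintenance over one entry: one dc per cache line. */
static void model_dc_range(struct sync_stats *st, const struct seg *s)
{
    st->dc_ops += (s->len + CACHE_LINE - 1) / CACHE_LINE;
}

/* Model a dsb: wait for all outstanding maintenance to complete. */
static void model_dsb(struct sync_stats *st)
{
    st->barriers++;
}

/* Unbatched ordering: a barrier after every entry. */
struct sync_stats sync_per_entry(const struct seg *sgl, size_t n)
{
    struct sync_stats st = {0, 0};

    for (size_t i = 0; i < n; i++) {
        model_dc_range(&st, &sgl[i]);
        model_dsb(&st);
    }
    return st;
}

/* Batched ordering: all dc operations first, then a single barrier. */
struct sync_stats sync_batched(const struct seg *sgl, size_t n)
{
    struct sync_stats st = {0, 0};

    for (size_t i = 0; i < n; i++)
        model_dc_range(&st, &sgl[i]);
    if (n > 0)
        model_dsb(&st);
    return st;
}
```

Both functions issue the same number of modelled dc operations. Only the barrier count changes, from one per entry to one per list, which is where the saving is expected to come from.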

Patch Series

A six-patch series titled “dma-mapping: arm64: support batched cache sync” introduces three architecture-specific callbacks. When the DMA-mapping core detects that the architecture supports them, it processes the first n sg entries as a batch, without waiting for completion after each one, and finalises the batch with sync_dma_batch_flush().
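
The control flow described above might look roughly like the following user-space sketch. It is not the code from the patch series: apart from the name sync_dma_batch_flush, which comes from the text, every type, field and function here (arch_sync_ops, supports_batch, sync_one, batch_add, core_sync_sg and the demo_* helpers) is hypothetical and exists only to illustrate the detect, batch, flush pattern with a per-entry fallback.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified scatter-gather entry (hypothetical). */
struct sg_entry {
    size_t len;
};

/* Hypothetical per-architecture hooks; only the flush name echoes the text. */
struct arch_sync_ops {
    bool supports_batch;
    void (*sync_one)(const struct sg_entry *e, void *ctx);  /* maintenance + barrier */
    void (*batch_add)(const struct sg_entry *e, void *ctx); /* maintenance only */
    void (*sync_dma_batch_flush)(void *ctx);                /* single barrier */
};

/* Core-side loop: batch when the architecture allows it, else fall back. */
void core_sync_sg(const struct arch_sync_ops *ops,
                  const struct sg_entry *sgl, size_t n, void *ctx)
{
    if (!ops->supports_batch) {
        for (size_t i = 0; i < n; i++)
            ops->sync_one(&sgl[i], ctx);
        return;
    }
    for (size_t i = 0; i < n; i++)
        ops->batch_add(&sgl[i], ctx);
    if (n > 0)
        ops->sync_dma_batch_flush(ctx);
}

/* Demo hooks that only count what a real architecture would execute. */
struct counters {
    unsigned long maint;    /* entries that received cache maintenance */
    unsigned long barriers; /* completion barriers */
};

static void demo_sync_one(const struct sg_entry *e, void *ctx)
{
    struct counters *c = ctx;
    (void)e;
    c->maint++;
    c->barriers++;
}

static void demo_batch_add(const struct sg_entry *e, void *ctx)
{
    struct counters *c = ctx;
    (void)e;
    c->maint++;
}

static void demo_flush(void *ctx)
{
    struct counters *c = ctx;
    c->barriers++;
}

const struct arch_sync_ops demo_batched_ops = {
    .supports_batch = true,
    .sync_one = demo_sync_one,
    .batch_add = demo_batch_add,
    .sync_dma_batch_flush = demo_flush,
};

const struct arch_sync_ops demo_fallback_ops = {
    .supports_batch = false,
    .sync_one = demo_sync_one,
    .batch_add = NULL,
    .sync_dma_batch_flush = NULL,
};
```

Calling core_sync_sg() with demo_batched_ops on a three-entry list records three maintenance calls and one barrier, while demo_fallback_ops records three of each. Keeping the fallback path in the core means architectures without batching support behave exactly as before.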

Benchmark Results

On a MediaTek Dimensity 9500, dma_map_sg() time decreased by 64.61% and dma_unmap_sg() time decreased by 66.60%, which corresponds to speedups of roughly 2.8× and 3.0×.
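
The link between a percentage reduction in run time and the "up to 3×" headline is simple arithmetic: if a fraction r of the time is removed, the speedup is 1 / (1 - r). A small helper, with a hypothetical name chosen for this illustration, makes the conversion explicit:

```c
/*
 * Convert a percentage reduction in run time into a speedup factor.
 * Precondition: 0 <= reduction_percent < 100.
 */
double speedup_from_reduction(double reduction_percent)
{
    double remaining = 1.0 - reduction_percent / 100.0; /* fraction of time left */

    return 1.0 / remaining;
}
```

For the figures reported above, 64.61 gives a factor just above 2.8 and 66.60 gives a factor just below 3.0.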

Running the same patches on an RK3588 Rock5B+ with Linux 6.19‑rc1 showed extensive batch processing of DMA cache syncs.

Build Instructions

Place the patches in the Armbian userpatch directory and compile with:

./compile.sh BOARD=rock-5b-plus BRANCH=edge KERNELSOURCE='https://github.com/torvalds/linux' KERNELBRANCH='branch:master' kernel

Patch Set Location

Full patch series:

https://lore.kernel.org/lkml/CAGsJ_4yKeUHgxRJJHiOcdaVcV1pjeHRjbybvEs5YLm=AJoe-Dw@mail.gmail.com/

Patch 6 (dma-iommu) enables DMA sync batching for IOVA link/unlink and is marked RFC, inviting testing on compatible hardware.

Related Optimisations

DMA‑buf mmap optimisation (potential 35× speedup) merged upstream:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04c7adb5871ad04c9e3fd645570e21c93f1b2f54

DMA-buf vmap optimisation (potential 17× speedup), queued in the mm tree (mm-new branch):

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-new&id=4412bae228651f5ab887ba971a47cc4f4bae234d

The cache-sync batching described here adds a potential 3× speedup, rounding out a fairly complete chain of DMA-buf optimisations.


