Fundamentals 19 min read

Why a Mathematically Free Matrix Transpose Is 10× Slower on CPUs

A naive 1024×1024 matrix transpose can be ten times slower than an optimized version because the CPU sees memory as a linear address space, and row‑major layout combined with cache‑line granularity makes column‑wise accesses incur massive cache misses, which can be eliminated with blocking, prefetching and SIMD techniques.

IT Services Circle
IT Services Circle
IT Services Circle
Why a Mathematically Free Matrix Transpose Is 10× Slower on CPUs

Naïve transpose

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        B[j][i] = A[i][j];
    }
}

On a 1024×1024 matrix this version takes about 31 ms, while a cache‑friendly version can finish in about 3 ms – roughly a ten‑fold speedup.

Memory layout and cache line

C stores multidimensional arrays in row‑major order, so a 4×4 int matrix occupies consecutive addresses 0‑15. A CPU cache line is 64 bytes (16 int values). When a line is fetched, the whole 64 bytes are placed in L1 cache.

Row‑wise vs column‑wise access

Reading a matrix row‑wise (inner loop over columns) accesses memory sequentially, giving high cache‑line reuse (≈75 % hit rate in the 4‑element example). Reading column‑wise (as the naïve transpose does) jumps N elements each step, causing a new cache line for every element (≈0 % hit rate, 75 % of transferred bytes wasted).

Matrix transpose memory layout illustration
Matrix transpose memory layout illustration

Cache‑miss cost

A single memory access costs ~100 cycles, while an integer multiply costs 1 cycle. Consequently, cache misses dominate execution time; in a 1024×1024 transpose the CPU spends >95 % of its cycles waiting for data.

Prefetcher

The hardware prefetcher can predict regular access patterns (sequential or fixed‑stride) and fetch the next cache line ahead of time, but it cannot help with the irregular column‑wise jumps of the naïve transpose.

Blocked (tiled) transpose

Dividing the matrix into small tiles that fit into L1 cache eliminates column‑wise jumps inside each tile. A typical tile size is 8×8 (256 bytes, four cache lines) for a 64‑byte cache line and a 32 KB L1 cache.

#define BLOCK 8
for (int ii = 0; ii < N; ii += BLOCK) {
    for (int jj = 0; jj < N; jj += BLOCK) {
        for (int i = ii; i < ii + BLOCK; i++) {
            for (int j = jj; j < jj + BLOCK; j++) {
                B[j][i] = A[i][j];
            }
        }
    }
}

This approach reduces the naïve 31 ms to 5 ms (≈6× faster). Adding SIMD vectorisation halves the time again, reaching ≈3 ms (≈10× overall speedup).

Benchmark results (1024×1024 double matrix)

naïve transpose – 31.4 ms (1×)

blocked 8×8 – 5.2 ms (6×)

blocked 8×8 + SIMD – 3.1 ms (10×)

blocked 32×32 (L2‑aware) – 2.8 ms (11×)

Benchmark bar chart
Benchmark bar chart

Why tile size matters

If the tile is too small, the overhead of switching tiles dominates; if it is too large the tile no longer fits in cache and the same miss problem reappears. The optimal size depends on the cache hierarchy of the target CPU.

SIMD interaction

SIMD (e.g., an 8‑wide vector unit) can only operate efficiently on contiguous data. Inside a cache‑friendly tile the data are contiguous, so SIMD can be applied. In the naïve column‑wise case SIMD cannot help because the accesses are non‑contiguous.

GPU considerations

GPU shared memory is analogous to L1 cache but much smaller, and bank conflicts add another complication. CUDA transpose kernels therefore use tiles sized 32×33 to avoid bank conflicts while keeping the same cache‑friendly principle.

Practical takeaways

Memory is a linear address space laid out row‑major.

CPU moves data in 64‑byte cache lines and expects spatial locality.

Naïve transpose reads columns, causing ~75 % of each cache line to be wasted.

Prefetchers and SIMD cannot rescue column‑wise jumps.

Blocking the matrix so that each tile fits in L1 cache restores locality, unlocking cache, prefetcher and SIMD – up to 10× faster.

The same cache‑friendly principle explains performance gaps in matrix multiplication, convolutions, deep‑learning kernels (BLAS, cuDNN, oneDNN) and high‑level operations such as tensor.contiguous() in PyTorch or view‑based transposes in NumPy.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performance tuningSIMDcache optimizationCPU architectureblocked algorithmmatrix transpose
IT Services Circle
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.