Backend Development 13 min read

Why Loop Order Matters: Boost Java Matrix Multiplication Speed by 100×

This article demonstrates how reorganizing Java matrix‑multiplication loops and understanding Java 2‑D array storage and CPU cache hierarchies can turn a naïve implementation into a version that runs up to a hundred times faster, backed by JMH benchmark results.

Alibaba Cloud Developer

Mar 7, 2023

Why Loop Order Matters: Boost Java Matrix Multiplication Speed by 100×

The article shows that carefully organizing code can improve performance by orders of magnitude, using a matrix‑multiplication example in Java to illustrate the impact of loop ordering on CPU cache utilization.

Shock: Code Order Affects Performance

Three different loop orderings (A, B, C) are compared, and benchmark results reveal a clear performance hierarchy.

Why Performance Differs

Two essential knowledge points are required:

Java two‑dimensional array storage layout.

CPU cache hierarchy (L1, L2, L3, main memory) and its latency.

Java stores each row of a 2‑D array as a contiguous block of memory, but rows themselves are not guaranteed to be adjacent. The CPU cache consists of multiple levels with increasing size and latency, as shown in the diagrams.

Friendly vs. Unfriendly Traversal

A horizontal‑first (row‑major) traversal accesses memory in the order it is laid out, yielding a high cache‑hit rate (≈66.7%). A vertical‑first traversal forces the CPU to load a new row for each element, resulting in near‑zero cache hits.

Benchmark Results

Running JMH on a MacBook Pro (Intel i5, 4 cores, 16 GB RAM) produced the following average times (seconds) for N=1000:

Code A_ijk: 0.158 → 2.78; Code A_jik: 0.151 → 2.97; Code B_jki: 0.338 → 14.19; Code B_kji: 0.338 → 13.82; Code C_kij: 0.013 → 0.198; Code C_ikj: 0.016 → 0.251.

Conclusion

Because Java 2‑D arrays store rows contiguously but not necessarily adjacent to each other, accessing elements row‑wise (left‑to‑right, top‑to‑bottom) aligns with cache lines and maximizes cache reuse, while column‑wise access incurs many cache misses and can be up to a hundred times slower. The same principle applies to other languages.

import org.openjdk.jmh.annotations.*;
/**
 * 矩阵 C = AB 的计算
 * @author wzj
 * @date 2023/02/09
 */
@BenchmarkMode(Mode.AverageTime)
@State(value = Scope.Benchmark)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 10, time = 1)
public class ArrayTestBenchmark {
    private final int N = 1000;
    private final int[][] arrays_A = new int[N][N];
    private final int[][] arrays_B = new int[N][N];
    @Setup
    public void setUp() {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                arrays_A[i][j] = i + j;
                arrays_B[i][j] = i + j;
            }
        }
    }
    @Benchmark
    public void ijk() {
        final int[][] arrays_C = new int[N][N];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                int sum = 0;
                for (int k = 0; k < N; k++) {
                    sum += arrays_A[i][k] * arrays_B[k][j];
                }
                arrays_C[i][j] += sum;
            }
        }
        assert arrays_C.length > 0;
    }
    @Benchmark
    public void jik() {
        final int[][] arrays_C = new int[N][N];
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < N; i++) {
                int sum = 0;
                for (int k = 0; k < N; k++) {
                    sum += arrays_A[i][k] * arrays_B[k][j];
                }
                arrays_C[i][j] += sum;
            }
        }
        assert arrays_C.length > 0;
    }
    @Benchmark
    public void jki() {
        final int[][] arrays_C = new int[N][N];
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                int r_B = arrays_B[k][j];
                for (int i = 0; i < N; i++) {
                    arrays_C[i][j] += arrays_A[i][k] * r_B;
                }
            }
        }
        assert arrays_C.length > 0;
    }
    @Benchmark
    public void kji() {
        final int[][] arrays_C = new int[N][N];
        for (int k = 0; k < N; k++) {
            for (int j = 0; j < N; j++) {
                int r_B = arrays_B[k][j];
                for (int i = 0; i < N; i++) {
                    arrays_C[i][j] += arrays_A[i][k] * r_B;
                }
            }
        }
        assert arrays_C.length > 0;
    }
    @Benchmark
    public void kij() {
        final int[][] arrays_C = new int[N][N];
        for (int k = 0; k < N; k++) {
            for (int i = 0; i < N; i++) {
                int r_A = arrays_A[k][i];
                for (int j = 0; j < N; j++) {
                    arrays_C[i][j] += r_A * arrays_B[k][j];
                }
            }
        }
        assert arrays_C.length > 0;
    }
    @Benchmark
    public void ikj() {
        final int[][] arrays_C = new int[N][N];
        for (int i = 0; i < N; i++) {
            for (int k = 0; k < N; k++) {
                int r_A = arrays_A[k][i];
                for (int j = 0; j < N; j++) {
                    arrays_C[i][j] += r_A * arrays_B[k][j];
                }
            }
        }
        assert arrays_C.length > 0;
    }
}

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance benchmark Matrix multiplication CPU cache

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.