Fundamentals 11 min read

Why Row‑Major Access Beats Column‑Major in C: A Cache‑Level Deep Dive

The article explains why iterating a two‑dimensional array by rows runs dramatically faster than by columns, covering memory‑hierarchy basics, locality principles, cache behavior, cache‑line mechanics, and Linux perf measurements that reveal a 20‑fold speed gap caused by cache‑miss rates.

Liangxu Linux
Liangxu Linux
Liangxu Linux
Why Row‑Major Access Beats Column‑Major in C: A Cache‑Level Deep Dive

Introduction

This piece explores a common performance puzzle: accessing a 2‑D array row‑wise in C can be up to twenty times faster than column‑wise. It does so by presenting two tiny test programs, then diving into the underlying hardware concepts that cause the disparity.

Two Simple Test Programs

Both programs allocate an equally sized two‑dimensional array and assign values in a double loop. array1.c walks the array row by row, while array2.c walks it column by column. After compilation, the time command shows 0.528 s for array1 and 10.310 s for array2, a nearly 20× difference.

Storage Pyramid

The modern computer memory hierarchy resembles a pyramid: registers at the top (fastest, smallest), followed by multiple cache levels (L1, L2, L3), main memory, and finally large but slow storage such as disks or network drives. Each higher level acts as a cache for the level below it.

Storage pyramid diagram
Storage pyramid diagram

The principle that makes this hierarchy effective is the principle of locality : programs tend to reuse recently accessed data (temporal locality) and data that is close in address (spatial locality).

Principle of Locality

Temporal locality : If a datum is accessed now, it is likely to be accessed again soon.

Spatial locality : If a datum is accessed, nearby addresses are likely to be accessed shortly thereafter.

Cache Basics

CPU speed far exceeds memory speed, so caches (L1, L2, L3) sit between the CPU and main memory to bridge the gap. A cache hit (data found in cache) yields fast access; a miss forces a slower memory fetch. L1 is the smallest and fastest, L3 the largest and slowest of the three.

Cache hierarchy diagram
Cache hierarchy diagram

Cache Line

A cache line is the smallest unit transferred between memory and cache, typically 64 bytes. When a datum is accessed, the whole line containing it is loaded into cache, allowing subsequent accesses to nearby data to hit the cache.

Cache‑Line Example

Consider a machine with 64‑byte cache lines and the following array:

int a[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};

If the array starts on a cache‑line boundary and we read a[5] for the first time, the CPU misses in the cache, loads the entire 64‑byte line (the whole array) from memory, and stores it in cache. int x = a[5]; A subsequent read of a[10] hits the cache because the data is already present.

int y = a[10];

Why Row‑Major Beats Column‑Major

In C, multidimensional arrays are stored in row‑major order, meaning consecutive elements of a row are adjacent in memory. array1.c accesses elements sequentially, so after the first cache miss each subsequent access hits the same cache line.

Conversely, array2.c accesses elements column‑wise, jumping a stride of 10240 × 4 bytes between accesses. Each access lands in a different cache line, causing a miss each time and evicting previously loaded lines. This defeats both temporal and spatial locality, leading to a massive number of cache misses.

Observing the Difference with perf

The Linux perf tool measures hardware events. Running perf stat on the two programs shows that array1.c has an L1‑data‑cache‑miss rate of only 3.10 % , while array2.c suffers a staggering 93.03 % miss rate, directly explaining the 20× runtime gap.

perf stats for array1
perf stats for array1
perf stats for array2
perf stats for array2

Conclusion

The dramatic performance gap between row‑wise and column‑wise traversal stems from cache behavior governed by the principle of locality. Writing cache‑friendly code—accessing memory sequentially and respecting cache‑line boundaries—can yield order‑of‑magnitude speedups without any algorithmic changes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceCacheC programmingMemory Hierarchylocality
Liangxu Linux
Written by

Liangxu Linux

Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.