Why GCC’s Loop Vectorization Crashed My Code and How to Fix It

A client-reported segmentation fault was traced to a change in GCC's optimization level from -O2 to -O3. The trigger is the -ftree-loop-vectorize option, which in the affected GCC version generates code that uses the wrong struct size. This article walks through the analysis, the assembly inspection, the relevant NEON details, and a practical workaround.

Alibaba Cloud Developer

Dec 12, 2023

Background: a client reported a segmentation-fault crash after changing the GCC optimization level from -O2 to -O3. The crash is deterministic and can be reproduced with a minimal demo.

1. Finding the culprit

The problematic code copies data from an array of TileContentIndexStruct to an array of TileContentIndex:

void* readTileContentIndexCallback(TileContentIndexStruct *tileIndexData, int32_t count) {
    TileContentIndex* tileContentIndexList = new TileContentIndex[count];
    for (int32_t index = 0; index < count; index++) {
        TileContentIndexStruct &inData = tileIndexData[index];
        TileContentIndex &outData = tileContentIndexList[index];
        outData.urID = (uint16_t)inData.urCode;
        outData.adcode = (uint32_t)inData.adcode;
        outData.level = (uint16_t)inData.levelNumber;
        outData.southWestTileId = (uint32_t)inData.southWestTileId;
        outData.numRows = (uint16_t)inData.numRows;
        outData.numColumns = (uint16_t)inData.numColumns;
        outData.tileIndex = inData.tileContentIndex;
    }
    return tileContentIndexList;
}

Compiling with -O3 reproduces the crash, while -O2 runs correctly.
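The article does not show the definitions of the two structs, so reproducing the experiment needs stand-ins. The following is a minimal sketch of a self-checking harness. The field types in both structs are assumptions chosen only so that the casts are meaningful (they do not claim to match the real layouts), and convertMatches is a hypothetical helper that is not part of the original demo.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layouts: the article does not show the real definitions.
struct TileContentIndexStruct {
    int32_t urCode, adcode, levelNumber, southWestTileId, numRows, numColumns;
    uint32_t tileContentIndex;
};

struct TileContentIndex {
    uint16_t urID;
    uint32_t adcode;
    uint16_t level;
    uint32_t southWestTileId;
    uint16_t numRows;
    uint16_t numColumns;
    uint32_t tileIndex;
};

// Same loop as the article's callback, returning a typed pointer instead of void*.
TileContentIndex* readTileContentIndexCallback(TileContentIndexStruct* tileIndexData, int32_t count) {
    assert(count >= 0);
    TileContentIndex* tileContentIndexList = new TileContentIndex[count];
    for (int32_t index = 0; index < count; index++) {
        TileContentIndexStruct& inData = tileIndexData[index];
        TileContentIndex& outData = tileContentIndexList[index];
        outData.urID = (uint16_t)inData.urCode;
        outData.adcode = (uint32_t)inData.adcode;
        outData.level = (uint16_t)inData.levelNumber;
        outData.southWestTileId = (uint32_t)inData.southWestTileId;
        outData.numRows = (uint16_t)inData.numRows;
        outData.numColumns = (uint16_t)inData.numColumns;
        outData.tileIndex = inData.tileContentIndex;
    }
    return tileContentIndexList;
}

// Hypothetical helper: fills `count` inputs with distinct values, runs the
// conversion, and compares every output field with the expected cast.
bool convertMatches(int32_t count) {
    std::vector<TileContentIndexStruct> in(static_cast<std::size_t>(count));
    for (int32_t i = 0; i < count; i++) {
        TileContentIndexStruct& s = in[static_cast<std::size_t>(i)];
        s.urCode = i + 1;
        s.adcode = 100000 + i;
        s.levelNumber = i % 16;
        s.southWestTileId = 7000 + i;
        s.numRows = i + 2;
        s.numColumns = i + 3;
        s.tileContentIndex = static_cast<uint32_t>(9000 + i);
    }
    TileContentIndex* out = readTileContentIndexCallback(in.data(), count);
    bool ok = true;
    for (int32_t i = 0; i < count; i++) {
        const TileContentIndexStruct& s = in[static_cast<std::size_t>(i)];
        ok = ok && out[i].urID == static_cast<uint16_t>(s.urCode)
                && out[i].adcode == static_cast<uint32_t>(s.adcode)
                && out[i].level == static_cast<uint16_t>(s.levelNumber)
                && out[i].southWestTileId == static_cast<uint32_t>(s.southWestTileId)
                && out[i].numRows == static_cast<uint16_t>(s.numRows)
                && out[i].numColumns == static_cast<uint16_t>(s.numColumns)
                && out[i].tileIndex == s.tileContentIndex;
    }
    delete[] out;
    return ok;
}
```

Building this once with -O2 and once with -O3 and calling convertMatches with a count larger than the vector width is one way to compare the two builds. On an unaffected compiler both builds should report true.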

2. Investigating GCC optimizations

GCC’s -O3 enables all -O2 optimizations plus many additional flags:

Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags: -fgcse-after-reload, -fipa-cp-clone, -floop-interchange, -floop-unroll-and-jam, -fpeel-loops, -fpredictive-commoning, -fsplit-loops, -fsplit-paths, -ftree-loop-distribution, -ftree-partial-pre, -funswitch-loops, -fvect-cost-model=dynamic, -fversion-loops-for-strides

Running gcc -Q -O3 --help=optimizers shows the extra options that -O3 actually enables. Not all of them relate to loops or vectorization; the list includes:

-ftree-loop-distribute-patterns

-ftree-loop-vectorize

-finline-functions

-ftree-slp-vectorize

-floop-interchange

-floop-unroll-and-jam

-ftree-loop-distribution

-funswitch-loops

-fversion-loops-for-strides

The flag -ftree-loop-vectorize performs loop vectorization. As section 3 confirms, it is the flag that triggers the crash.

2.1 Loop vectorization example

A simple loop before vectorization:

for (int i = 0; i < 16; i++) {
    a[i] = b[i];
}

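After vectorization the loop behaves as if it were written as follows, with the four assignments in each iteration carried out together by a single SIMD load and store rather than by four scalar copies: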

for (int i = 0; i < 16; i+=4) {
    a[i] = b[i];
    a[i+1] = b[i+1];
    a[i+2] = b[i+2];
    a[i+3] = b[i+3];
}

The loop now runs 4 iterations instead of 16, and each iteration moves four elements with one vector operation, which is where the speedup comes from.
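To make the "four elements per operation" idea concrete, here is a minimal sketch that writes the vectorized copy by hand using the GCC/Clang vector_size extension. It is an illustration only, not what the compiler emits for the article's code; the names v4si and copy16 are hypothetical, and the arrays are assumed to hold exactly 16 ints.

```cpp
#include <cassert>
#include <cstring>

// GCC/Clang vector extension: four 32-bit ints packed into one 16-byte value.
typedef int v4si __attribute__((vector_size(16)));
static_assert(sizeof(v4si) == 16, "v4si must hold four 32-bit ints");

// Copies 16 ints from b to a, four at a time.
void copy16(int* a, const int* b) {
    assert(a != nullptr && b != nullptr);
    for (int i = 0; i < 16; i += 4) {
        v4si chunk;
        std::memcpy(&chunk, b + i, sizeof chunk);  // one 16-byte load of b[i..i+3]
        std::memcpy(a + i, &chunk, sizeof chunk);  // one 16-byte store to a[i..i+3]
    }
}
```

The memcpy calls avoid alignment and aliasing assumptions about the int arrays; an optimizing compiler typically turns each one into a single vector load or store.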

2.2 Assembly inspection

Compiling the demo with -O2 yields straightforward ARM64 assembly (excerpt):

_Z28readTileContentIndexCallbackP22TileContentIndexStructi:
    .LFB48:
        cmp  w1, #0
        ble  .L2
        mov  x2, x0
        adrp  x3, .LANCHOR0
        add   x3, x3, :lo12:.LANCHOR0
        sub   w1, w1, #1
        add   x1, x1, x1, lsl #2
        add   x0, x0, #40
        add   x0, x0, x1, lsl #3
    .L3:
        ldr  w1, [x2]
        strh w1, [x3]
        ldr  w1, [x2, #4]
        str  w1, [x3, #4]
        ...
        add  x2, x2, #40
        add  x3, x3, #24
        cmp  x2, x0
        bne  .L3
    .L2:
        adrp x0, .LANCHOR0
        add  x0, x0, :lo12:.LANCHOR0
        ret

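In this scalar loop, x2 walks the input array in 40-byte steps and x3 walks the output array in 24-byte steps, one struct per iteration. With -O3 the compiler replaces it with NEON vector instructions. A representative fragment: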

.L5:
    ldr  d0, [x2]
    ldr  d1, [x2, #8]
    zip1 v0.2s, v0.2s, v1.2s
    ins  v0.d[1], v1.d[0]
    xtn  v0.4h, v0.4s
    str  s1, [x7], #24
    st1  {v1.s}[1], [x7]
    ...

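The NEON instructions operate on the 128-bit vector registers v0 to v31. Each register can be viewed as lanes of a given width: v0.16b, v0.8h, v0.4s, or v0.2d for the full 128 bits, and v0.8b, v0.4h, or v0.2s for the lower 64 bits. The scalar names d0 and s1 in the fragment refer to the low 64 bits of v0 and the low 32 bits of v1.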

2.3 NEON details

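NEON is the SIMD extension of the Arm architecture. In the fragment above, zip1 interleaves the elements from the lower halves of two source vectors, ins copies one element into a chosen lane of another vector, xtn narrows each element to half its width by keeping the low bits (here 32-bit to 16-bit, matching the uint16_t casts in the source), and st1 stores one or more vector elements to memory (in the form {v1.s}[1], a single 32-bit lane).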
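The lane semantics of zip1 and xtn can be modelled in plain C++. This is a scalar sketch for illustration, not NEON intrinsics; the type aliases and the function names zip1_4s, xtn_4h, and lanesExample are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec4s = std::array<uint32_t, 4>;  // models a register viewed as .4s
using Vec4h = std::array<uint16_t, 4>;  // models a register viewed as .4h

// Models "zip1 Vd.4s, Vn.4s, Vm.4s": interleaves the lower halves of n and m.
Vec4s zip1_4s(const Vec4s& n, const Vec4s& m) {
    return Vec4s{{n[0], m[0], n[1], m[1]}};
}

// Models "xtn Vd.4h, Vn.4s": keeps the low 16 bits of each 32-bit lane.
Vec4h xtn_4h(const Vec4s& n) {
    Vec4h d{};
    for (std::size_t i = 0; i < n.size(); i++) {
        d[i] = static_cast<uint16_t>(n[i]);
    }
    return d;
}

// Small self-check showing the models on concrete lane values.
void lanesExample() {
    const Vec4s a{{1, 2, 3, 4}};
    const Vec4s b{{5, 6, 7, 8}};
    assert((zip1_4s(a, b) == Vec4s{{1, 5, 2, 6}}));
    assert(xtn_4h(Vec4s{{0x12345678u, 0u, 0u, 0u}})[0] == 0x5678);
}
```

The narrowing step is why xtn shows up in the vectorized loop: several fields are cast from a wider integer to uint16_t, and xtn performs that truncation on four lanes at once.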

3. Fixing the crash

Disabling the offending optimization eliminates the crash:

g++ -O3 -fno-tree-loop-vectorize -S -o main3t.s main.cpp
g++ -o main3t main3t.s

The crash disappears, confirming that -ftree-loop-vectorize is the trigger.

Further investigation showed that the generated -O3 assembly incorrectly treats the size of TileContentIndexStruct as 8 bytes instead of 40 bytes, so the offsets of fields such as tileContentIndex are wrong and the loop reads from the wrong addresses. Patching the assembly to use the correct 40-byte stride (e.g., changing add x6, x6, #32 to add x6, x6, #160; both values are four times the element size, 4 × 8 versus 4 × 40) also resolves the issue.
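The stride arithmetic can be checked with a short sketch. It assumes, as the 32 versus 160 patch suggests, that the vectorized loop handles four elements per iteration. Record40 is a hypothetical 40-byte stand-in for TileContentIndexStruct, and the helper names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 40-byte record standing in for TileContentIndexStruct.
struct Record40 {
    int32_t header[6];   // 24 bytes
    int64_t payload[2];  // 16 bytes
};
static_assert(sizeof(Record40) == 40, "stand-in must be 40 bytes");

// Assumption: four elements are processed per vectorized iteration.
constexpr std::size_t kElementsPerIteration = 4;

// Bytes by which the input pointer must advance after one vectorized iteration.
constexpr std::size_t byteAdvance(std::size_t elementSize) {
    return kElementsPerIteration * elementSize;
}

// Index of the array element that contains the given byte offset.
constexpr std::size_t elementAtOffset(std::size_t byteOffset) {
    return byteOffset / sizeof(Record40);
}

static_assert(byteAdvance(sizeof(Record40)) == 160, "correct stride");
static_assert(byteAdvance(8) == 32, "stride under the wrong 8-byte size");
```

With the correct advance of 160 bytes the next iteration starts at element 4. With an advance of 32 bytes it starts in the middle of element 0, so every later load picks up bytes that belong to unrelated fields.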

4. Conclusions

The problem is a bug in GCC‑arm‑9.2’s loop‑vectorization for this specific pattern. Newer GCC versions (e.g., 10.3) and clang do not exhibit the crash, either because they avoid the optimization or handle the struct size correctly. The pragmatic solution for projects stuck on this compiler version is to keep -O3 for overall performance but disable -ftree-loop-vectorize for the affected translation unit.
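Where changing the flags of a whole translation unit is inconvenient, a narrower alternative is GCC's optimize function attribute, which applies -f options to a single function. The sketch below is an assumption about how that could look, not the fix used in the article: NO_LOOP_VECTORIZE and narrowCopy are hypothetical names, and the GCC manual describes the optimize attribute as intended for debugging rather than production use, so the result should be confirmed by inspecting the generated assembly.

```cpp
#include <cassert>
#include <cstdint>

// Expands to GCC's optimize attribute on GCC and to nothing elsewhere
// (Clang defines __GNUC__ too but does not support this attribute).
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_VECTORIZE __attribute__((optimize("no-tree-loop-vectorize")))
#else
#define NO_LOOP_VECTORIZE
#endif

// Hypothetical stand-in for the affected loop: narrows 32-bit values to
// 16 bits, with loop vectorization disabled for this function only.
NO_LOOP_VECTORIZE
void narrowCopy(const int32_t* in, uint16_t* out, int32_t count) {
    assert(count >= 0);
    for (int32_t i = 0; i < count; i++) {
        out[i] = static_cast<uint16_t>(in[i]);
    }
}
```

The rest of the file still compiles with the full -O3 option set, so only the marked function gives up vectorization.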

Understanding compiler optimizations, inspecting generated assembly, and knowing how to selectively disable problematic passes are essential skills for low‑level performance debugging.
