Why GCC’s Loop Vectorization Crashed My Code and How to Fix It
A client-reported segmentation fault appeared after GCC's optimization level was raised from -O2 to -O3. The cause turned out to be a bug in the -ftree-loop-vectorize pass, which used the wrong struct size when computing offsets. This article covers the analysis, the assembly inspection, the relevant NEON instructions, and a practical workaround.
Background: a client reported a segmentation‑fault crash after changing the GCC optimization level from -O2 to -O3. The crash appears consistently and is reproduced with a minimal demo.
1. Finding the culprit
The problematic code copies data from an array of TileContentIndexStruct to an array of TileContentIndex:
void* readTileContentIndexCallback(TileContentIndexStruct *tileIndexData, int32_t count) {
    TileContentIndex* tileContentIndexList = new TileContentIndex[count];
    for (int32_t index = 0; index < count; index++) {
        TileContentIndexStruct &inData = tileIndexData[index];
        TileContentIndex &outData = tileContentIndexList[index];
        outData.urID = (uint16_t)inData.urCode;
        outData.adcode = (uint32_t)inData.adcode;
        outData.level = (uint16_t)inData.levelNumber;
        outData.southWestTileId = (uint32_t)inData.southWestTileId;
        outData.numRows = (uint16_t)inData.numRows;
        outData.numColumns = (uint16_t)inData.numColumns;
        outData.tileIndex = inData.tileContentIndex;
    }
    return tileContentIndexList;
}
Compiling with -O3 reproduces the crash, while -O2 runs correctly.
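The article does not show the two struct definitions, so the following is only a sketch of what a self-checking reproduction harness might look like. Both layouts are assumptions, chosen so that their sizes match the 40-byte and 24-byte strides visible in the assembly later on; the `reserved` padding field and the helper names `convertTiles` and `conversionIsCorrect` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical input layout (not from the article): 40 bytes in total.
struct TileContentIndexStruct {
    int32_t  urCode;
    int32_t  adcode;
    int32_t  levelNumber;
    int32_t  southWestTileId;
    int32_t  numRows;
    int32_t  numColumns;
    uint32_t tileContentIndex;
    uint32_t reserved[3];  // assumed padding to reach 40 bytes
};

// Hypothetical output layout (not from the article): 24 bytes with padding.
struct TileContentIndex {
    uint16_t urID;
    uint32_t adcode;
    uint16_t level;
    uint32_t southWestTileId;
    uint16_t numRows;
    uint16_t numColumns;
    uint32_t tileIndex;
};

static_assert(sizeof(TileContentIndexStruct) == 40, "input stride assumed to be 40 bytes");
static_assert(sizeof(TileContentIndex) == 24, "output stride assumed to be 24 bytes");

// Same field-by-field narrowing copy as the loop above, returning a vector.
std::vector<TileContentIndex> convertTiles(const TileContentIndexStruct* in, int32_t count) {
    std::vector<TileContentIndex> out(static_cast<std::size_t>(count > 0 ? count : 0));
    for (int32_t i = 0; i < count; i++) {
        out[i].urID            = static_cast<uint16_t>(in[i].urCode);
        out[i].adcode          = static_cast<uint32_t>(in[i].adcode);
        out[i].level           = static_cast<uint16_t>(in[i].levelNumber);
        out[i].southWestTileId = static_cast<uint32_t>(in[i].southWestTileId);
        out[i].numRows         = static_cast<uint16_t>(in[i].numRows);
        out[i].numColumns      = static_cast<uint16_t>(in[i].numColumns);
        out[i].tileIndex       = in[i].tileContentIndex;
    }
    return out;
}

// Fills `count` inputs with distinct values and verifies every converted field.
bool conversionIsCorrect(int32_t count) {
    std::vector<TileContentIndexStruct> in(static_cast<std::size_t>(count > 0 ? count : 0));
    for (int32_t i = 0; i < count; i++) {
        in[i].urCode           = i + 1;
        in[i].adcode           = i + 2;
        in[i].levelNumber      = i + 3;
        in[i].southWestTileId  = i + 4;
        in[i].numRows          = i + 5;
        in[i].numColumns       = i + 6;
        in[i].tileContentIndex = static_cast<uint32_t>(i + 7);
    }
    const std::vector<TileContentIndex> out = convertTiles(in.data(), count);
    for (int32_t i = 0; i < count; i++) {
        if (out[i].urID != i + 1 || out[i].adcode != static_cast<uint32_t>(i + 2) ||
            out[i].level != i + 3 || out[i].southWestTileId != static_cast<uint32_t>(i + 4) ||
            out[i].numRows != i + 5 || out[i].numColumns != i + 6 ||
            out[i].tileIndex != static_cast<uint32_t>(i + 7)) {
            return false;
        }
    }
    return true;
}
```

A harness of this kind can be built at both -O2 and -O3 so that a miscompiled copy shows up as a failed check as well as a crash. Whether these assumed layouts trigger the same bug as the client's real structs is not known.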
2. Investigating GCC optimizations
GCC’s -O3 enables all -O2 optimizations plus many additional flags:
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags: -fgcse-after-reload, -fipa-cp-clone, -floop-interchange, -floop-unroll-and-jam, -fpeel-loops, -fpredictive-commoning, -fsplit-loops, -fsplit-paths, -ftree-loop-distribution, -ftree-partial-pre, -funswitch-loops, -fvect-cost-model=dynamic, -fversion-loops-for-strides
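Running gcc -Q -O3 --help=optimizers shows the options that are additionally enabled at -O3 on this compiler, including the two vectorization flags (-ftree-loop-vectorize and -ftree-slp-vectorize):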
-ftree-loop-distribute-patterns
-ftree-loop-vectorize
-finline-functions
-ftree-slp-vectorize
-floop-interchange
-floop-unroll-and-jam
-ftree-loop-distribution
-funswitch-loops
-fversion-loops-for-strides
Of these, -ftree-loop-vectorize, which performs loop vectorization, turns out to be the trigger for the crash (confirmed in Section 3).
2.1 Loop vectorization example
A simple loop before vectorization:
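for (int i = 0; i < 16; i++) {
    a[i] = b[i];
}
After vectorization, the compiler processes four elements per iteration. Conceptually, the loop becomes:
for (int i = 0; i < 16; i += 4) {
    a[i]   = b[i];
    a[i+1] = b[i+1];
    a[i+2] = b[i+2];
    a[i+3] = b[i+3];
}
This reduces the number of loop iterations. In the real vectorized code, the four element copies in each iteration are carried out by a single SIMD load and store rather than four scalar ones, which is where the speedup comes from.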
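The shape of a vectorized loop can be sketched in portable C++. This is a model, not what GCC emits: `std::memcpy` of 16 bytes stands in for one 128-bit vector load and store, and the function names `copyScalar` and `copyChunked` are hypothetical. It also shows the scalar tail loop that handles element counts that are not a multiple of four.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar reference: one element per iteration.
void copyScalar(const int32_t* b, int32_t* a, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        a[i] = b[i];
    }
}

// Model of a vectorized copy: four int32_t values (16 bytes, the width of
// one 128-bit register) per iteration, followed by a scalar tail.
void copyChunked(const int32_t* b, int32_t* a, std::size_t n) {
    constexpr std::size_t kLanes = 4;
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        // Stands in for one vector load plus one vector store.
        std::memcpy(a + i, b + i, kLanes * sizeof(int32_t));
    }
    for (; i < n; i++) {
        a[i] = b[i];  // leftover elements when n is not a multiple of 4
    }
}
```

Both functions produce the same result for any `n`. The chunked version depends on the per-iteration byte count (`kLanes * sizeof(int32_t)`) being right, which is the kind of calculation that goes wrong in the bug described later.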
2.2 Assembly inspection
Compiling the demo with -O2 yields straightforward ARM64 assembly (excerpt):
_Z28readTileContentIndexCallbackP22TileContentIndexStructi:
.LFB48:
cmp w1, #0
ble .L2
mov x2, x0
adrp x3, .LANCHOR0
add x3, x3, :lo12:.LANCHOR0
sub w1, w1, #1
add x1, x1, x1, lsl #2
add x0, x0, #40
add x0, x0, x1, lsl #3
.L3:
ldr w1, [x2]
strh w1, [x3]
ldr w1, [x2, #4]
str w1, [x3, #4]
...
add x2, x2, #40
add x3, x3, #24
cmp x2, x0
bne .L3
.L2:
adrp x0, .LANCHOR0
add x0, x0, :lo12:.LANCHOR0
ret
With -O3 the compiler inserts NEON vector instructions. A representative fragment:
.L5:
ldr d0, [x2]
ldr d1, [x2, #8]
zip1 v0.2s, v0.2s, v1.2s
ins v0.d[1], v1.d[0]
xtn v0.4h, v0.4s
str s1, [x7], #24
st1 {v1.s}[1], [x7]
...
The NEON instructions operate on the 128-bit vector registers v0-v31. Each register can be viewed as a set of lanes: the low 64 bits as v0.8b, v0.4h, or v0.2s, or the full 128 bits as v0.16b, v0.8h, v0.4s, or v0.2d.
2.3 NEON details
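NEON provides SIMD capabilities. For example, zip1 interleaves the lower-half elements of two source vectors, xtn narrows each element to half its width by keeping the low bits, and st1 stores single lanes or whole registers to memory.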
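The lane semantics of zip1 and xtn can be modeled in plain C++ without NEON hardware. The sketch below uses `std::array` as a stand-in for a vector register; the function names simply mirror the instruction mnemonics and are not a real intrinsics API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Model of zip1: interleave the lower halves of two vectors,
// producing {a0, b0, a1, b1, ...}.
template <std::size_t N>
std::array<uint32_t, N> zip1(const std::array<uint32_t, N>& a,
                             const std::array<uint32_t, N>& b) {
    static_assert(N % 2 == 0, "lane count must be even");
    std::array<uint32_t, N> r{};
    for (std::size_t i = 0; i < N / 2; i++) {
        r[2 * i]     = a[i];
        r[2 * i + 1] = b[i];
    }
    return r;
}

// Model of xtn v.4h, v.4s: keep the low 16 bits of each 32-bit lane.
std::array<uint16_t, 4> xtn(const std::array<uint32_t, 4>& v) {
    std::array<uint16_t, 4> r{};
    for (std::size_t i = 0; i < v.size(); i++) {
        r[i] = static_cast<uint16_t>(v[i] & 0xFFFFu);
    }
    return r;
}
```

In the -O3 fragment, xtn is what implements the `(uint16_t)` casts of the source loop: four 32-bit input fields are narrowed to four 16-bit values in one instruction.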
3. Fixing the crash
Disabling the offending optimization eliminates the crash:
g++ -O3 -fno-tree-loop-vectorize -S -o main3t.s main.cpp
g++ -o main3t main3t.s
The crash disappears, confirming that -ftree-loop-vectorize is the trigger.
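Further investigation showed that the generated -O3 assembly assumes the size of TileContentIndexStruct to be 8 bytes instead of 40 bytes, which produces wrong offsets for fields such as tileContentIndex. The vectorized loop advances its input pointer with add x6, x6, #32, which is consistent with four elements per iteration at the assumed 8-byte size; four elements at the real 40-byte size need an advance of 160 bytes. Patching the assembly to use that stride (changing add x6, x6, #32 to add x6, x6, #160) also resolves the issue.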
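The stride arithmetic behind that patch can be written out as a small compile-time check. The figure of four elements per vectorized iteration is an inference from the two immediates (32 / 8 and 160 / 40), not something the article states, and all names here are hypothetical.

```cpp
#include <cstddef>

// Assumed from the immediates in the assembly: 32 / 8 == 160 / 40 == 4.
constexpr std::size_t kElementsPerIteration = 4;
constexpr std::size_t kAssumedSize = 8;   // size the faulty code behaves as if it used
constexpr std::size_t kActualSize  = 40;  // real size of TileContentIndexStruct

// Bytes the input pointer must advance per vectorized iteration.
constexpr std::size_t pointerAdvance(std::size_t elementSize) {
    return kElementsPerIteration * elementSize;
}

// Byte offset of element `index` from the start of the array.
constexpr std::size_t elementOffset(std::size_t index, std::size_t elementSize) {
    return index * elementSize;
}

static_assert(pointerAdvance(kAssumedSize) == 32,  "matches add x6, x6, #32");
static_assert(pointerAdvance(kActualSize)  == 160, "matches add x6, x6, #160");
```

With the wrong stride, the pointer after one iteration sits at byte 32, inside element 0, instead of at byte 160, the start of element 4, so every later load reads fields from the wrong place.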
4. Conclusions
The problem is a bug in GCC‑arm‑9.2’s loop‑vectorization for this specific pattern. Newer GCC versions (e.g., 10.3) and clang do not exhibit the crash, either because they avoid the optimization or handle the struct size correctly. The pragmatic solution for projects stuck on this compiler version is to keep -O3 for overall performance but disable -ftree-loop-vectorize for the affected translation unit.
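If disabling the pass for a whole translation unit is too coarse, GCC's `optimize` function attribute can in principle narrow the workaround to a single function. The sketch below is an assumption about how that might be applied, not something the original investigation did: the macro name `NO_LOOP_VECTORIZE` and the function `narrowValues` are hypothetical, and GCC's documentation describes this attribute as intended for debugging rather than production use, so the per-file flag remains the more conservative choice.

```cpp
#include <cstdint>

// GCC only: request -fno-tree-loop-vectorize for one function.
// On other compilers (including clang) the macro expands to nothing.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_VECTORIZE __attribute__((optimize("no-tree-loop-vectorize")))
#else
#define NO_LOOP_VECTORIZE
#endif

// Hypothetical stand-in for the affected loop: a narrowing copy from
// 32-bit inputs to 16-bit outputs, compiled without loop vectorization.
NO_LOOP_VECTORIZE
void narrowValues(const int32_t* in, uint16_t* out, int32_t count) {
    for (int32_t i = 0; i < count; i++) {
        out[i] = static_cast<uint16_t>(in[i]);
    }
}
```

It is worth checking the generated assembly for the annotated function, since the point of the exercise is to confirm that no NEON loop body is emitted for it.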
Understanding compiler optimizations, inspecting generated assembly, and knowing how to selectively disable problematic passes are essential skills for low‑level performance debugging.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.