How to Make Your C Code Run Faster: 8 Proven Optimization Techniques

This article explains why code can run slowly on resource‑constrained devices and presents eight practical techniques—ranging from loop unrolling and memory access reduction to SIMD intrinsics and table look‑ups—to help C programmers write faster, more efficient code.

Alibaba Cloud Developer

Overview

Slow code is a common problem, and high compute usage can make a project infeasible, especially on resource-constrained edge devices. Enabling -O3 is an easy first step, but writing code that the compiler can analyze easily can give even better speedups. The author, who has worked on ARM/DSP/GPU acceleration of image and audio algorithms, shares personal insights on making code run faster, with examples in C.

Adjusting Source Code

1. Reduce loop overhead and increase parallelism

Modern CPUs are superscalar and can issue several instructions per cycle. Unrolling a loop cuts the per-iteration counter and branch overhead and gives the hardware more independent work to keep the pipeline full.

int acc = 0;
for (int i = 0; i < 1000; i++) {
    acc += data[i];
}

Unrolled version (each addition still depends on the previous value of acc):

int acc = 0;
for (int i = 0; i < 1000; i += 4) {
    acc += data[i];
    acc += data[i+1];
    acc += data[i+2];
    acc += data[i+3];
}

Dependency‑free version using separate accumulators:

int accRes = 0;
int acc[4] = {0};
for (int i = 0; i < 1000; i += 4) {
    acc[0] += data[i];
    acc[1] += data[i+1];
    acc[2] += data[i+2];
    acc[3] += data[i+3];
}
accRes = acc[0] + acc[1] + acc[2] + acc[3];

Note: Some compilers already perform unrolling automatically, especially on DSP platforms.
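The fragments above assume the element count is a multiple of four. As a minimal sketch, the four-accumulator idea could be wrapped in a function that also handles leftover elements. The name sum_unrolled4 and the use of int32_t and size_t are assumptions made for illustration, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n values using four independent accumulators. */
int32_t sum_unrolled4(const int32_t *data, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    /* Main loop: four elements per iteration; the accumulators do not
       depend on each other, so the additions can overlap. */
    for (; i + 4 <= n; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }

    /* Tail loop: the 0 to 3 elements left when n is not a multiple of 4. */
    for (; i < n; i++) {
        acc0 += data[i];
    }

    return acc0 + acc1 + acc2 + acc3;
}
```

Whether this beats the plain loop depends on the compiler and target, so it is worth comparing the generated code or timings before keeping it.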

2. Remove unnecessary memory references

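Keep data in registers as much as possible and write back only when needed. The compiler often cannot do this on its own: in the loop below it must assume that arrA or arrB might overlap data, so it reloads and stores arrA[100] and arrB[50] on every iteration.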

for (int i = 0; i < 1000; i += 2) {
    arrA[100] += data[i];
    arrB[50]  += data[i+1];
}

Refactored version:

int a = 0;
int b = 0;
for (int i = 0; i < 1000; i += 2) {
    a += data[i];
    b += data[i+1];
}
arrA[100] += a;
arrB[50]  += b;

3. Avoid branch statements inside loops

A mispredicted branch flushes the pipeline. Remove conditional checks from loop bodies when possible, or use branch-free logic.

for (int i = 0; i < 1000; i++) {
    if (a > b) {
        // code...
    } else {
        // code...
    }
}

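Branch-free version using separate loops. When the condition depends only on the loop index, the iteration space can be split into segments, each with its own straight-line body: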

for (int i = 0; i < 3; i++) {
    // code for first segment
}
for (int i = 3; i < 500; i++) {
    // code for middle segment
}
for (int i = 500; i < 1000; i++) {
    // code for last segment
}
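Segmenting only helps when the condition is a function of the loop index. When it depends on the data instead, one option is to turn the condition into a mask and select the result arithmetically. The sketch below assumes a hypothetical operation (add 5 to positive inputs, 10 to all others) and a hypothetical function name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: out[i] = in[i] + 5 if in[i] > 0, else in[i] + 10,
   written without an if/else in the loop body. */
void add_offset_branchless(const int32_t *in, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* (in[i] > 0) is 1 or 0; negating gives all ones or all zeros. */
        int32_t mask = -(int32_t)(in[i] > 0);
        /* Pick 5 where the mask is set and 10 where it is clear. */
        int32_t offset = (5 & mask) | (10 & ~mask);
        out[i] = in[i] + offset;
    }
}
```

Many compilers already emit a conditional select for a simple ternary expression, so this form is worth adopting only if measurement shows a benefit.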

4. Merge loop read/write

Combine separate read and write passes to reduce memory traffic.

int tmp[1000];
for (int i = 0; i < 1000; i++) {
    tmp[i] = data0[i] * data1[i];
}
for (int i = 0; i < 1000; i++) {
    out[i] = tmp[i] + data2[i];
}

Merged version:

for (int i = 0; i < 1000; i++) {
    out[i] = data0[i] * data1[i] + data2[i];
}

5. Avoid non-contiguous memory accesses (cache misses)

Cache misses occur when memory accesses break spatial locality. Re‑ordering data can improve cache behavior; matrix‑multiply blocking is a classic example.

[Figure: cache hierarchy diagram]
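As a rough illustration of the matrix-multiply blocking mentioned above, the sketch below processes square row-major matrices in tiles so that each tile of B is reused while it is still in cache. The function name, the tile size of 32, and the row-major float layout are all assumptions; a suitable tile size depends on the cache of the target device.

```c
#include <stddef.h>

#define BLOCK 32  /* assumed tile size; tune for the target cache */

/* Computes C += A * B for n x n row-major matrices.
   The caller is expected to initialise C (for example to zero). */
void matmul_blocked(const float *A, const float *B, float *C, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                /* Clamp tile edges so n need not be a multiple of BLOCK. */
                size_t iEnd = (ii + BLOCK < n) ? ii + BLOCK : n;
                size_t kEnd = (kk + BLOCK < n) ? kk + BLOCK : n;
                size_t jEnd = (jj + BLOCK < n) ? jj + BLOCK : n;

                for (size_t i = ii; i < iEnd; i++) {
                    for (size_t k = kk; k < kEnd; k++) {
                        float a = A[i * n + k];
                        /* Innermost loop walks B and C contiguously. */
                        for (size_t j = jj; j < jEnd; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The i-k-j loop order inside each tile keeps the innermost accesses sequential, which is the spatial-locality point made in the text.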

6. Eliminate unnecessary memset/memcpy

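For large buffers, removing redundant memset or memcpy calls can noticeably speed up image processing. In the example below every element of img is overwritten by the loop, so clearing the buffer first is wasted memory traffic.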

// Original with memset
memset(img, 0, 1080*1140*sizeof(int16_t));
for (int i = 0; i < 1080; i++) {
    for (int j = 0; j < 1140; j++) {
        img[i][j] = img0[i][j] * gain;
    }
}

Optimized version without memset:

for (int i = 0; i < 1080; i++) {
    for (int j = 0; j < 1140; j++) {
        img[i][j] = img0[i][j] * gain;
    }
}

7. Table‑lookup optimization

Pre‑compute expensive results and store them in a lookup table.

// Pre‑compute exponent values
for (int i = 0; i < 256; i++) {
    table[i] = exp(i);
}
for (int i = 0; i < 1000; i++) {
    res[i] = table[data[i]] * gain[i];
}

8. Other useful tricks

Fuse multiple image filters into one pass so that they share reads and writes and need no intermediate buffers.

Order struct members (for example, largest first) to reduce padding.

Replace divisions with multiplications or Newton‑Raphson approximations when high precision is not required.

Convert floating‑point operations to fixed‑point where possible, focusing on bit‑width optimization.

Use compiler‑provided software prefetch intrinsics to improve cache hit rate.

Apply pragma directives to give the compiler extra information for better optimization.

Qualify pointers with restrict (or a compiler-specific spelling such as __restrict) so the compiler can assume they do not overlap.
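Three of the tricks above (fixed-point arithmetic, replacing division with multiplication, and non-aliasing pointers) can be sketched together. The function names and the choice of the Q15 format are assumptions made for illustration. The reciprocal version may differ from true division in the last bit, which is the precision trade-off the text refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* Q15 fixed point: real value = raw / 32768, so 0.5 is stored as 16384.
   Assumes >> on a negative int is an arithmetic shift, which holds on
   mainstream compilers but is implementation-defined in C. */
static inline int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */
    p += 1 << 14;                         /* round to nearest */
    p >>= 15;                             /* back to Q15 */
    if (p > INT16_MAX) {
        p = INT16_MAX;                    /* saturate the -1 * -1 case */
    }
    return (int16_t)p;
}

/* restrict tells the compiler that in and out never overlap. */
void scale_q15(const int16_t *restrict in, int16_t *restrict out,
               size_t n, int16_t gainQ15)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = q15_mul(in[i], gainQ15);
    }
}

/* One division outside the loop, then n multiplications inside it. */
void divide_by_constant(const float *restrict in, float *restrict out,
                        size_t n, float divisor)
{
    const float inv = 1.0f / divisor;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * inv;
    }
}
```

Passing overlapping buffers to a function whose parameters are restrict-qualified is undefined behavior, so the qualifier is a promise the caller has to keep.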

Using Hardware Acceleration Instructions

Modern processors support SIMD extensions (NEON/MVE on ARM, SSE/AVX on x86, HVX on Hexagon DSP). SIMD can process multiple data elements per instruction, dramatically increasing throughput.

1. SIMD intrinsics for acceleration

Example using ARM NEON intrinsics:

int32x4_t vRes = vdupq_n_s32(0);
for (int i = 0; i < 1000; i += 4) {
    int32x4_t vData = vld1q_s32(data + i);
    vRes = vaddq_s32(vRes, vData);
}
int accRes = vaddvq_s32(vRes);
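The fragment above assumes an AArch64 target (vaddvq_s32 is not available on 32-bit ARM) and a length that is a multiple of four. A more portable sketch could guard the NEON path with preprocessor checks and fall back to scalar code elsewhere. The function name sum_s32 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums n values, using NEON when it is available. */
int32_t sum_s32(const int32_t *data, size_t n)
{
    size_t i = 0;
    int32_t acc = 0;

#if defined(__ARM_NEON) && defined(__aarch64__)
    int32x4_t vAcc = vdupq_n_s32(0);
    /* Four lanes per iteration. */
    for (; i + 4 <= n; i += 4) {
        vAcc = vaddq_s32(vAcc, vld1q_s32(data + i));
    }
    /* Horizontal add of the four lanes (AArch64 only). */
    acc = vaddvq_s32(vAcc);
#endif

    /* Scalar tail, or the whole array on targets without NEON. */
    for (; i < n; i++) {
        acc += data[i];
    }
    return acc;
}
```

Keeping a scalar path also gives a reference result to compare the vector path against during testing.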

2. SIMD to eliminate branches

Branch‑free vector code using masks:

for (int i = 0; i < 1000; i += 4) {
    int32x4_t vData = vld1q_s32(dataIn + i);
    int32x4_t v0 = vaddq_s32(vData, vdupq_n_s32(5));
    int32x4_t v1 = vaddq_s32(vData, vdupq_n_s32(10));
    uint32x4_t vMask = vcgtq_s32(vData, vdupq_n_s32(0));
    int32x4_t vDst = vbslq_s32(vMask, v0, v1);
    vst1q_s32(dataOut + i, vDst);
}

3. Hand‑written assembly for fine‑grained tuning

Sometimes inspecting and tweaking the generated assembly yields the biggest gains. The author identified two store instructions that caused register dependencies and reordered them, removing the dependency and improving performance.

... // original fragment with dependent stores
stur q3, [x0, #-32]   // register dependency
... // optimized fragment
stur q3, [x0, #-32]   // dependency removed

DSP compilers often provide cycle‑accurate profiling tools (e.g., Hexagon PMU events, ADI‑2156x cycle counts) that help locate bottlenecks.
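Where no vendor profiler is at hand, a coarse before/after comparison can be made with the standard clock() function. The sketch below is only a rough aid under that assumption: clock() measures processor time at limited resolution and is no substitute for cycle-accurate tools. The names kernel_fn, time_kernel_ms, and count_calls are hypothetical.

```c
#include <time.h>

/* Hypothetical signature for the code under test. */
typedef void (*kernel_fn)(void *ctx);

/* Demonstration kernel: only counts how often it was called. */
void count_calls(void *ctx)
{
    int *counter = ctx;
    (*counter)++;
}

/* Returns the average processor time per call in milliseconds.
   repeats must be greater than zero. */
double time_kernel_ms(kernel_fn fn, void *ctx, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        fn(ctx);
    }
    clock_t end = clock();
    return 1000.0 * (double)(end - start)
           / (double)CLOCKS_PER_SEC / (double)repeats;
}
```

Running the kernel many times and averaging reduces the effect of timer resolution; for short kernels a larger repeats value gives steadier numbers.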

Conclusion

Readable code is often slower because it contains redundant calculations and memory accesses. On multi‑core CPUs, parallelism, thread pools, and memory pools can further improve performance. Profiling tools such as perf, flame graphs, and hardware counters are essential to identify whether the bottleneck is compute‑bound or memory‑bound. SIMD, DSP, and GPU acceleration each suit different workloads, and choosing the right platform is key to achieving high performance.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
