Fundamentals 18 min read

10 Proven Techniques to Supercharge Your C Code Performance

This article presents ten practical low‑level optimization strategies—ranging from reducing computational workload and extracting common subexpressions to loop unrolling, accumulator parallelism, and conditional‑move coding—each illustrated with C examples, detailed analysis, and before‑after performance comparisons.

Liangxu Linux

Feb 3, 2021

10 Proven Techniques to Supercharge Your C Code Performance

Writing programs that remain stable under all conditions requires balancing simplicity with speed; this guide explores ten concrete techniques to improve program performance.

1. Reduce Computational Workload

1.1 Example Code

for (i = 0; i < n; i++) {
  int ni = n*i;
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
}

1.2 Analysis

The outer loop performs a multiplication (i * n) on every iteration. By replacing the multiplication with an addition that increments by n each time, the number of arithmetic operations is reduced.

1.3 Improved Code

int ni = 0;
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
  ni += n; // replace multiplication with addition
}

Multiplication instructions are significantly slower than addition on most CPUs.

2. Extract Common Subexpressions

2.1 Example Code

Assume a 2‑D image stored as a linear array val. The goal is to sum the four orthogonal neighbours of a pixel.

up    = val[(i-1)*n + j];
down  = val[(i+1)*n + j];
left  = val[i*n + j-1];
right = val[i*n + j+1];
sum   = up + down + left + right;

2.2 Analysis

Compilation reveals three separate i*n multiplications. Since each neighbour expression shares the term i*n + j, we can compute it once and reuse it.

2.3 Improved Code

long inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum   = up + down + left + right;

The optimized version generates only one multiplication, saving roughly six clock cycles.

3. Eliminate Inefficient Loop Code

3.1 Example Code

void lower1(char *s) {
  size_t i;
  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

3.2 Analysis

The loop calls strlen on every iteration, turning an O(N) operation into O(N²) because strlen itself scans the string each time.

3.3 Improved Code

void lower2(char *s) {
  size_t i;
  size_t len = strlen(s);
  for (i = 0; i < len; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

Moving the length calculation outside the loop reduces the complexity to linear time, as shown by the performance graph.

4. Remove Unnecessary Memory References

4.1 Example Code

void sum_rows1(double *a, double *b, long n) {
  long i, j;
  for (i = 0; i < n; i++) {
    b[i] = 0;
    for (j = 0; j < n; j++)
      b[i] += a[i*n + j];
  }
}

4.2 Analysis

The inner loop repeatedly loads b[i] from memory and writes it back, incurring unnecessary memory traffic.

4.3 Improved Code

void sum_rows2(double *a, double *b, long n) {
  long i, j;
  for (i = 0; i < n; i++) {
    double val = 0;
    for (j = 0; j < n; j++)
      val += a[i*n + j];
    b[i] = val;
  }
}

Using a temporary variable eliminates the repeated memory accesses, as reflected in the streamlined assembly.

5. Reduce Unnecessary Calls

5.1 Example Code

typedef struct { size_t len; data_t *data; } vec;

int get_vec_element(vec *v, size_t idx, data_t *val) {
  if (idx >= v->len) return 0;
  *val = v->data[idx];
  return 1;
}

void combine1(vec_ptr v, data_t *dest) {
  long i;
  *dest = NULL;
  for (i = 0; i < vec_length(v); i++) {
    data_t val;
    get_vec_element(v, i, &val);
    *dest = *dest * val;
  }
}

5.2 Analysis

Each iteration performs a bounds check inside get_vec_element, which adds overhead.

5.3 Improved Code

size_t vec_length(vec_ptr v);

data_t *get_vec_start(vec_ptr v) { return v->data; }

void combine2(vec_ptr v, data_t *dest) {
  long i;
  size_t length = vec_length(v);
  data_t *data = get_vec_start(v);
  *dest = NULL;
  for (i = 0; i < length; i++)
    *dest = *dest * data[i];
}

6. Loop Unrolling

6.1 Example Code

Improving combine2 by processing two elements per iteration.

6.2 Analysis

Unrolling reduces loop‑control overhead and increases instruction‑level parallelism.

6.3 Improved Code

void combine3(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);
  long limit = length - 1;
  data_t *data = get_vec_start(v);
  data_t acc = NULL;
  for (i = 0; i < limit; i += 2) {
    acc = (acc * data[i]) * data[i+1];
  }
  for (; i < length; i++) {
    acc = acc * data[i];
  }
  *dest = acc;
}

Be careful to set limit correctly to avoid out‑of‑bounds accesses.

7. Accumulator Parallelism (2×2 Unrolling)

7.1 Example Code

Further unrolling by maintaining two independent accumulators.

7.2 Analysis

Two accumulators allow the processor to keep multiple functional units busy.

7.3 Improved Code

void combine4(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);
  long limit = length - 1;
  data_t *data = get_vec_start(v);
  data_t acc0 = 0, acc1 = 0;
  for (i = 0; i < limit; i += 2) {
    acc0 = acc0 * data[i];
    acc1 = acc1 * data[i+1];
  }
  for (; i < length; i++)
    acc0 = acc0 * data[i];
  *dest = acc0 * acc1;
}

8. Reassociation Transformation

8.1 Example Code

Reordering multiplication to shorten the critical path.

8.2 Analysis

By grouping operations differently, more work can be performed in parallel, improving pipeline utilization.

8.3 Improved Code

void combine7(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);
  long limit = length - 1;
  data_t *data = get_vec_start(v);
  data_t acc = IDENT;
  for (i = 0; i < limit; i += 2) {
    acc = acc * (data[i] * data[i+1]);
  }
  for (; i < length; i++)
    acc = acc * data[i];
  *dest = acc;
}

9. Conditional‑Move Style Code

9.1 Example Code

void minmax1(long a[], long b[], long n) {
  long i;
  for (i = 0; i < n; i++) {
    if (a[i] > b[i]) {
      long t = a[i];
      a[i] = b[i];
      b[i] = t;
    }
  }
}

9.2 Analysis

Branch mispredictions on modern CPUs can stall pipelines. Using conditional moves eliminates the branch.

9.3 Improved Code

void minmax2(long a[], long b[], long n) {
  long i;
  for (i = 0; i < n; i++) {
    long min = a[i] < b[i] ? a[i] : b[i];
    long max = a[i] < b[i] ? b[i] : a[i];
    a[i] = min;
    b[i] = max;
  }
}

This version computes the minimum and maximum with a conditional operator, avoiding branch prediction penalties.

10. Summary

The ten techniques—reducing work, extracting common subexpressions, eliminating redundant calls, removing unnecessary memory traffic, loop unrolling, accumulator parallelism, reassociation, and conditional‑move coding—provide a practical toolbox for low‑level performance tuning in C and C‑like languages. Applying them judiciously can yield measurable speedups without altering program semantics.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

optimization C low-level Assembly Loops

Written by

Liangxu Linux

Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.