10 Proven Techniques to Supercharge Your C Code Performance
This article presents ten practical low‑level optimization strategies—ranging from reducing computational workload and extracting common subexpressions to loop unrolling, accumulator parallelism, and conditional‑move coding—each illustrated with C examples, detailed analysis, and before‑after performance comparisons.
Writing programs that remain stable under all conditions requires balancing simplicity with speed; this guide explores ten concrete techniques to improve program performance.
1. Reduce Computational Workload
1.1 Example Code
for (i = 0; i < n; i++) {
int ni = n*i;
for (j = 0; j < n; j++)
a[ni + j] = b[j];
}1.2 Analysis
The outer loop performs a multiplication (i * n) on every iteration. By replacing the multiplication with an addition that increments by n each time, the number of arithmetic operations is reduced.
1.3 Improved Code
int ni = 0;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++)
a[ni + j] = b[j];
ni += n; // replace multiplication with addition
}Multiplication instructions are significantly slower than addition on most CPUs.
2. Extract Common Subexpressions
2.1 Example Code
Assume a 2‑D image stored as a linear array val. The goal is to sum the four orthogonal neighbours of a pixel.
up = val[(i-1)*n + j];
down = val[(i+1)*n + j];
left = val[i*n + j-1];
right = val[i*n + j+1];
sum = up + down + left + right;2.2 Analysis
Compilation reveals three separate i*n multiplications. Since each neighbour expression shares the term i*n + j, we can compute it once and reuse it.
2.3 Improved Code
long inj = i*n + j;
up = val[inj - n];
down = val[inj + n];
left = val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;The optimized version generates only one multiplication, saving roughly six clock cycles.
3. Eliminate Inefficient Loop Code
3.1 Example Code
void lower1(char *s) {
size_t i;
for (i = 0; i < strlen(s); i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}3.2 Analysis
The loop calls strlen on every iteration, turning an O(N) operation into O(N²) because strlen itself scans the string each time.
3.3 Improved Code
void lower2(char *s) {
size_t i;
size_t len = strlen(s);
for (i = 0; i < len; i++)
if (s[i] >= 'A' && s[i] <= 'Z')
s[i] -= ('A' - 'a');
}Moving the length calculation outside the loop reduces the complexity to linear time, as shown by the performance graph.
4. Remove Unnecessary Memory References
4.1 Example Code
void sum_rows1(double *a, double *b, long n) {
long i, j;
for (i = 0; i < n; i++) {
b[i] = 0;
for (j = 0; j < n; j++)
b[i] += a[i*n + j];
}
}4.2 Analysis
The inner loop repeatedly loads b[i] from memory and writes it back, incurring unnecessary memory traffic.
4.3 Improved Code
void sum_rows2(double *a, double *b, long n) {
long i, j;
for (i = 0; i < n; i++) {
double val = 0;
for (j = 0; j < n; j++)
val += a[i*n + j];
b[i] = val;
}
}Using a temporary variable eliminates the repeated memory accesses, as reflected in the streamlined assembly.
5. Reduce Unnecessary Calls
5.1 Example Code
typedef struct { size_t len; data_t *data; } vec;
int get_vec_element(vec *v, size_t idx, data_t *val) {
if (idx >= v->len) return 0;
*val = v->data[idx];
return 1;
}
void combine1(vec_ptr v, data_t *dest) {
long i;
*dest = NULL;
for (i = 0; i < vec_length(v); i++) {
data_t val;
get_vec_element(v, i, &val);
*dest = *dest * val;
}
}5.2 Analysis
Each iteration performs a bounds check inside get_vec_element, which adds overhead.
5.3 Improved Code
size_t vec_length(vec_ptr v);
data_t *get_vec_start(vec_ptr v) { return v->data; }
void combine2(vec_ptr v, data_t *dest) {
long i;
size_t length = vec_length(v);
data_t *data = get_vec_start(v);
*dest = NULL;
for (i = 0; i < length; i++)
*dest = *dest * data[i];
}6. Loop Unrolling
6.1 Example Code
Improving combine2 by processing two elements per iteration.
6.2 Analysis
Unrolling reduces loop‑control overhead and increases instruction‑level parallelism.
6.3 Improved Code
void combine3(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
long limit = length - 1;
data_t *data = get_vec_start(v);
data_t acc = NULL;
for (i = 0; i < limit; i += 2) {
acc = (acc * data[i]) * data[i+1];
}
for (; i < length; i++) {
acc = acc * data[i];
}
*dest = acc;
}Be careful to set limit correctly to avoid out‑of‑bounds accesses.
7. Accumulator Parallelism (2×2 Unrolling)
7.1 Example Code
Further unrolling by maintaining two independent accumulators.
7.2 Analysis
Two accumulators allow the processor to keep multiple functional units busy.
7.3 Improved Code
void combine4(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
long limit = length - 1;
data_t *data = get_vec_start(v);
data_t acc0 = 0, acc1 = 0;
for (i = 0; i < limit; i += 2) {
acc0 = acc0 * data[i];
acc1 = acc1 * data[i+1];
}
for (; i < length; i++)
acc0 = acc0 * data[i];
*dest = acc0 * acc1;
}8. Reassociation Transformation
8.1 Example Code
Reordering multiplication to shorten the critical path.
8.2 Analysis
By grouping operations differently, more work can be performed in parallel, improving pipeline utilization.
8.3 Improved Code
void combine7(vec_ptr v, data_t *dest) {
long i;
long length = vec_length(v);
long limit = length - 1;
data_t *data = get_vec_start(v);
data_t acc = IDENT;
for (i = 0; i < limit; i += 2) {
acc = acc * (data[i] * data[i+1]);
}
for (; i < length; i++)
acc = acc * data[i];
*dest = acc;
}9. Conditional‑Move Style Code
9.1 Example Code
void minmax1(long a[], long b[], long n) {
long i;
for (i = 0; i < n; i++) {
if (a[i] > b[i]) {
long t = a[i];
a[i] = b[i];
b[i] = t;
}
}
}9.2 Analysis
Branch mispredictions on modern CPUs can stall pipelines. Using conditional moves eliminates the branch.
9.3 Improved Code
void minmax2(long a[], long b[], long n) {
long i;
for (i = 0; i < n; i++) {
long min = a[i] < b[i] ? a[i] : b[i];
long max = a[i] < b[i] ? b[i] : a[i];
a[i] = min;
b[i] = max;
}
}This version computes the minimum and maximum with a conditional operator, avoiding branch prediction penalties.
10. Summary
The ten techniques—reducing work, extracting common subexpressions, eliminating redundant calls, removing unnecessary memory traffic, loop unrolling, accumulator parallelism, reassociation, and conditional‑move coding—provide a practical toolbox for low‑level performance tuning in C and C‑like languages. Applying them judiciously can yield measurable speedups without altering program semantics.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
