Why Does memcpy Scale Non‑Linearly? Exploring Faster Alternatives for Repeated Memory Copies

This post examines why repeated memcpy operations on non‑contiguous destination buffers exhibit non‑linear timing, presents benchmark code and results on an i5 laptop, discusses possible alternatives such as memmove, kernel APIs, and remap_file_pages, and shares community insights.

ITPUB
Background

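A program repeatedly copies roughly 1 MiB of source data (1 000 blocks of 1 KiB) into a destination buffer three times as large, using only every third 1 KiB row of the destination. Each copy is memcpy(&dst[3*j], &src[j], 1024), where j is the byte offset of the block in the source. The inner loop runs 1 000 times and is itself repeated 1 000 times, for 1 000 000 copies in total. Note that the benchmark listing below computes the destination offset as m*j with m = k % 3, so its stride cycles through 0, 1, and 2 rather than staying fixed at 3.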
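As a rough illustration of the copy pattern described above, the following C sketch copies fixed-size rows from a packed source into every stride-th row of a larger destination. The function name copy_strided and its parameters are hypothetical, introduced only for this example; they are not part of the original benchmark.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the original benchmark): copy `rows` rows of
 * `row_bytes` bytes each from a packed source into every `stride`-th row of
 * the destination. For rows > 0 the caller must ensure that dst holds at
 * least ((rows - 1) * stride + 1) * row_bytes bytes and that the two buffers
 * do not overlap.
 */
void copy_strided(char *dst, const char *src,
                  size_t rows, size_t row_bytes, size_t stride)
{
    size_t i;
    for (i = 0; i < rows; i++) {
        /* Source rows are contiguous; destination rows sit `stride` rows apart. */
        memcpy(dst + i * stride * row_bytes, src + i * row_bytes, row_bytes);
    }
}
```

With rows = 1000, row_bytes = 1024, and stride = 3, this corresponds to the memcpy(&dst[3*j], &src[j], 1024) pattern.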

Benchmark program

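#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void) {
    int i, j, k, m;
    long elapsed_us;
    struct timeval start_t, end_t;
    char *src = malloc(1024 * 1024);
    char *dst = malloc(1024 * 1024 * 3);
    if (src == NULL || dst == NULL) {
        perror("malloc");
        return 1;
    }
    /* Touch both buffers before timing so that first-touch page faults
       are not part of the measurement. */
    memset(src, 0x56, 1024 * 1024);
    memset(dst, 0x00, 1024 * 1024 * 3);
    gettimeofday(&start_t, NULL);
    for (k = 0; k < 1000; k++) {
        m = k % 3;                /* destination stride multiplier: 0, 1, or 2 */
        for (i = 0; i < 1000; i++) {
            j = 1024 * i;         /* byte offset of the i-th 1 KiB block in src */
            memcpy(&dst[m * j], &src[j], 1024);
        }
    }
    gettimeofday(&end_t, NULL);
    /* Fold seconds and microseconds into one value first, so the printed
       microsecond field cannot be negative. */
    elapsed_us = (long)(end_t.tv_sec - start_t.tv_sec) * 1000000L
               + (long)(end_t.tv_usec - start_t.tv_usec);
    printf("#dlt=%02ld Sec %06ld uSec#\n", elapsed_us / 1000000L, elapsed_us % 1000000L);
    free(src);
    free(dst);
    return 0;
}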
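Subtracting the tv_sec and tv_usec fields of two gettimeofday samples separately can produce a negative microsecond value whenever the interval crosses a second boundary. A minimal sketch of a helper that folds both fields into a single microsecond count is shown here; the name elapsed_usec is hypothetical and chosen only for illustration.

```c
#include <sys/time.h>

/*
 * Hypothetical helper: elapsed microseconds between two gettimeofday()
 * samples. Converting to a single count avoids a negative tv_usec
 * difference when `end` falls early in a later second.
 */
long long elapsed_usec(struct timeval start, struct timeval end)
{
    long long sec  = (long long)end.tv_sec  - (long long)start.tv_sec;
    long long usec = (long long)end.tv_usec - (long long)start.tv_usec;
    return sec * 1000000LL + usec;
}
```

The result can then be split for printing with elapsed / 1000000 and elapsed % 1000000.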

Measured performance on Ubuntu 12.04 (i5, 4 GB RAM)

Single memcpy of 1 KiB: ~3.3 ms.

1 000 copies (inner loop only): ~70 ms.

10 000 copies: ~600 ms.

Adding memset(src, k, 1024*1024) inside the outer loop raises the 1 000‑copy time to ~90 ms.

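The timing does not scale linearly: a single copy takes ~3.3 ms, whereas 1 000 copies average about 0.07 ms each (70 ms / 1 000) and 10 000 copies about 0.06 ms each (600 ms / 10 000). The first copy is therefore far slower than the subsequent ones, which suggests one-time costs such as cold caches and, possibly, first-touch page faults on freshly allocated memory.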

Observations

Repeated copies benefit from data already being in cache, reducing the per‑copy cost after the first iteration.

Even when the source data is overwritten each iteration, the overall trend (70 ms for 1 000 copies, 600 ms for 10 000 copies) remains, indicating that the memory subsystem and the implementation of memcpy dominate the cost.
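One way to check the cold-versus-warm explanation is to time the first copy separately from the repeats. The sketch below is only an illustration under assumptions: the helper names now_ns and time_copies are hypothetical, and it uses the POSIX clock_gettime call with CLOCK_MONOTONIC rather than the gettimeofday call used in the benchmark.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime under strict C modes */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: time one initial copy, then `reps` further copies of
 * the same buffers. Stores the duration of the first copy in *first_ns and
 * returns the average duration, in nanoseconds, of the repeated copies.
 */
double time_copies(char *dst, const char *src, size_t bytes,
                   int reps, long long *first_ns)
{
    long long t0, t1, t2;
    int r;

    t0 = now_ns();
    memcpy(dst, src, bytes);        /* first copy: caches may be cold */
    t1 = now_ns();
    for (r = 0; r < reps; r++) {
        memcpy(dst, src, bytes);    /* repeats: buffers were just touched */
    }
    t2 = now_ns();

    *first_ns = t1 - t0;
    return reps > 0 ? (double)(t2 - t1) / reps : 0.0;
}
```

If the first-copy time is much larger than the returned average, one-time costs rather than memcpy itself account for the gap. An optimizing compiler may elide redundant copies, so it is worth inspecting the generated code before trusting the numbers.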

Potential alternatives

Community feedback notes that beating the highly tuned glibc memcpy is difficult. Possible avenues include:

Using memmove when overlapping regions are possible (performance is similar, not faster).

Invoking kernel‑level APIs or DMA engines for large, aligned transfers.

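Exploring the remap_file_pages system call for page‑level remapping, which may reduce copy overhead in specific scenarios. Note that its man page marks the call as deprecated since Linux 3.16, so it is a poor fit for new code. Reference: http://www.man7.org/linux/man-pages/man2/remap_file_pages.2.html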
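To make the memmove point concrete, here is a minimal sketch of a case where source and destination overlap. The helper name shift_right is hypothetical. Calling memcpy on overlapping regions is undefined behavior in C, which is why memmove is the correct choice here rather than a faster one.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: shift the first `len` bytes of `buf` to the right by
 * `shift` bytes, in place. The source range [0, len) and the destination
 * range [shift, shift + len) overlap whenever shift < len, so memmove is
 * required. The caller must ensure buf holds at least len + shift bytes.
 */
void shift_right(char *buf, size_t len, size_t shift)
{
    memmove(buf + shift, buf, len);
}
```

For non-overlapping buffers such as the src and dst arrays in the benchmark, memcpy remains the appropriate call.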

Figures

[Figure: Performance chart]
[Figure: 10K iteration timing]
[Figure: Cache effect observation]