Why Does memcpy Scale Non‑Linearly? Exploring Faster Alternatives for Repeated Memory Copies
This post examines why repeated memcpy operations on non‑contiguous destination buffers exhibit non‑linear timing, presents benchmark code and results on an i5 laptop, discusses possible alternatives such as memmove, kernel APIs, and remap_file_pages, and shares community insights.
Background
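A program repeatedly copies data from a 1 MiB source buffer into a destination buffer three times as large, writing only every third 1 KiB row of the destination. Each copy is memcpy(&dst[3*j], &src[j], 1024) with j = 1024 * i, issued 1 000 times per pass over 1 000 passes, for 1 000 000 copies in total. Note that the listing below computes the destination offset as m*j with m = k % 3, so the row stride there is 0, 1, or 2 depending on the pass rather than a fixed 3.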
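The every-third-row pattern described above can be sketched as a small helper. This is only an illustration: the name strided_copy and its parameters are hypothetical and do not appear in the original program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `rows` rows of `row_bytes` bytes each from src into every
 * `stride`-th row of dst (hypothetical helper, for illustration only).
 * dst must hold at least ((rows - 1) * stride + 1) * row_bytes bytes. */
void strided_copy(char *dst, const char *src,
                  size_t rows, size_t row_bytes, size_t stride) {
    assert(dst != NULL && src != NULL && stride > 0);
    for (size_t r = 0; r < rows; r++) {
        /* Source rows are contiguous; destination rows are `stride` apart. */
        memcpy(dst + r * stride * row_bytes, src + r * row_bytes, row_bytes);
    }
}
```

With rows = 1000, row_bytes = 1024 and stride = 3, row r lands at byte offset 3 * 1024 * r, which is the &dst[3*j] addressing from the description.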
Benchmark program
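#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void) {
    int i, j, k, m;
    long sec, usec;
    struct timeval start_t, end_t;
    char *src = malloc(1024 * 1024);
    char *dst = malloc(1024 * 1024 * 3);

    if (src == NULL || dst == NULL) {
        perror("malloc");
        return 1;
    }
    /* Initialize both buffers. The original listing cleared src a second
       time here, which looks like a typo for dst. */
    memset(src, 0x56, 1024 * 1024);
    memset(dst, 0x00, 1024 * 1024 * 3);

    gettimeofday(&start_t, NULL);
    for (k = 0; k < 1000; k++) {
        m = k % 3;  /* destination stride multiplier: 0, 1, or 2 */
        for (i = 0; i < 1000; i++) {
            j = 1024 * i;
            memcpy(&dst[m * j], &src[j], 1024);
        }
    }
    gettimeofday(&end_t, NULL);

    /* Borrow one second when the microsecond difference is negative. */
    sec = (long)(end_t.tv_sec - start_t.tv_sec);
    usec = (long)(end_t.tv_usec - start_t.tv_usec);
    if (usec < 0) { sec--; usec += 1000000; }
    printf("#dlt=%02ld Sec %06ld uSec#\n", sec, usec);

    free(src);
    free(dst);
    return 0;
}
Measured performance on Ubuntu 12.04 (i5, 4 GB RAM)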
Single memcpy of 1 KiB: ~3.3 ms.
1 000 copies (inner loop only): ~70 ms.
10 000 copies: ~600 ms.
Adding memset(src, k, 1024*1024) inside the outer loop raises the 1 000‑copy time to ~90 ms.
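The timing does not scale linearly with the number of copies: the averages work out to about 3.3 ms for a single copy, roughly 70 µs per copy over 1 000 copies, and roughly 60 µs per copy over 10 000. The first copy is far slower than the ones that follow, which suggests one-time warm-up costs such as cold caches (and possibly first-touch page faults on freshly allocated memory).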
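The per-copy averages can be worked out from the reported totals. The sketch below only redoes that arithmetic on the figures listed above; the function names are hypothetical and nothing here is a new measurement.

```c
#include <assert.h>
#include <stdio.h>

/* Average cost of one copy in microseconds, given a total in milliseconds. */
double per_copy_us(double total_ms, long copies) {
    assert(copies > 0);
    return total_ms * 1000.0 / (double)copies;
}

/* Print the averages implied by the three totals reported in this post. */
void print_reported_averages(void) {
    printf("1 copy:       %.0f us per copy\n", per_copy_us(3.3, 1));
    printf("1000 copies:  %.0f us per copy\n", per_copy_us(70.0, 1000));
    printf("10000 copies: %.0f us per copy\n", per_copy_us(600.0, 10000));
}
```

Seen this way, the cost per copy is nearly flat once the first copy is out of the way, so most of the apparent non-linearity sits in that first call.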
Observations
Repeated copies benefit from data already being in cache, reducing the per‑copy cost after the first iteration.
Even when the source data is overwritten each iteration, the overall trend (70 ms for 1 000 copies, 600 ms for 10 000 copies) remains, indicating that the memory subsystem and the implementation of memcpy dominate the cost.
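One way to check the warm-up explanation is to time each pass separately instead of timing the whole run. The following is a minimal sketch, assuming a POSIX system that provides clock_gettime. The names now_ns, timed_pass and measure_passes are hypothetical, and the copy is contiguous rather than strided to keep the example short.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

enum { CHUNK = 1024, CHUNKS = 1000 };  /* same sizes as the benchmark's inner loop */

/* Monotonic clock reading in nanoseconds. */
long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Copy CHUNKS blocks of CHUNK bytes and return the elapsed nanoseconds. */
long long timed_pass(char *dst, const char *src) {
    long long t0 = now_ns();
    for (int i = 0; i < CHUNKS; i++)
        memcpy(dst + (size_t)i * CHUNK, src + (size_t)i * CHUNK, CHUNK);
    return now_ns() - t0;
}

/* Time `passes` consecutive passes over freshly allocated buffers.
 * out_ns[0] is the cold pass; later entries are warm passes.
 * Returns 0 on success, -1 on allocation failure or a bad copy. */
int measure_passes(long long *out_ns, int passes) {
    const size_t total = (size_t)CHUNK * CHUNKS;
    assert(out_ns != NULL && passes > 0);
    char *src = malloc(total);
    char *dst = malloc(total);
    if (src == NULL || dst == NULL) { free(src); free(dst); return -1; }
    memset(src, 0x56, total);
    /* dst is left untouched on purpose, so the first pass also pays
     * for any first-touch page faults on the destination. */
    for (int p = 0; p < passes; p++)
        out_ns[p] = timed_pass(dst, src);
    int rc = (memcmp(dst, src, total) == 0) ? 0 : -1;
    free(src);
    free(dst);
    return rc;
}
```

Printing out_ns[0] next to the later entries shows whether the first pass stands out on a given machine; no particular ratio is implied here.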
Potential alternatives
Community feedback notes that beating the highly tuned glibc memcpy is difficult. Possible avenues include:
Using memmove when overlapping regions are possible (performance is similar, not faster).
Invoking kernel‑level APIs or DMA engines for large, aligned transfers.
Exploring the remap_file_pages system call for page-level remapping, which may reduce copy overhead in specific scenarios. Note that the man page marks this call as deprecated since Linux 3.16. Reference: http://www.man7.org/linux/man-pages/man2/remap_file_pages.2.html
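The idea behind page-level remapping is that two virtual addresses can share the same physical pages, so data written through one becomes visible through the other without a memcpy. The sketch below illustrates that idea with two ordinary mmap calls on a temporary file. It is not remap_file_pages and not a drop-in replacement for the strided copy; the function name map_twice is hypothetical, and a POSIX system is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for fileno and ftruncate under strict -std modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the same `len` bytes of an anonymous temporary file at two
 * different addresses. Returns 0 on success, -1 on failure. */
int map_twice(size_t len, char **view_a, char **view_b) {
    assert(len > 0 && view_a != NULL && view_b != NULL);
    FILE *fp = tmpfile();
    if (fp == NULL) return -1;
    int fd = fileno(fp);
    if (ftruncate(fd, (off_t)len) != 0) { fclose(fp); return -1; }

    /* MAP_SHARED makes both views refer to the same underlying pages. */
    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    fclose(fp);  /* existing mappings stay valid after the file is closed */

    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED) munmap(a, len);
        if (b != MAP_FAILED) munmap(b, len);
        return -1;
    }
    *view_a = a;
    *view_b = b;
    return 0;
}
```

After a successful call, a memset through view_a is readable through view_b with no copy in between; the caller releases each view with munmap. Whether this helps in practice depends on page alignment and on the cost of setting up the mappings, which this sketch does not measure.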
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
