Why Does memcpy Scale Non‑Linearly? Exploring Faster Alternatives for Repeated Memory Copies

This post examines why repeated memcpy operations on non‑contiguous destination buffers exhibit non‑linear timing, presents benchmark code and results on an i5 laptop, discusses possible alternatives such as memmove, kernel APIs, and remap_file_pages, and shares community insights.

ITPUB

Jun 17, 2016

Background

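A program repeatedly copies data from a 1 MiB source buffer, one 1 KiB row at a time, into a destination buffer three times as large. The rows are written at strided offsets, so they are not contiguous in the destination. The problem statement gives the call as memcpy(&dst[3*j], &src[j], 1024), which fills only every third row of the destination; the listing below instead multiplies the offset by m = k % 3, which changes on every outer pass. The copy runs in an inner loop of 1 000 iterations, repeated 1 000 times, for a total of 1 000 000 copies.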

Benchmark program

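#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void) {
    int i, j, k, m;
    long dlt_usec;
    struct timeval start_t, end_t;
    char *src = malloc(1024 * 1024);
    char *dst = malloc(1024 * 1024 * 3);
    if (src == NULL || dst == NULL) {
        perror("malloc");
        return 1;
    }
    memset(src, 0x56, 1024 * 1024);
    /* Clear the destination; the original listing cleared src a second time. */
    memset(dst, 0x00, 1024 * 1024 * 3);
    gettimeofday(&start_t, NULL);
    for (k = 0; k < 1000; k++) {
        m = k % 3;                 /* destination stride multiplier: 0, 1 or 2 */
        for (i = 0; i < 1000; i++) {
            j = 1024 * i;          /* byte offset of the i-th 1 KiB row in src */
            memcpy(&dst[m * j], &src[j], 1024);
        }
    }
    gettimeofday(&end_t, NULL);
    /* Work in total microseconds so the printed uSec field is never negative. */
    dlt_usec = (end_t.tv_sec - start_t.tv_sec) * 1000000L
             + (end_t.tv_usec - start_t.tv_usec);
    printf("#dlt=%02ld Sec %06ld uSec#\n", dlt_usec / 1000000L, dlt_usec % 1000000L);
    free(src);
    free(dst);
    return 0;
}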

Measured performance on Ubuntu 12.04 (i5, 4 GB RAM)

Single memcpy of 1 KiB: ~3.3 ms.

1 000 copies (inner loop only): ~70 ms.

10 000 copies: ~600 ms.

Adding memset(src, k, 1024*1024) inside the outer loop raises the 1 000‑copy time to ~90 ms.

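The timing does not scale linearly: the first copy (~3.3 ms) costs far more than the average later copy (70 ms / 1 000 = 70 µs per copy, and 600 ms / 10 000 = 60 µs per copy), which suggests that warm-up costs such as cache misses dominate the first copy.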

Observations

Repeated copies benefit from data already being in cache, reducing the per‑copy cost after the first iteration.

Even when the source data is overwritten each iteration, the overall trend (70 ms for 1 000 copies, 600 ms for 10 000 copies) remains, indicating that the memory subsystem and the implementation of memcpy dominate the cost.

Potential alternatives

Community feedback notes that beating the highly tuned glibc memcpy is difficult. Possible avenues include:

Using memmove when overlapping regions are possible (performance is similar, not faster).

Invoking kernel‑level APIs or DMA engines for large, aligned transfers.

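Exploring the remap_file_pages system call for page‑level remapping, which may reduce copy overhead in specific scenarios. Note that the man page marks this call as deprecated since Linux 3.16, and that since Linux 4.0 it is implemented by a slower in-kernel emulation, so it is unlikely to help on current kernels. Reference: http://www.man7.org/linux/man-pages/man2/remap_file_pages.2.html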
