Why Linux Uses Copy‑On‑Write: Boosting Process Creation and Memory Efficiency

This article explains Linux’s Copy‑On‑Write mechanism, detailing how it avoids full memory duplication during process creation, the underlying page‑table workflow, its implementation in the kernel, and real‑world applications such as Redis persistence, Docker image layering, and filesystem snapshots, while also discussing its advantages and drawbacks.

Deepin Linux

Sep 21, 2025

When Linux creates a new process, copying the entire memory of the parent each time is both time‑consuming and wasteful – clearly a case of effort without reward. Copy‑on‑Write (COW) takes the opposite approach: it initially avoids any actual copy, letting the parent and child share the same memory region, and only copies the parts that need to be modified when one side attempts to write.

This seemingly corner‑cutting operation is central to Linux memory management and a widely reused efficiency principle: container engines, database snapshots, and language runtimes all use COW to avoid work that may never be needed. It is not truly ‘lazy’; it concentrates effort where it matters, deferring the cost of a copy until a write proves that the copy is necessary.

1. What Is Copy‑On‑Write?

In the Linux process‑management domain, the Copy‑on‑Write (COW) mechanism is an extremely clever design that solves many drawbacks of the traditional process‑creation method, greatly improving system performance and resource utilization. To understand the importance of COW, we first need to discuss the pain points of the traditional fork.

1.1 Traditional fork “performance trap”: cost of full copy

In early operating‑system designs, when we used the fork system call to create a child process, it employed a simple but rather “brutal” method: it copied the parent’s code segment, data segment, heap, and stack entirely to the child. Imagine the parent process as a warehouse full of items; fork is like building an identical child warehouse, copying everything regardless of whether the child will ever need those items.

For example, when we execute a simple "ls" command in a shell, the shell (parent) forks a child to run the "ls" program. If the parent occupies tens of megabytes, the fork copies those tens of megabytes to the child, and the child then calls exec to load the "ls" program, replacing most of the copied memory. This is akin to painstakingly copying an entire warehouse only to immediately discard most of the items and replace them with new ones – the previous copy becomes completely useless.

This full copy not only takes time (often milliseconds for tens of megabytes) but also consumes a large amount of physical memory, becoming a performance bottleneck in high‑concurrency scenarios.

1.2 Copy‑On‑Write “breakthrough”: delayed copy wisdom

The emergence of Copy‑on‑Write resolves the fork dilemma. Its core idea can be summarized in one short phrase: “share on read, copy on write”. It replaces the traditional full‑copy model.

When a fork that uses COW creates a child, the kernel does not immediately copy the parent’s physical memory. Instead, it copies the parent’s virtual‑address‑space structure, i.e., the page tables. The page tables act like an index map of memory; by copying them, the child appears to have the same memory layout as the parent, but both processes actually share the same physical memory. This is like two warehouses sharing the same storage area while each has its own inventory list (page table) to “view” the memory.

Only when either the parent or the child tries to modify the shared memory does the COW mechanism trigger. The system allocates a new physical page for the modifying process, copies the shared data into the new page, and then the process modifies the new page. The other process continues to see the original data, and the two become independent.

Through this approach, COW avoids massive unnecessary memory copying during child‑process creation, copying only when actual data modification occurs, greatly reducing memory‑copy overhead, improving process‑creation efficiency, and saving precious memory resources for high‑concurrency, memory‑sensitive scenarios.

2. Core Principles of Copy‑On‑Write

2.1 The essence: not “no copy”, but “copy later”

At first glance, COW can be misunderstood as never performing a copy to save memory and time. In reality, COW does not refuse copying; it cleverly delays the copy operation, which can be described as a “smart laziness” strategy.

This strategy originates from a common scenario: after a child process is created, it often immediately calls exec to load a completely new program. For example, when we type "ls -l" in a terminal, the shell forks a child, and the child quickly execs the "ls" program, replacing most of the inherited memory. If the system performed a full memory copy at fork time, it would be a waste of resources.

COW is the optimization for this situation. When using a COW fork, the child and parent share the same physical memory; their virtual address spaces look independent but initially map to the same physical pages. Only when either process attempts a write does COW intervene, allocating a new page, copying the data, and marking the page writable for the writer.

2.2 Three‑step workflow: from sharing to independence

The COW workflow can be clearly divided into three key steps, each tightly connected to achieve efficient memory management.

First step: fork creates the child and copies page tables. When the parent calls fork, the kernel creates a separate process control block for the child and copies the parent’s page tables. The child now has the same virtual‑address layout, but both processes map the same physical pages. Writable private pages are marked read‑only in both processes, preparing for later COW.

Second step: read operations share memory. Both parent and child can read the shared pages without any copying, because read‑only protection does not restrict reads and so no fault occurs. Reads run at normal memory speed.

Third step: a write triggers the copy. When either process attempts to write to a shared page, the write violates the read‑only protection and a page fault occurs. The kernel allocates a new physical page, copies the original data, updates the writer’s page table to point to the new page, and marks it writable. The other process continues to see the original data.

2.3 Two technical pillars: virtual memory and reference counting

The efficient operation of COW relies on two key technologies in Linux: virtual memory and reference counting.

(1) Virtual memory: implementing address isolation and sharing

Virtual memory is a core feature of modern operating systems, providing each process with an independent 4 GB (on 32‑bit) virtual address space. In COW, although parent and child share the same physical memory, their virtual address spaces are independent, allowing the same virtual address to map to different physical pages or, initially, to the same physical page.

This achieves “logical isolation” and “physical sharing”, ensuring process independence while improving memory utilization. For example, both parent and child have a virtual address 0x1000 pointing to the same physical page; when the child writes, COW allocates a new page for the child, while the parent’s 0x1000 still points to the original page.
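The example above can be observed from user space. The following is a minimal sketch, assuming a Linux/POSIX system; the names `ChildReport`, `ForkResult`, and `write_after_fork` are hypothetical and exist only for illustration. The child overwrites a variable and reports both the variable's virtual address and its new value through a pipe, so the parent can compare the two views.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// What the child reports back to the parent through a pipe.
struct ChildReport {
    std::uintptr_t address;  // virtual address of `value` as seen by the child
    int value;               // contents of `value` after the child's write
};

struct ForkResult {
    bool same_virtual_address;  // parent and child use the same address
    int parent_value;           // parent's copy after the child wrote
    int child_value;            // child's copy after its own write
};

// Fork, let the child overwrite `value`, and compare both views.
ForkResult write_after_fork() {
    int value = 1;
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {  // child
        close(fds[0]);
        value = 2;   // the write lands in the child's private copy of the page
        ChildReport report{reinterpret_cast<std::uintptr_t>(&value), value};
        ssize_t n = write(fds[1], &report, sizeof report);
        _exit(n == static_cast<ssize_t>(sizeof report) ? 0 : 1);
    }

    close(fds[1]);   // parent
    ChildReport report{};
    ssize_t n = read(fds[0], &report, sizeof report);
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    assert(n == static_cast<ssize_t>(sizeof report));
    (void)n;

    return {report.address == reinterpret_cast<std::uintptr_t>(&value),
            value, report.value};
}
```

Calling `write_after_fork()` should report the same virtual address in both processes, with the parent still holding 1 and the child holding 2: the address is shared, the physical page behind it is not.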

Virtual memory simplifies several aspects of memory management:

Simplifies linking: consistent virtual address spaces make linker design easier.

Simplifies loading: the loader never copies data from disk to memory; the virtual‑memory system loads pages on demand.

Simplifies sharing: processes can share code and data via shared libraries or memory‑mapped files.

Simplifies allocation: provides a simple mechanism (malloc, etc.) for user processes to obtain extra memory.

In Linux, a page and a page‑frame are typically 4 KB, though sizes may vary.

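Once memory is divided into pages and page‑frames, a virtual address splits into a high‑order page number and a low‑order offset within the page. Only the page number is translated to a page‑frame number; the offset is carried over unchanged into the physical address.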

The page table records the mapping between page numbers and page‑frame numbers.
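The address split described above is plain integer arithmetic. A minimal sketch, assuming a power‑of‑two page size; `SplitAddress`, `split_address`, and `system_page_size` are illustrative names, not kernel interfaces:

```cpp
#include <cassert>
#include <cstdint>
#include <unistd.h>

struct SplitAddress {
    std::uint64_t page_number;  // high-order bits: index used for the page-table lookup
    std::uint64_t offset;       // low-order bits: byte position inside the page
};

// Split a virtual address, assuming page_size is a power of two.
SplitAddress split_address(std::uint64_t vaddr, std::uint64_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return {vaddr / page_size, vaddr % page_size};
}

// Page size reported by the running system (commonly 4096 on x86-64 Linux).
std::uint64_t system_page_size() {
    return static_cast<std::uint64_t>(sysconf(_SC_PAGESIZE));
}
```

With 4 KB pages, `split_address(0x12345, 4096)` yields page number 0x12 and offset 0x345. The page table maps page 0x12 to some page‑frame, and the offset 0x345 is reused as is.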

(2) Reference counting: tracking memory‑sharing state

Reference counting is a technique used to track how many processes reference a memory page. Each page has a reference count initialized to 1; when another process shares the page, the count increments; when a process stops using the page (e.g., exits or COW triggers), the count decrements.

Only when the reference count is greater than 1 and a write occurs does COW trigger. After copying, the original page’s reference count decreases, and the new page’s count is set to 1. This ensures accurate sharing status, avoids unnecessary copies, and prevents memory leaks.
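The interaction between page tables and reference counts can be sketched as a small user‑space simulation. This is a toy model under stated assumptions, not kernel code: `Frame`, `Pte`, and `Memory` are hypothetical types, and a "fault" is just a branch taken when a page‑table entry is not writable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A simulated physical page frame.
struct Frame {
    std::string data;
    int ref_count;
};

// One page-table entry: which frame is mapped and whether writes are allowed.
struct Pte {
    std::size_t frame;
    bool writable;
};

struct Memory {
    std::vector<Frame> frames;  // simulated physical memory
    int copies = 0;             // number of COW copies performed so far

    std::size_t alloc(const std::string& data) {
        frames.push_back({data, 1});
        return frames.size() - 1;
    }

    // fork: share every frame, write-protect the parent's entries, copy the table.
    std::vector<Pte> fork_table(std::vector<Pte>& parent) {
        for (Pte& pte : parent) {
            pte.writable = false;
            frames[pte.frame].ref_count++;
        }
        return parent;  // the child's table is a copy of the parent's entries
    }

    // Reads never fault, whether the frame is shared or not.
    const std::string& read(const std::vector<Pte>& table, std::size_t page) const {
        assert(page < table.size());
        return frames[table[page].frame].data;
    }

    // Write: a read-only entry "faults"; copy only if the frame is still shared.
    void write(std::vector<Pte>& table, std::size_t page, const std::string& data) {
        assert(page < table.size());
        Pte& pte = table[page];
        if (!pte.writable) {
            if (frames[pte.frame].ref_count > 1) {
                std::string original = frames[pte.frame].data;
                frames[pte.frame].ref_count--;  // leave the shared frame
                pte.frame = alloc(original);    // private copy with count 1
                ++copies;
            }
            pte.writable = true;  // sole owner: restoring write access is enough
        }
        frames[pte.frame].data = data;
    }
};
```

In this model, the first write by the child to a shared page performs one copy, while a later write by the parent to the same page performs none, because the original frame's count has already dropped back to 1.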

The same technique appears in user space, where copy‑on‑write strings are the classic example. The reference count can be stored in one of two ways:

Allocate a separate four‑byte counter (pCount) that records how many pointers refer to the shared space.

Reserve four bytes at the head of the allocated space itself to hold the count.

In both cases, sharing the space increments the count and releasing it decrements the count; the memory is actually freed only when the count reaches zero. On modification, the writer decrements the original space’s count and allocates a new space for itself.

Linux’s fork applies COW to memory pages; C++’s shared_ptr uses the same kind of reference counting to manage shared ownership.

When multiple string objects share the same data, they point to the same memory and maintain a reference count. Only when an object needs to modify the data does it make its own copy, so the other objects sharing the data are unaffected.

Below is a simplified COW string implementation:

A pointer to shared data (including characters and reference count).

Copying only shares data and increments the reference count.

On modification, check the reference count; if greater than 1, copy data first then modify.

#include <cstring>
#include <iostream>

// Shared data structure: stores actual characters and reference count
struct StringData {
    int ref_count;  // reference count
    char* data;     // character data

    // Constructor: initialize data and reference count
    StringData(const char* str) : ref_count(1) {
        data = new char[std::strlen(str) + 1];
        std::strcpy(data, str);
    }

    // Destructor: release character data
    ~StringData() {
        delete[] data;
    }
};

class COWString {
private:
    StringData* ptr;  // pointer to shared data

    // Ensure the object has an exclusive copy before modification
    void detach() {
        if (ptr->ref_count > 1) {
            // Reference count > 1, other objects share the data, need to copy
            StringData* new_ptr = new StringData(ptr->data);
            ptr->ref_count--;  // original data reference count -1
            ptr = new_ptr;     // point to new copy
        }
    }

public:
    // Constructor
    COWString(const char* str = "") : ptr(new StringData(str)) {}

    // Copy constructor: share data, increase reference count
    COWString(const COWString& other) : ptr(other.ptr) {
        ptr->ref_count++;
    }

    // Assignment operator
    COWString& operator=(const COWString& other) {
        if (this != &other) {
            // Decrease current data's reference count, delete if zero
            if (--ptr->ref_count == 0) {
                delete ptr;
            }
            // Share new data, increase reference count
            ptr = other.ptr;
            ptr->ref_count++;
        }
        return *this;
    }

    // Destructor: decrease reference count, delete if zero
    ~COWString() {
        if (--ptr->ref_count == 0) {
            delete ptr;
        }
    }

    // Read (no copy)
    char operator[](int index) const {
        return ptr->data[index];
    }

    // Write (triggers copy)
    char& operator[](int index) {
        detach();  // ensure exclusive data before modification
        return ptr->data[index];
    }

    // Print current string and reference count (for debugging)
    void print(const char* name) const {
        std::cout << name << ": \"" << ptr->data << "\" (ref_count: " << ptr->ref_count << ")\n";
    }
};

int main() {
    COWString s1 = "hello";
    s1.print("s1");  // s1: "hello" (ref_count: 1)

    COWString s2 = s1;  // copy, share data
    s1.print("s1");  // s1: "hello" (ref_count: 2)
    s2.print("s2");  // s2: "hello" (ref_count: 2)

    s2[0] = 'H';      // modify s2, trigger COW
    s1.print("s1");  // s1: "hello" (ref_count: 1) unchanged
    s2.print("s2");  // s2: "Hello" (ref_count: 1) new data

    return 0;
}

Another implementation stores the reference count directly at the head of the memory block:

#include <cstring>
#include <iostream>

class COWString {
private:
    char* data;  // points to whole block: [ref_count][char data]

    // Get pointer to reference count (first 4 bytes)
    int* ref_count() const { return (int*)data; }

    // Get pointer to actual string data (offset 4 bytes)
    char* str_data() const { return data + 4; }

    // Allocate new memory: header for ref count, then string
    void allocate(const char* str) {
        int len = std::strlen(str);
        data = new char[4 + len + 1];
        *ref_count() = 1;               // initial ref count = 1
        std::strcpy(str_data(), str);
    }

    // COW: ensure exclusive memory before write
    void detach() {
        if (*ref_count() > 1) {
            char* old_data = data;  // remember the shared block
            // Allocate a private block (ref count 1) and copy the string into it
            allocate(old_data + 4);
            // Drop this object's reference to the original shared block
            --(*reinterpret_cast<int*>(old_data));
        }
    }

public:
    // Constructor
    COWString(const char* str = "") { allocate(str); }

    // Copy constructor: share memory, increase ref count
    COWString(const COWString& other) : data(other.data) { (*ref_count())++; }

    // Assignment operator
    COWString& operator=(const COWString& other) {
        if (this != &other) {
            // Release current memory
            if (--(*ref_count()) == 0) {
                delete[] data;
            }
            // Share new memory, increase ref count
            data = other.data;
            (*ref_count())++;
        }
        return *this;
    }

    // Destructor
    ~COWString() {
        if (--(*ref_count()) == 0) {
            delete[] data;
        }
    }

    // Read (no copy)
    char operator[](int index) const { return str_data()[index]; }

    // Write (triggers copy)
    char& operator[](int index) {
        detach();
        return str_data()[index];
    }

    // Debug print
    void print(const char* name) const {
        std::cout << name << ": \"" << str_data() << "\" (ref_count: " << *ref_count() << ")\n";
    }
};

int main() {
    COWString s1 = "hello";
    s1.print("s1");  // s1: "hello" (ref_count: 1)

    COWString s2 = s1;  // copy, share memory
    s1.print("s1");  // s1: "hello" (ref_count: 2)
    s2.print("s2");  // s2: "hello" (ref_count: 2)

    s2[0] = 'H';      // modify s2, trigger COW
    s1.print("s1");  // s1: "hello" (ref_count: 1)
    s2.print("s2");  // s2: "Hello" (ref_count: 1)

    return 0;
}
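Since shared_ptr already maintains a reference count, the same detach‑before‑write logic can be layered on top of it instead of being hand‑rolled. A minimal single‑threaded sketch; `CowPtr` is a hypothetical name, and `use_count()` is only a reliable sharing test when no other thread is copying the handle at the same time:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Copy-on-write handle built on std::shared_ptr (single-threaded sketch).
template <typename T>
class CowPtr {
public:
    explicit CowPtr(T value) : ptr_(std::make_shared<T>(std::move(value))) {}

    // Read access never copies.
    const T& read() const { return *ptr_; }

    // Write access detaches first if the value is shared with another handle.
    T& write() {
        if (ptr_.use_count() > 1) {
            ptr_ = std::make_shared<T>(*ptr_);  // private copy for this handle
        }
        assert(ptr_.use_count() == 1);          // this handle is now the sole owner
        return *ptr_;
    }

    // Number of handles currently sharing the value.
    long use_count() const { return ptr_.use_count(); }

    // True if both handles point at the same underlying object.
    bool shares_with(const CowPtr& other) const { return ptr_ == other.ptr_; }

private:
    std::shared_ptr<T> ptr_;
};
```

Copying a `CowPtr<std::string>` only bumps the count; calling `write()` on one of the copies gives it a private string and leaves the other handle untouched. The compiler‑generated copy constructor, assignment operator, and destructor handle the counting that the two string classes above implement by hand.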

3. How Linux Implements COW

3.1 fork and exec "golden pair": best scenario for COW

In Linux, process creation uses a unique two‑step method: fork + exec. This combination pairs perfectly with COW, greatly improving process‑creation efficiency and resource utilization.

The fork system call creates a child that is almost identical to the parent. Traditionally, fork would copy all of the parent’s memory, which is time‑consuming and memory‑intensive. With COW, fork copies only the page tables and the task_struct. This is far cheaper than copying the memory itself and is often well under a millisecond for a small process, although the cost still grows with the size of the parent’s address space.

The exec system call then loads a new program into the current process, replacing most of the inherited memory. In many real‑world cases, a child process immediately calls exec after being forked (e.g., typing "ls -l" in a shell). Because the child has written to almost none of the shared pages by that point, almost no memory is copied, avoiding the performance penalty of traditional fork.
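The fork + exec pairing can be sketched in a few lines. This is a simplified illustration of what a shell does for a command like "ls -l", assuming a POSIX system; `run_command` is a hypothetical helper, and a real shell additionally handles signals, job control, and redirections.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Run a program the way a shell does: fork, exec in the child, wait in the parent.
// Returns the child's exit status, or -1 if it could not be obtained.
int run_command(char* const argv[]) {
    assert(argv != nullptr && argv[0] != nullptr);
    pid_t pid = fork();         // no bulk memory copy: pages are shared until written
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(argv[0], argv);  // replaces the child's address space with the new program
        _exit(127);             // reached only if exec failed
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Between `fork()` and `execvp()` the child touches only a handful of pages, so nearly all of the parent's memory is never duplicated before exec discards the child's mappings.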

3.2 Page‑fault handling: kernel response to write

When a process under COW tries to write to a shared read‑only page, the MMU detects the violation and raises a page‑fault. The CPU pauses the process and transfers control to the kernel’s page‑fault handler.

The architecture’s page‑fault entry point (do_page_fault() on older x86 kernels) looks up the faulting virtual address and, for a write to a present but write‑protected page, ends up in do_wp_page(). If the page is no longer shared (its reference count is 1), the kernel simply makes the existing mapping writable. If the page is still shared, the kernel allocates a new physical page, copies the data, updates the page table for the writer, and drops the writer’s reference to the original page.

This handling is completely transparent to the process; the kernel performs all copying and permission changes behind the scenes.
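Although the handling is transparent, its effect shows up in the process's fault counters. The following is a rough sketch, assuming Linux with `getrusage` reporting minor faults in `ru_minflt`; `cow_faults_in_child` is a hypothetical helper. The parent populates some anonymous pages, forks, and the child counts the minor faults it takes while writing one byte to each inherited page.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the number of minor page faults the child takes while writing one
// byte to each of `pages` inherited pages, or -1 on error.
long cow_faults_in_child(std::size_t pages) {
    assert(pages > 0);
    const std::size_t page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t length = pages * page_size;
    void* mem = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    volatile char* buf = static_cast<volatile char*>(mem);
    for (std::size_t i = 0; i < pages; ++i) buf[i * page_size] = 1;  // populate in parent

    int fds[2];
    if (pipe(fds) != 0) { munmap(mem, length); return -1; }

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); munmap(mem, length); return -1; }
    if (pid == 0) {  // child
        close(fds[0]);
        rusage before{}, after{};
        getrusage(RUSAGE_SELF, &before);
        for (std::size_t i = 0; i < pages; ++i) buf[i * page_size] = 2;  // first writes
        getrusage(RUSAGE_SELF, &after);
        long delta = after.ru_minflt - before.ru_minflt;
        ssize_t n = write(fds[1], &delta, sizeof delta);
        _exit(n == static_cast<ssize_t>(sizeof delta) ? 0 : 1);
    }

    close(fds[1]);   // parent
    long delta = -1;
    ssize_t n = read(fds[0], &delta, sizeof delta);
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    bool parent_unchanged = (buf[0] == 1);  // the child's writes never reach the parent
    munmap(mem, length);
    return (n == static_cast<ssize_t>(sizeof delta) && parent_unchanged) ? delta : -1;
}
```

On a typical Linux system one would expect the returned count to be close to the number of pages written, since each first write hits a write‑protected page. The exact figure depends on the kernel and its configuration, so treat it as an observation rather than a guarantee.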

3.3 Differences with vfork and clone: COW is not “share forever”

Besides the COW‑based fork, Linux also provides vfork and clone. vfork is more aggressive: the child shares the parent’s entire virtual address space and the parent is blocked until the child calls exec or exit. This can lead to dangerous side effects if the child modifies memory before exec.

clone allows fine‑grained sharing via flags. Using CLONE_VM makes parent and child share the same virtual memory (similar to vfork). When clone is used without CLONE_VM, it defaults to the same COW strategy as fork.

Threads created via pthread_create use clone with CLONE_VM, meaning they share the address space and do not trigger COW, unlike processes which need independent memory.
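The effect of CLONE_VM can be checked directly with the glibc `clone()` wrapper. A minimal sketch, assuming Linux with glibc (g++ defines `_GNU_SOURCE` by default, which `clone()` requires) and a downward‑growing stack; `child_entry` and `parent_view_after_clone` are hypothetical names.

```cpp
#include <cassert>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <vector>

// Entry point for the cloned child: overwrite the int it was handed.
static int child_entry(void* arg) {
    *static_cast<int*>(arg) = 42;
    return 0;
}

// Clone a child with the given extra flags and report the value the parent
// sees afterwards. Returns -1 if clone() or waitpid() fails.
int parent_view_after_clone(int extra_flags) {
    // CLONE_THREAD children cannot be reaped with waitpid(), so exclude them here.
    assert((extra_flags & CLONE_THREAD) == 0);
    int value = 7;
    std::vector<char> stack(256 * 1024);            // stack memory for the child
    void* stack_top = stack.data() + stack.size();  // assumes the stack grows downward
    pid_t pid = clone(child_entry, stack_top, extra_flags | SIGCHLD, &value);
    if (pid < 0) return -1;
    if (waitpid(pid, nullptr, 0) < 0) return -1;
    return value;
}
```

`parent_view_after_clone(CLONE_VM)` should return 42, because the child wrote into the shared address space. `parent_view_after_clone(0)` should return 7, because without CLONE_VM the child's write went to its own COW copy of the page.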

4. Linux COW Case Studies

4.1 Redis persistence: BGSAVE’s “memory‑saving” secret

Redis uses COW during the BGSAVE command to create an RDB snapshot without blocking the main process. The child process created by fork shares the parent’s memory; because the child only reads data, no copying occurs. When the parent later writes (e.g., updating a key), COW creates a new page for the modified data, while the child continues to read the original page for the snapshot. This ensures snapshot consistency and allows the main process to keep handling client requests.

#include <iostream>
#include <string>
#include <unordered_map>
#include <thread>
#include <mutex>
#include <chrono>
#include <memory>
#include <vector>

// Simulated memory page structure
struct MemoryPage {
    std::unordered_map<std::string, std::string> data; // key‑value pairs in the page
    int ref_count;  // reference count
    std::mutex mtx; // protect page operations
    MemoryPage() : ref_count(1) {}
};

// Simulated Redis server
class RedisServer {
private:
    std::vector<std::shared_ptr<MemoryPage>> pages; // collection of pages
    bool bgsave_running; // whether BGSAVE is running
    std::mutex server_mtx;

    // Find the page that contains a given key (simplified hash mod)
    std::shared_ptr<MemoryPage> find_page(const std::string& key) {
        size_t hash = std::hash<std::string>{}(key);
        size_t idx = hash % pages.size();
        return pages[idx];
    }

public:
    RedisServer(size_t page_count = 4) : bgsave_running(false) {
        for (size_t i = 0; i < page_count; ++i) {
            pages.emplace_back(std::make_shared<MemoryPage>());
        }
    }

    // Write a key‑value pair
    void set(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> lock(server_mtx);
        auto page = find_page(key);
        std::lock_guard<std::mutex> page_lock(page->mtx);
        if (bgsave_running && page->ref_count > 1) {
            // Trigger COW: create a new page copy
            auto new_page = std::make_shared<MemoryPage>();
            new_page->data = page->data;
            size_t hash = std::hash<std::string>{}(key);
            size_t idx = hash % pages.size();
            pages[idx] = new_page;
            page->ref_count--;
            page = new_page;
            std::cout << "[Parent] Triggered COW for key: " << key << std::endl;
        }
        page->data[key] = value;
    }

    // Read a key
    std::string get(const std::string& key) {
        std::lock_guard<std::mutex> lock(server_mtx);
        auto page = find_page(key);
        std::lock_guard<std::mutex> page_lock(page->mtx);
        auto it = page->data.find(key);
        return it != page->data.end() ? it->second : "";
    }

    // Simulate BGSAVE command
    void bgsave() {
        std::vector<std::shared_ptr<MemoryPage>> snapshot_pages;
        {
            std::lock_guard<std::mutex> lock(server_mtx);
            if (bgsave_running) {
                std::cout << "BGSAVE already running" << std::endl;
                return;
            }
            bgsave_running = true;
            // Capture the page set at "fork" time so later writes cannot alter the
            // snapshot, and mark every captured page as shared
            snapshot_pages = pages;
            for (auto& page : snapshot_pages) {
                std::lock_guard<std::mutex> page_lock(page->mtx);
                page->ref_count++;
            }
        }
        // Start child thread to simulate child process
        std::thread([this, snapshot_pages]() {
            std::cout << "[Child] Starting RDB snapshot..." << std::endl;
            std::this_thread::sleep_for(std::chrono::seconds(2)); // simulate I/O
            std::cout << "[Child] RDB snapshot content:" << std::endl;
            for (const auto& page : snapshot_pages) {
                std::lock_guard<std::mutex> page_lock(page->mtx);
                for (const auto& [key, value] : page->data) {
                    std::cout << "  " << key << " => " << value << std::endl;
                }
            }
            std::cout << "[Child] RDB snapshot completed" << std::endl;
            // Decrease reference counts after snapshot
            {
                std::lock_guard<std::mutex> lock(server_mtx);
                for (auto& page : snapshot_pages) {
                    std::lock_guard<std::mutex> page_lock(page->mtx);
                    page->ref_count--;
                }
                bgsave_running = false;
            }
        }).detach();
    }
};

int main() {
    RedisServer redis;
    redis.set("name", "redis");
    redis.set("version", "6.2.5");
    redis.set("mode", "cluster");
    std::cout << "Initial data:" << std::endl;
    std::cout << "name: " << redis.get("name") << std::endl;
    std::cout << "version: " << redis.get("version") << std::endl;
    std::cout << std::endl;
    redis.bgsave();
    std::this_thread::sleep_for(std::chrono::milliseconds(500)); // wait for child start
    redis.set("version", "7.0.0"); // this modification triggers COW
    redis.set("author", "antirez"); // new key: triggers COW only if its page is still shared
    std::cout << std::endl << "Parent data after modifications:" << std::endl;
    std::cout << "version: " << redis.get("version") << std::endl;
    std::cout << "author: " << redis.get("author") << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(3)); // wait for child
    return 0;
}
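The thread‑based simulation above models COW by hand. With a real fork the kernel does the work, and the pattern becomes much shorter. The following sketch illustrates that pattern and is not Redis's actual code: `Store`, `serialize`, and `snapshot_during_update` are hypothetical names, and a pipe stands in for the RDB file. A second pipe makes the child wait until the parent has already modified the data, to show that the snapshot still reflects the state at fork time.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

using Store = std::map<std::string, std::string>;

// Serialize the store as "key=value;" pairs in key order.
std::string serialize(const Store& store) {
    std::string out;
    for (const auto& kv : store) out += kv.first + "=" + kv.second + ";";
    return out;
}

// Fork a snapshot child, update an existing key in the parent, then let the
// child serialize what it sees. Returns the child's snapshot ("" on error).
std::string snapshot_during_update(Store& store, const std::string& key,
                                   const std::string& new_value) {
    assert(store.count(key) == 1);  // the demo overwrites an existing key
    int go[2], out[2];              // go: parent -> child signal, out: child -> parent data
    if (pipe(go) != 0) return "";
    if (pipe(out) != 0) { close(go[0]); close(go[1]); return ""; }

    pid_t pid = fork();             // the child inherits a point-in-time view
    if (pid < 0) {
        close(go[0]); close(go[1]); close(out[0]); close(out[1]);
        return "";
    }
    if (pid == 0) {                 // child: wait, then dump its own view
        close(go[1]); close(out[0]);
        char token;
        if (read(go[0], &token, 1) != 1) _exit(1);
        std::string snap = serialize(store);
        ssize_t n = write(out[1], snap.data(), snap.size());
        _exit(n == static_cast<ssize_t>(snap.size()) ? 0 : 1);
    }

    close(go[0]); close(out[1]);    // parent: keep serving writes
    store[key] = new_value;         // touches only the parent's pages
    char token = 'g';
    ssize_t sent = write(go[1], &token, 1);
    close(go[1]);

    std::string snapshot;
    char buf[256];
    ssize_t n;
    while ((n = read(out[0], buf, sizeof buf)) > 0) snapshot.append(buf, n);
    close(out[0]);
    waitpid(pid, nullptr, 0);
    return sent == 1 ? snapshot : "";
}
```

Starting from `{"name": "redis", "version": "6.2.5"}` and updating `version` to `7.0.0`, the child's snapshot still contains `version=6.2.5` while the parent's store holds the new value.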

4.2 Docker images: layered storage “COW foundation”

Docker’s image and container management relies heavily on COW, especially in its layered storage design. Docker images use a union‑file‑system (e.g., AUFS, OverlayFS). The base layer (e.g., an Ubuntu image) is marked read‑only and shared among many containers. When a container modifies a file, COW copies the affected file to the container’s writable layer, leaving the base layer unchanged. This sharing saves storage space and speeds up container startup.

#include <iostream>
#include <unordered_map>
#include <string>
#include <vector>
#include <memory>

// Represents an image layer
struct Layer {
    std::string id;                                 // unique layer ID
    bool is_readonly;                               // read‑only flag
    std::unordered_map<std::string, std::string> files; // path → content
    std::shared_ptr<Layer> parent;                   // parent layer
    Layer(std::string id, bool readonly, std::shared_ptr<Layer> parent = nullptr)
        : id(std::move(id)), is_readonly(readonly), parent(std::move(parent)) {}
};

// Simulated Docker image composed of multiple read‑only layers
class DockerImage {
private:
    std::vector<std::shared_ptr<Layer>> layers; // from base to top
public:
    void add_layer(const std::shared_ptr<Layer>& layer) { layers.push_back(layer); }
    std::shared_ptr<Layer> get_top_layer() const { return layers.empty() ? nullptr : layers.back(); }
    void print_layers() const {
        std::cout << "Image layer structure (base to top):" << std::endl;
        for (const auto& layer : layers) {
            std::cout << "  Layer ID: " << layer->id << " (read‑only: " << std::boolalpha << layer->is_readonly << ")" << std::endl;
        }
    }
};

// Simulated Docker container with a writable layer on top of a read‑only image
class Container {
private:
    std::string id;
    std::shared_ptr<Layer> base_layer;    // top read‑only layer of the image
    std::shared_ptr<Layer> writable_layer; // container‑specific writable layer

    // Recursively search for a file in the layer hierarchy
    std::shared_ptr<Layer> find_file_layer(const std::string& path, std::shared_ptr<Layer> current) const {
        if (!current) return nullptr;
        if (current->files.count(path)) return current;
        return find_file_layer(path, current->parent);
    }
public:
    Container(std::string id, const std::shared_ptr<Layer>& base) : id(std::move(id)), base_layer(base) {
        writable_layer = std::make_shared<Layer>("writable-" + this->id, false, base_layer);
    }
    // Read a file (COW read): prefer writable layer, otherwise read from base
    std::string read_file(const std::string& path) const {
        if (writable_layer->files.count(path)) {
            std::cout << "[Container" << id << "] Read from writable layer: " << path << std::endl;
            return writable_layer->files.at(path);
        }
        auto file_layer = find_file_layer(path, base_layer);
        if (file_layer) {
            std::cout << "[Container" << id << "] Read from base layer (" << file_layer->id << "): " << path << std::endl;
            return file_layer->files.at(path);
        }
        return "File not found";
    }
    // Write a file (COW write): if the file exists in a read‑only layer, copy it first
    void write_file(const std::string& path, const std::string& content) {
        // Copy up only on the first write; later writes hit the writable layer directly
        if (!writable_layer->files.count(path)) {
            auto file_layer = find_file_layer(path, base_layer);
            if (file_layer && file_layer->is_readonly) {
                std::cout << "[Container" << id << "] Trigger COW, copy from read‑only layer (" << file_layer->id << ") to writable layer: " << path << std::endl;
                writable_layer->files[path] = file_layer->files.at(path); // copy original content
            }
        }
        writable_layer->files[path] = content;
        std::cout << "[Container" << id << "] Modified file in writable layer: " << path << std::endl;
    }
    void print_filesystem() const {
        std::cout << "\n[Container" << id << "] Filesystem state:" << std::endl;
        std::cout << "  Writable layer files:" << std::endl;
        for (const auto& [path, content] : writable_layer->files) {
            std::cout << "    " << path << " => " << content << std::endl;
        }
    }
};

int main() {
    // Build base image (Ubuntu)
    auto base_layer = std::make_shared<Layer>("base-ubuntu", true);
    base_layer->files["/etc/os-release"] = "NAME=Ubuntu VERSION=20.04";
    base_layer->files["/bin/bash"] = "bash-executable";
    // Add a tools layer
    auto tools_layer = std::make_shared<Layer>("tools-git", true, base_layer);
    tools_layer->files["/usr/bin/git"] = "git-executable";
    // Assemble Docker image
    DockerImage ubuntu_image;
    ubuntu_image.add_layer(base_layer);
    ubuntu_image.add_layer(tools_layer);
    ubuntu_image.print_layers();
    // Launch two containers from the same image
    Container c1("c1", ubuntu_image.get_top_layer());
    Container c2("c2", ubuntu_image.get_top_layer());
    // Container 1 reads and modifies a base file
    std::cout << "\n===== Container 1 operations =====" << std::endl;
    c1.read_file("/etc/os-release");
    c1.write_file("/etc/os-release", "NAME=Ubuntu VERSION=22.04"); // triggers COW
    c1.read_file("/etc/os-release");
    // Container 2 reads the same file (unchanged)
    std::cout << "\n===== Container 2 operations =====" << std::endl;
    c2.read_file("/etc/os-release");
    // Container 2 installs a new package (writes new file, no COW)
    c2.write_file("/usr/bin/nginx", "nginx-executable");
    // Show filesystem state of both containers
    c1.print_filesystem();
    c2.print_filesystem();
    return 0;
}

4.3 Linux file systems: snapshots and backup “security guarantee”

Copy‑on‑write file systems such as Btrfs support snapshots natively, and XFS can share data blocks between files through reflinks. ext4 has no native snapshot support and typically relies on LVM snapshots, which apply the same idea at the block‑device level. When a snapshot is created, the filesystem records metadata (inode) and shares the actual data blocks with the original file. If the original file is later modified, COW writes the affected data to a new block for the file, while the snapshot continues to reference the original block, preserving the pre‑modification state.

#include <iostream>
#include <unordered_map>
#include <string>
#include <vector>
#include <memory>
#include <chrono>
#include <ctime>

// Simulated data block storing actual file content
struct DataBlock {
    std::string content; // block content
    int ref_count;       // reference count (shared by snapshots and file)
    std::string block_id; // unique ID
    DataBlock(std::string id, std::string data) : content(std::move(data)), ref_count(1), block_id(std::move(id)) {}
};

// Simulated inode storing file metadata and pointers to data blocks
struct Inode {
    std::string path;
    std::string owner;
    std::string permissions;
    time_t create_time;
    std::vector<std::shared_ptr<DataBlock>> blocks;
    Inode(std::string p, std::string o, std::string perms) : path(std::move(p)), owner(std::move(o)), permissions(std::move(perms)) {
        create_time = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
    }
    // Copy for snapshot (shares data blocks, increments ref counts)
    std::shared_ptr<Inode> copy() const {
        auto new_inode = std::make_shared<Inode>(path, owner, permissions);
        new_inode->create_time = create_time;
        new_inode->blocks = blocks;
        for (auto& blk : new_inode->blocks) {
            blk->ref_count++;
        }
        return new_inode;
    }
};

// Simulated file system
class FileSystem {
private:
    std::unordered_map<std::string, std::shared_ptr<Inode>> inodes; // path → inode
    std::unordered_map<std::string, std::shared_ptr<DataBlock>> blocks; // block ID → block
    int block_counter = 0;
    // Create a new data block
    std::shared_ptr<DataBlock> create_block(const std::string& content) {
        std::string block_id = "block-" + std::to_string(++block_counter);
        auto blk = std::make_shared<DataBlock>(block_id, content);
        blocks[block_id] = blk;
        return blk;
    }
public:
    // Create a file with initial data chunks
    void create_file(const std::string& path, const std::string& owner, const std::string& perms, const std::vector<std::string>& data_chunks) {
        if (inodes.count(path)) {
            std::cout << "File exists: " << path << std::endl;
            return;
        }
        auto inode = std::make_shared<Inode>(path, owner, perms);
        for (const auto& chunk : data_chunks) {
            inode->blocks.push_back(create_block(chunk));
        }
        inodes[path] = inode;
        std::cout << "Created file: " << path << std::endl;
    }
    // Write to a specific block (may trigger COW)
    void write_file(const std::string& path, int block_idx, const std::string& new_content) {
        if (!inodes.count(path)) {
            std::cout << "File not found: " << path << std::endl;
            return;
        }
        auto inode = inodes[path];
        if (block_idx < 0 || block_idx >= static_cast<int>(inode->blocks.size())) {
            std::cout << "Invalid block index" << std::endl;
            return;
        }
        auto target = inode->blocks[block_idx];
        if (target->ref_count > 1) {
            std::cout << "COW triggered, copying block: " << target->block_id << std::endl;
            auto new_block = create_block(new_content);
            target->ref_count--;
            inode->blocks[block_idx] = new_block;
        } else {
            target->content = new_content;
            std::cout << "Directly modified block: " << target->block_id << std::endl;
        }
    }
    // Read file content
    void read_file(const std::string& path) const {
        if (!inodes.count(path)) {
            std::cout << "File not found: " << path << std::endl;
            return;
        }
        auto inode = inodes.at(path);
        std::cout << "\nRead file: " << path << std::endl;
        std::cout << "Metadata – Owner: " << inode->owner << ", Permissions: " << inode->permissions << std::endl;
        std::cout << "Content:" << std::endl;
        for (size_t i = 0; i < inode->blocks.size(); ++i) {
            std::cout << "  Block" << i << ": " << inode->blocks[i]->content << " (ref_count: " << inode->blocks[i]->ref_count << ")" << std::endl;
        }
    }
    // Create a snapshot (shares data blocks)
    std::shared_ptr<Inode> create_snapshot(const std::string& path) {
        if (!inodes.count(path)) {
            std::cout << "File not found: " << path << std::endl;
            return nullptr;
        }
        auto snap = inodes[path]->copy();
        std::cout << "\nCreated snapshot for: " << path << std::endl;
        return snap;
    }
    // Read snapshot content
    static void read_snapshot(const std::shared_ptr<Inode>& snap) {
        if (!snap) {
            std::cout << "Invalid snapshot" << std::endl;
            return;
        }
        std::cout << "\nRead snapshot content: " << snap->path << std::endl;
        for (size_t i = 0; i < snap->blocks.size(); ++i) {
            std::cout << "  Block" << i << ": " << snap->blocks[i]->content << " (ref_count: " << snap->blocks[i]->ref_count << ")" << std::endl;
        }
    }
};

int main() {
    FileSystem fs;
    // Create a file with three chapters
    fs.create_file("/data/report.txt", "user1", "rw-r--r--", {"Chapter 1: Introduction", "Chapter 2: Technical Principles", "Chapter 3: Results"});
    fs.read_file("/data/report.txt");
    // Create a snapshot
    auto snapshot = fs.create_snapshot("/data/report.txt");
    // Modify original file (triggers COW on block 1)
    std::cout << "\nModifying original file block 1..." << std::endl;
    fs.write_file("/data/report.txt", 1, "Chapter 2: Detailed COW Principles");
    fs.read_file("/data/report.txt");
    // Read snapshot (should show original content)
    FileSystem::read_snapshot(snapshot);
    return 0;
}

5. Advantages and Disadvantages of COW

5.1 Advantages: why COW became Linux’s “standard”

Memory efficiency: many processes only ever read the data they inherit (e.g., multiple shell children that use the same already‑loaded configuration). With COW, they share a single physical copy of every page that nobody modifies, which sharply reduces memory consumption. Savings of up to 80 % are cited for read‑heavy multi‑process workloads, although the real figure depends on how many pages each process ends up writing. The memory saved allows more processes to run concurrently.

Performance boost: a traditional fork copies every page of the parent, so its cost grows with the parent’s memory and can easily reach milliseconds. A COW‑based fork copies only the page tables, and for a small process it often completes in microseconds. In high‑concurrency servers (e.g., web servers spawning many workers), this reduces latency and helps prevent request backlogs.

Transparency: developers do not need to modify code to benefit from COW; the kernel handles sharing and copying automatically, lowering development effort.
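This transparency is easy to observe from user space. The sketch below (the function name is illustrative, not part of any library) forks, lets the child overwrite an inherited buffer, and checks that the parent still sees its original bytes. The program contains no COW‑specific code at all; the kernel does the sharing and copying behind the scenes.

```cpp
#include <cassert>
#include <cstring>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Returns true if the parent's buffer is unchanged after the child
// overwrote its own view of the same buffer.
bool parent_unaffected_by_child_write() {
    std::vector<char> buf(4096 * 4, 'P');  // filled by the parent before fork

    pid_t pid = fork();
    if (pid < 0) return false;             // fork failed

    if (pid == 0) {
        // Child: ordinary writes; the kernel gives the child private pages.
        std::memset(buf.data(), 'C', buf.size());
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return false;
    const bool child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;

    // Parent: every byte should still hold the value written before fork.
    bool parent_ok = true;
    for (char c : buf) {
        if (c != 'P') { parent_ok = false; break; }
    }
    return child_ok && parent_ok;
}
```

Calling `parent_unaffected_by_child_write()` from `main` should return true on any system with a working `fork`.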

5.2 Disadvantages: when COW can be counter‑productive

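Write‑heavy scenarios can cause a “page‑fault storm”. If parent or child goes on to modify most of the shared pages (e.g., a database that keeps updating its dataset while a forked child is still writing a snapshot), the first write to each shared page triggers a page fault and a page copy. Later writes to the same page are cheap because the page is then private, but the initial wave of faults consumes CPU time and memory bandwidth, and in the worst case it costs more than a simple up‑front full copy would.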
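A rough way to see this cost is to count minor page faults in a forked child. The following sketch assumes Linux with `getrusage` reporting `ru_minflt`; the names `FaultCounts` and `count_cow_faults` are made up for illustration. The child writes one byte to every inherited page twice and reports how many faults each pass caused.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct FaultCounts {
    long first_pass;   // minor faults during the first write to every page
    long second_pass;  // minor faults when the same pages are written again
};

static long minor_faults() {
    struct rusage ru {};
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

// Returns {-1, -1} if any system call fails.
FaultCounts count_cow_faults(std::size_t pages) {
    const FaultCounts failed{-1, -1};
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t len = pages * page;
    void* mem = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return failed;

    volatile char* p = static_cast<volatile char*>(mem);
    for (std::size_t i = 0; i < pages; ++i) p[i * page] = 1;  // make pages resident in the parent

    int fds[2];
    if (pipe(fds) != 0) { munmap(mem, len); return failed; }
    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); munmap(mem, len); return failed; }

    if (pid == 0) {
        close(fds[0]);
        const long start = minor_faults();
        for (std::size_t i = 0; i < pages; ++i) p[i * page] = 2;  // first write to each shared page
        const long mid = minor_faults();
        for (std::size_t i = 0; i < pages; ++i) p[i * page] = 3;  // pages are private by now
        const long end = minor_faults();
        const FaultCounts fc{mid - start, end - mid};
        const ssize_t n = write(fds[1], &fc, sizeof fc);
        _exit(n == static_cast<ssize_t>(sizeof fc) ? 0 : 1);
    }

    close(fds[1]);
    FaultCounts fc = failed;
    const ssize_t n = read(fds[0], &fc, sizeof fc);
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    munmap(mem, len);
    return n == static_cast<ssize_t>(sizeof fc) ? fc : failed;
}
```

On a typical Linux system one would expect `first_pass` to be close to the number of pages and `second_pass` to be close to zero, though exact values vary with kernel version and configuration.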

Memory fragmentation: COW copies are allocated one page at a time, so after many of them a process’s pages end up scattered across physical memory. This makes large contiguous regions harder to allocate and can lower overall memory utilization. Linux has compaction mechanisms, but fragmentation can still affect performance.

Large page‑table copy latency: when a process holds a huge amount of memory (tens of gigabytes), its page tables become large as well. Forking such a process means copying all of those page tables, which causes a noticeable pause in the parent. In latency‑sensitive applications (e.g., financial trading systems), this pause can be a problem.
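To get a feel for the numbers, here is a back‑of‑the‑envelope estimate. It assumes 4 KiB pages and 8‑byte page‑table entries (as on x86‑64 without huge pages), counts only the lowest page‑table level, and assumes the memory is fully populated; the function name is illustrative.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;  // assumed page size in bytes
constexpr std::uint64_t kPteSize = 8;      // assumed size of one page-table entry

// Bytes of last-level page-table entries needed to map `mapped_bytes`.
constexpr std::uint64_t leaf_page_table_bytes(std::uint64_t mapped_bytes) {
    const std::uint64_t pages = (mapped_bytes + kPageSize - 1) / kPageSize;  // round up
    return pages * kPteSize;
}

// 32 GiB of mapped memory needs about 64 MiB of page-table entries.
static_assert(leaf_page_table_bytes(32ULL << 30) == (64ULL << 20),
              "32 GiB -> 64 MiB of PTEs");
```

Under these assumptions the page tables are about 1/512 of the mapped memory, so a process with tens of gigabytes resident has tens of megabytes of page tables that fork has to walk and copy.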

6. Using COW to Optimize Your Linux Program

6.1 Core value: “delayed thinking” for resource allocation

COW’s core value lies in its “delayed” strategy: resources are allocated only when truly needed. In memory management, COW avoids copying the whole address space during fork; only when a process writes does it receive a private copy. This principle extends to other components such as filesystem snapshots (Btrfs, ZFS) and database snapshots (SQL Server, Redis), where only changed blocks are copied, minimizing time and space overhead while preserving consistency.
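The same “copy only on the first write” idea can be applied inside a program. The class below is a hypothetical, single‑threaded illustration (it is not how any particular library implements strings): copies share one buffer through `std::shared_ptr`, and a write duplicates the buffer only if another copy still refers to it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write string wrapper (not thread-safe).
class CowString {
public:
    explicit CowString(std::string s)
        : data_(std::make_shared<std::string>(std::move(s))) {}

    // Reads never copy.
    const std::string& read() const { return *data_; }

    // Writes copy the buffer first if it is still shared.
    void write(std::size_t pos, char c) {
        detach();
        data_->at(pos) = c;
    }

    bool shares_storage_with(const CowString& other) const {
        return data_ == other.data_;
    }

private:
    void detach() {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);  // private copy
        }
    }

    std::shared_ptr<std::string> data_;
};
```

Copying a `CowString` is cheap regardless of its length; the cost of duplicating the data is paid only by the copy that is actually modified, which mirrors what the kernel does per page after fork.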

6.2 Practical advice: how to leverage COW in development

Process creation – prefer the “fork + exec” pattern: when you need to start a new program, call fork and follow it immediately with exec. fork creates a lightweight child that shares the parent’s pages; exec then replaces the child’s address space, so the child touches almost none of the shared pages and almost no COW copies are made. This is ideal for servers that launch a separate program to handle new requests.
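A minimal sketch of the pattern, using the POSIX calls `fork`, `execvp`, and `waitpid`; the helper name `run_program` is made up for illustration. The argument array is built before fork so the child does as little as possible before calling exec.

```cpp
#include <cassert>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Runs args[0] with the given arguments and returns its exit status,
// or -1 if the child could not be created or did not exit normally.
int run_program(const std::vector<std::string>& args) {
    if (args.empty()) return -1;

    // Build argv in the parent; the child only calls execvp.
    std::vector<char*> argv;
    for (const auto& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return -1;

    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);  // reached only if exec failed
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `run_program({"/bin/sh", "-c", "exit 7"})` returns the shell's exit status, and a program that cannot be executed yields 127 from the `_exit` fallback.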

High‑write concurrency – minimize parent‑child memory interaction: in write‑heavy scenarios (e.g., cache systems), keep child processes read‑only and perform writes in separate processes or threads, so that pages shared at fork time are written as rarely as possible. This limits COW copies and improves stability.

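Diagnose fork‑blocking issues – watch parent memory size: if fork appears to block, check the parent’s memory footprint. Large memory ⇒ large page tables ⇒ longer copy time. Strategies include reducing memory usage before fork and using incremental process creation. Disabling transparent huge pages is often recommended as well; note that this mainly lowers the cost of COW copies after fork, since huge pages actually make the page tables smaller.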
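One way to check the footprint from inside the process is to read `/proc/self/status`, which on Linux reports the resident set size as `VmRSS` and the page‑table size as `VmPTE`. The helpers below are a sketch with illustrative names; the parser works on any text in that `Key:   value kB` format.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>

// Returns the numeric value of a "Key:   12345 kB" line, or -1 if the key
// is missing or its value is not a number.
long status_field_kb(const std::string& status_text, const std::string& key) {
    std::istringstream in(status_text);
    const std::string prefix = key + ":";
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            std::istringstream fields(line.substr(prefix.size()));
            long kb = -1;
            if (fields >> kb) return kb;
            return -1;
        }
    }
    return -1;
}

// Reads the calling process's status file (empty string if unavailable).
std::string read_self_status() {
    std::ifstream f("/proc/self/status");
    return std::string(std::istreambuf_iterator<char>(f),
                       std::istreambuf_iterator<char>());
}
```

Logging `status_field_kb(read_self_status(), "VmPTE")` and the `VmRSS` value just before a fork gives a quick indication of how much page‑table data the kernel has to copy.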

By applying these practices, you can fully exploit COW’s benefits while avoiding its pitfalls, resulting in more efficient and stable Linux applications.
