
Rethinking Multi‑Kernel Design: Isolation‑Based Architecture for Cloud Computing

This article presents an isolation‑focused multi‑kernel architecture that replaces replica‑based designs with dynamic resource management using device‑tree overlays, reuses existing Linux mechanisms such as kexec and hot‑plug, and enables zero‑downtime updates, hardware‑queue isolation, and fault‑tolerant kernel recovery for modern cloud infrastructures.

Introduction

Building on prior multi‑kernel research, we propose an isolation‑centric design that emphasizes high customisation and dynamic resource management, moving multi‑kernel systems from experimental prototypes toward practical cloud infrastructure.

1. Resource Management Challenges

1.1 Static Resource Management Issues

Traditional multi‑kernel systems (Barrelfish, Popcorn Linux, Jailhouse) rely on static partitioning or replication, leading to under‑utilised CPUs, lack of flexibility, and poor fit for elastic cloud workloads.

1.2 Key Insight

Static partitioning is merely the default state of a dynamic system; dynamic allocation is essential for any cloud‑targeted multi‑kernel.

2. Device‑Tree‑Based Resource Management

2.1 Describing Resources with Device Trees

Each spawn kernel receives a device‑tree blob via KHO (Kexec HandOver) that describes allocated CPUs, memory, and PCI devices. This leverages the mature, architecture‑agnostic device‑tree support already present in the Linux kernel.

/dts-v1/;
/ {
    compatible = "linux,multikernel";
    instances {
        web-server {
            id = <1>;
            resources {
                cpus = <1>;
                memory-bytes = <0x20000000>;   // 512 MiB
                devices = <&enp9s0_dev>;
            };
        };
    };
    enp9s0_dev: ethernet@0 {
        pci-id = "0000:09:00.0";
        vendor-id = <0x1af4>;
        device-id = <0x1041>;
    };
};
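
As an illustration of the consumer side, a spawn kernel's early userspace could parse this tree with libfdt. The blob location below (/sys/firmware/fdt, where device-tree-booted kernels expose it) and the node paths from the example above are assumptions of this design, not an existing interface; build with -lfdt.

#include <libfdt.h>
#include <stdio.h>

int main(void)
{
    /* Read the FDT blob; location assumed for illustration. */
    static char blob[1 << 16];
    FILE *f = fopen("/sys/firmware/fdt", "rb");
    if (!f || fread(blob, 1, sizeof(blob), f) == 0) {
        perror("fdt");
        return 1;
    }
    fclose(f);

    if (fdt_check_header(blob))
        return 1;

    /* Walk to the resources node from the section 2.1 example. */
    int node = fdt_path_offset(blob, "/instances/web-server/resources");
    if (node < 0)
        return 1;

    int len;
    const fdt32_t *mem = fdt_getprop(blob, node, "memory-bytes", &len);
    if (mem && len == sizeof(*mem))
        printf("memory: %u bytes\n", fdt32_to_cpu(*mem));
    return 0;
}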

2.2 Dynamic Adjustments via Device‑Tree Overlays

When the host kernel decides to reallocate resources, it generates an overlay describing the changes and sends it to the target kernel via IPI. The target applies the overlay, triggering hot‑plug of CPUs, memory, or devices.

/dts-v1/;
/plugin/;
/ {
    fragment@0 {
        target-path = "/";
        __overlay__ {
            multikernel-resources {
                #address-cells = <2>;
                #size-cells = <2>;
                cpu-remove {
                    mk,instance = "/";
                    #address-cells = <1>;
                    #size-cells = <0>;
                    cpu@2 { reg = <2>; };
                    cpu@3 { reg = <3>; };
                };
                cpu-add {
                    mk,instance = "web-server";
                    #address-cells = <1>;
                    #size-cells = <0>;
                    cpu@2 { reg = <2>; numa-node = <0>; };
                    cpu@3 { reg = <3>; numa-node = <0>; };
                };
            };
        };
    };
};

The initial device tree provides a static baseline, while overlays enable standardized, community‑accepted dynamic updates without inventing new protocols.

3. Reusing Existing Linux Infrastructure

3.1 Kernel Loading with kexec_file_load()

We load spawn kernels using the Linux kexec_file_load() system call, which already supports signature verification and fast reboot. By loading each kernel image into a separate memory region, we achieve isolation with minimal new code.
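
A minimal sketch of the loading step using the stock syscall (requires CAP_SYS_BOOT). The kernel path is a placeholder, and the multikernel-specific behaviour of starting the loaded image on its own CPUs rather than rebooting is kernel-side work not visible here.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/kexec.h>

int main(void)
{
    /* Placeholder path for the spawn kernel image. */
    int kernel_fd = open("/boot/vmlinuz-spawn", O_RDONLY);
    if (kernel_fd < 0) {
        perror("open kernel image");
        return 1;
    }

    const char *cmdline = "console=ttyS0";

    /* kexec_file_load(kernel_fd, initrd_fd, cmdline_len, cmdline, flags);
     * cmdline_len includes the terminating NUL. KEXEC_FILE_NO_INITRAMFS
     * skips the initrd, so initrd_fd is ignored. */
    if (syscall(SYS_kexec_file_load, kernel_fd, -1,
                strlen(cmdline) + 1, cmdline,
                (unsigned long)KEXEC_FILE_NO_INITRAMFS)) {
        perror("kexec_file_load");
        return 1;
    }
    return 0;
}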

3.2 State Transfer via KHO

KHO carries three components: an FDT describing the resources being handed over, the boot protocol that carries it across the kexec transition, and management of the scratch memory used during handover. It lets the host pass a complete device tree to the child kernel, replacing traditional kernel command-line parameters with a far more flexible interface.

3.3 CPU and Memory Hot‑Plug

We rely on Linux’s built‑in hot‑plug mechanisms for CPUs, memory, and PCI devices, enabling dynamic addition or removal of resources without extensive development effort.
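
For example, CPU hot-plug is already driveable from userspace through sysfs; here is a minimal sketch of offlining a CPU, as would happen after applying the cpu-remove overlay above (the CPU number is illustrative).

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Toggle a CPU via the standard sysfs hot-plug interface. */
static int set_cpu_online(int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    int ret = (write(fd, online ? "1" : "0", 1) == 1) ? 0 : -1;
    close(fd);
    return ret;
}

int main(void)
{
    if (set_cpu_online(2, 0))       /* illustrative: release CPU 2 */
        perror("offline cpu2");
    return 0;
}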

4. Managing Limited Hardware Resources

4.1 Single NIC Challenge

In systems with many CPUs but a single network interface, we assign each kernel its own dedicated hardware queues rather than relying on SR-IOV, which depends on specific NIC support.

4.2 Hardware Queue Isolation

Each kernel receives its own NIC queue, with XDP programs routing packets to the appropriate queue, delivering near‑native performance without virtualization overhead.
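
The routing piece can be sketched after the kernel's xdpsock sample: an XDP program that redirects packets arriving on a queue to whichever AF_XDP socket is registered for that queue, and otherwise lets them continue into the local stack. The map name and size are illustrative.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One AF_XDP socket slot per NIC queue; userspace populates it. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int mk_queue_redirect(struct xdp_md *ctx)
{
    __u32 qid = ctx->rx_queue_index;

    /* Queue owned by a child kernel: hand the frame to its AF_XDP
     * socket. Otherwise fall through to the host's network stack. */
    if (bpf_map_lookup_elem(&xsks_map, &qid))
        return bpf_redirect_map(&xsks_map, qid, XDP_PASS);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";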

4.3 Shared‑Memory Packet Transfer

We reuse AF_XDP's ring buffers for zero-copy packet transfer between the host (which owns the NIC driver) and child kernels, which see the shared rings as a software NIC.

4.4 Storage Access via ublk

The host runs a ublk server managing physical storage, while child kernels act as ublk clients, issuing I/O through shared io_uring buffers. LVM provides isolated storage views per kernel.
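
From a child kernel's perspective the volume is just a block device (ublk exposes them as /dev/ublkbN), so I/O looks like ordinary io_uring reads; the cross-kernel transport underneath is this design's extension and is not visible at this layer. A liburing sketch, with device name and sizes illustrative:

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    /* ublk block devices appear as /dev/ublkbN. */
    int fd = open("/dev/ublkb0", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096))
        return 1;

    /* Queue and submit a 4 KiB read of the first block. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}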

5. Cross‑Kernel Communication

5.1 IPI‑Based Messaging

Inter-processor interrupts (IPIs) provide lightweight cross-kernel signaling for:

Doorbell notifications for new shared-memory data

Coordination of resource allocation and updates

Urgent alerts and error handling

Synchronization for zero‑downtime updates

IPI combined with shared memory forms the basis for higher‑level protocols.
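
A conceptual sketch of such a protocol: messages travel through a shared-memory ring, and an IPI acts as the doorbell. The ring layout and the mk_send_ipi() primitive are hypothetical; a real implementation would live inside the kernels.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 64
#define MSG_SIZE   128

/* Single-producer/single-consumer ring shared by two kernels. */
struct mk_ring {
    _Atomic uint32_t head;                  /* advanced by producer */
    _Atomic uint32_t tail;                  /* advanced by consumer */
    uint8_t slots[RING_SLOTS][MSG_SIZE];
};

/* Hypothetical primitive: raise an IPI on the target kernel. */
static void mk_send_ipi(int target_kernel)
{
    (void)target_kernel;                    /* stub for the sketch */
}

static int mk_ring_send(struct mk_ring *ring, int target,
                        const void *msg, size_t len)
{
    uint32_t head = atomic_load_explicit(&ring->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS || len > MSG_SIZE)
        return -1;                          /* full, or message too large */

    memcpy(ring->slots[head % RING_SLOTS], msg, len);

    /* Publish the slot before ringing the doorbell. */
    atomic_store_explicit(&ring->head, head + 1, memory_order_release);
    mk_send_ipi(target);
    return 0;
}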

6. Zero‑Downtime Kernel Updates

6.1 Limitations of Existing Approaches

Google’s Live Update Orchestrator (LUO) still incurs a stop‑the‑world pause during kexec, creating a downtime window.

6.2 Parallel Kernel Execution Model

Old and new kernel instances run concurrently on separate CPU sets. Applications migrate gradually, eliminating service interruption.

Parallel kernel launch: The original kernel spawns a new instance on dedicated cores while continuing normal operation.

Process‑level migration: Each process is checkpointed, its state (memory, file descriptors, registers) transferred, and restored in the new kernel.

TCP connection preservation: TCP_REPAIR captures and restores full socket state, enabling seamless connection handover (see the sketch after this list).

Device handoff: I/O devices are quiesced, state serialized, and transferred to the new kernel, which remaps DMA.

Rollback capability: If migration fails, the old kernel remains running, allowing a simple abort.
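
The checkpoint half of the connection handover can be sketched with the stock TCP_REPAIR interface, the same mechanism CRIU uses (requires CAP_NET_ADMIN); restore on the new kernel mirrors these calls with setsockopt() before a no-op connect().

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

/* Values from the kernel's linux/tcp.h, in case the libc headers
 * predate them. */
#ifndef TCP_REPAIR
#define TCP_REPAIR       19
#define TCP_REPAIR_QUEUE 20
#define TCP_QUEUE_SEQ    21
#endif
#ifndef TCP_RECV_QUEUE
#define TCP_RECV_QUEUE   1
#define TCP_SEND_QUEUE   2
#endif

/* Freeze an established socket and capture its sequence numbers. */
static int checkpoint_tcp_seqs(int fd, uint32_t *snd_seq, uint32_t *rcv_seq)
{
    int on = 1, queue;
    socklen_t len = sizeof(uint32_t);

    /* In repair mode the socket neither sends nor ACKs packets. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
        return -1;

    queue = TCP_SEND_QUEUE;
    if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue)) ||
        getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, snd_seq, &len))
        return -1;

    queue = TCP_RECV_QUEUE;
    if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue)) ||
        getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv_seq, &len))
        return -1;

    return 0;
}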

6.3 Comparison with LUO

LUO pauses all applications; our approach keeps them running.

LUO’s all‑or‑nothing state transfer risks whole‑system failure; our per‑process migration limits impact.

LUO requires complex rollback after kexec; our model aborts migration and continues with the old kernel.

LUO depends on extensive kernel‑subsystem modifications; we leverage existing checkpoint/restart mechanisms.

7. Fault‑Tolerant Parallel Backup Kernels

We run a lightweight backup kernel alongside the primary. The backup continuously mirrors critical state via shared memory and monitors the primary's health; a sketch of the monitoring loop follows the list below.

Automatic failover: On a primary crash, the backup assumes control with sub-second latency, using the mirrored state to restore processes and TCP connections.

Identical kernel versions: Eliminates compatibility issues and allows direct memory copying.
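
A conceptual sketch of the monitoring half, assuming a hypothetical shared heartbeat page written by the primary and a hypothetical take-over entry point; intervals and thresholds are illustrative.

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

/* Heartbeat word in memory shared by primary and backup; the
 * primary increments it on every timer tick. Hypothetical layout. */
struct mk_heartbeat {
    _Atomic uint64_t counter;
};

/* Hypothetical failover entry point: adopt mirrored state, restore
 * processes and TCP connections, take over devices. */
static void mk_take_over(void)
{
}

static void monitor(struct mk_heartbeat *hb)
{
    uint64_t last = atomic_load(&hb->counter);
    int stalls = 0;

    for (;;) {
        usleep(100 * 1000);                 /* poll every 100 ms */
        uint64_t now = atomic_load(&hb->counter);

        stalls = (now == last) ? stalls + 1 : 0;
        last = now;

        if (stalls >= 3) {                  /* ~300 ms of silence */
            mk_take_over();
            return;
        }
    }
}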

8. Security Model and Hardware Challenges

8.1 Kernel‑Enforced Isolation

Isolation is enforced by the kernels themselves rather than by a hypervisor: each kernel is trusted to respect its assigned resource boundaries. This trust model trades hardware-enforced isolation for a smaller software stack, a reduced attack surface, and high-performance fast paths free of virtualization overhead.

8.2 Hardware‑Assisted Enhancements

We propose using CHERI capabilities for fine‑grained memory protection and configurable IPI filtering to prevent arbitrary inter‑kernel interrupts.

Conclusion

The isolation‑based multi‑kernel architecture delivers dynamic resource management, reuse of proven Linux components, hardware‑queue isolation, zero‑downtime updates, and automatic fault recovery, offering a compelling third option between containers and full virtual machines for modern cloud workloads.
