Container Runtimes Explained: low‑level vs high‑level (containerd, CRI‑O, Docker)
This article outlines the architecture and functions of container runtimes, detailing low‑level, high‑level, and sandbox types, and compares major implementations such as runC, containerd, CRI‑O, and Docker, highlighting their components, image handling, networking, storage, and integration with Kubernetes.
In general, "runtime" refers either to the execution phase of a program's lifecycle or to the language‑specific environment that executes a program. Container runtimes play a similar role: they are the software components required to run and manage containers, they simplify secure execution and efficient deployment, and they are a key part of container management.
Common container runtimes include runC, containerd, and Docker. Runtimes are categorized into three types: low‑level runtimes, high‑level runtimes, and sandbox/virtualization runtimes.
Low‑level runtime : Provides basic isolation and lifecycle management using Linux cgroups and namespaces. Examples are runc (originally developed at Docker and now maintained under the OCI) and LXC. It is lightweight and high‑performance but lacks advanced features.
High‑level runtime : Builds on low‑level runtimes and adds richer features such as networking, storage, monitoring, image transfer, and management tools. Examples include Docker, containerd, and CRI‑O. It offers more functionality at the cost of increased complexity and resource usage.
Sandbox or virtualization runtime : Uses sandbox or virtualization technologies for isolation, offering stronger security but higher performance overhead. Examples are gVisor and Kata Containers.
Choosing a container runtime involves weighing these trade‑offs based on requirements.
Low‑level Container Runtime
A low‑level container runtime implements the OCI runtime spec: it receives a rootfs and a config.json and runs an isolated process. It only runs the process in an isolated resource space and does not provide storage or network implementations; those must be supplied externally, for example via CNI or CSI. This leads to the following limitations:
Only understands rootfs and config.json, not images; cannot build, push, or pull images.
Does not provide network implementation; external CNI plugins are needed.
Does not provide persistent storage; users must mount host directories or use CSI.
Bound to specific OS (e.g., runc only on Linux, runhcs only on Windows).
Runtimes that address one or more of these limitations are considered high‑level runtimes.
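The input a low‑level runtime expects can be sketched as follows. This is a minimal, illustrative Python sketch that writes a bundle directory containing a config.json next to an empty rootfs directory. The helper names `minimal_oci_config` and `write_bundle`, the hostname, and the command are assumptions for illustration; a usable bundle needs a populated rootfs and usually more fields (mounts, capabilities, and so on).

```python
import json
import os
import tempfile


def minimal_oci_config(args, rootfs="rootfs"):
    """Return a small subset of an OCI runtime config as a dict (illustrative)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
        },
        "root": {"path": rootfs, "readonly": True},
        "hostname": "demo",  # hypothetical value
        "linux": {
            # Each entry asks the runtime to create a new namespace of that type.
            "namespaces": [{"type": t} for t in ("pid", "network", "ipc", "uts", "mount")]
        },
    }


def write_bundle(bundle_dir, args):
    """Hypothetical helper: lay out <bundle>/rootfs and <bundle>/config.json."""
    os.makedirs(os.path.join(bundle_dir, "rootfs"), exist_ok=True)
    path = os.path.join(bundle_dir, "config.json")
    with open(path, "w") as f:
        json.dump(minimal_oci_config(args), f, indent=2)
    return path


bundle = tempfile.mkdtemp(prefix="bundle-")
config_path = write_bundle(bundle, ["sh", "-c", "echo hello"])
print(config_path)
```

With a real root filesystem unpacked into rootfs, a runtime such as runc would typically be pointed at this directory with its `--bundle` option. Note that nothing in the bundle refers to an image, a network, or a volume, which is exactly the gap the limitations above describe.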
The First Task of a High‑level Runtime
The primary task of a high‑level runtime is to bridge the OCI image spec and runtime spec, efficiently converting an image into a rootfs and config.json. This involves pulling the image manifest, downloading layers, extracting them, and merging layers into a rootfs.
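The merge step can be illustrated with a toy model. The sketch below treats each extracted layer as a Python dict mapping paths to contents and applies layers from base to top, honoring the `.wh.` whiteout file convention from the OCI image layer spec. It is a simplified illustration under those assumptions, not how any particular runtime implements it: real runtimes work on tar archives and union filesystems, and opaque whiteouts are not handled here.

```python
WHITEOUT_PREFIX = ".wh."


def apply_layer(rootfs, layer):
    """Apply one layer (path -> content) on top of rootfs and return the result."""
    merged = dict(rootfs)
    # Pass 1: whiteout entries hide paths that come from lower layers.
    for path in layer:
        directory, _, name = path.rpartition("/")
        if name.startswith(WHITEOUT_PREFIX):
            hidden = name[len(WHITEOUT_PREFIX):]
            target = f"{directory}/{hidden}" if directory else hidden
            for existing in list(merged):
                if existing == target or existing.startswith(target + "/"):
                    del merged[existing]
    # Pass 2: regular entries add new files or replace existing ones.
    for path, content in layer.items():
        if not path.rpartition("/")[2].startswith(WHITEOUT_PREFIX):
            merged[path] = content
    return merged


def build_rootfs(layers):
    """Merge layers in order, base layer first."""
    rootfs = {}
    for layer in layers:
        rootfs = apply_layer(rootfs, layer)
    return rootfs


layers = [
    {"etc/hostname": "base", "usr/bin/tool": "v1", "tmp/cache/a": "x"},
    {"usr/bin/tool": "v2", "tmp/.wh.cache": ""},  # replaces tool, deletes tmp/cache
    {"app/main": "binary"},
]
rootfs = build_rootfs(layers)
print(sorted(rootfs))
```

Upper layers win on conflicts, and a whiteout in an upper layer removes a path that a lower layer contributed.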
To convert image layers into a rootfs, each layer is unpacked into a filesystem layer (fs layer) and the fs layers are then merged. This raises three problems: how to index the fs layers, how to maintain parent‑child relationships between them, and how to prevent writes to layers that are shared by several containers.
The first problem is solved by using the diffID from the image config to generate a unique fs layer ID.
The second problem is solved by storing the parent fs layer ID in each layer's index.
The third problem is solved by UnionFS copy‑on‑write, which creates an upper work layer that receives all writes.
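One way to derive a unique fs layer ID from diffIDs is the chain ID defined in the OCI image spec, where the ID of a layer depends on its own diffID and on everything beneath it. The sketch below assumes that scheme and builds a small parent index from it; the function names and the sample diffIDs are made up for illustration.

```python
import hashlib


def chain_ids(diff_ids):
    """Compute chain IDs from an ordered list of diffIDs (base layer first)."""
    result = []
    parent = None
    for diff_id in diff_ids:
        if parent is None:
            chain = diff_id  # the base layer's chain ID is its diffID
        else:
            digest = hashlib.sha256(f"{parent} {diff_id}".encode()).hexdigest()
            chain = f"sha256:{digest}"
        result.append(chain)
        parent = chain
    return result


def layer_index(diff_ids):
    """Map each chain ID to its parent chain ID (None for the base layer)."""
    ids = chain_ids(diff_ids)
    return {cid: (ids[i - 1] if i else None) for i, cid in enumerate(ids)}


# Made-up diffIDs standing in for the rootfs.diff_ids list of an image config.
diff_ids = [
    "sha256:" + hashlib.sha256(name).hexdigest()
    for name in (b"layer-0", b"layer-1", b"layer-2")
]
for chain_id, parent_id in layer_index(diff_ids).items():
    print(chain_id[:19], "<-", parent_id[:19] if parent_id else None)
```

Because each ID folds in its parent's ID, the same diff applied on top of two different bases yields two different fs layers, while identical layer stacks in different images resolve to the same IDs and can be shared.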
High‑level runtimes also manage container metadata (ID, image info, low‑level runtime description, OCI spec, work layer ID, labels) to coordinate process management, logging, and recovery.
containerd
containerd is a highly modular high‑level runtime in which all modules are loaded as RPC services (gRPC or TTRPC) and are pluggable. This design gives it strong cross‑platform capabilities and makes it easy to embed into other software, though it incurs RPC overhead between modules.
Key modules:
Content : Indexes image layers by hash, supports fast lookup, and stores labels in boltDB.
Images : Stores reference‑to‑manifest mappings, building complete image information.
Snapshot : Stores and processes extracted fs layers and container work layers, supporting multiple UnionFS drivers.
Containers : Indexes container metadata (low‑level runtime, snapshot key, image reference, etc.) by container ID.
Diff : Computes diffIDs for image layers and validates them against the image config.
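The idea behind the Content and Images modules, blobs indexed by hash plus a separate reference‑to‑manifest mapping, can be sketched with a toy in‑memory store. This is not containerd's API: the class name, the label key, and the image reference below are hypothetical, and a dict stands in for the on‑disk blobs and boltDB metadata.

```python
import hashlib


class ContentStore:
    """Toy in-memory content store: blobs are keyed by their sha256 digest."""

    def __init__(self):
        self._blobs = {}
        self._labels = {}

    def write(self, data, labels=None):
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical content is stored once
        self._labels.setdefault(digest, {}).update(labels or {})
        return digest

    def read(self, digest):
        data = self._blobs[digest]
        # Re-hash on read so corrupted content is detected.
        if "sha256:" + hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"content mismatch for {digest}")
        return data

    def labels(self, digest):
        return dict(self._labels.get(digest, {}))


store = ContentStore()
d1 = store.write(b"layer bytes", {"example.role": "layer"})  # hypothetical label
d2 = store.write(b"layer bytes")                             # same content, same key
manifest_digest = store.write(b'{"layers": ["..."]}')

# Images-style mapping: a human-readable reference points at a manifest digest.
images = {"registry.example/demo:latest": manifest_digest}
print(d1 == d2, images)
```

Keying by digest gives deduplication and integrity checking for free, which is why the reference‑to‑manifest mapping can stay a thin layer on top.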
containerd provides namespace isolation by placing module data in separate directory trees, allowing it to serve both Docker and Kubernetes simultaneously.
The Tasks module (runtime.PlatformRuntime) manages container processes and interacts with low‑level runtimes. It supports Linux and, since version 1.2.0, Windows via platform‑agnostic code.
When running a container, containerd creates a container object from the Images and Snapshot modules, generates Mount objects, and passes them along with low‑level runtime info and OCI spec to the Tasks module, which launches a containerd‑shim process. The shim mounts the rootfs using UnionFS and invokes the low‑level runtime to start the container process.
Using a shim process isolates container processes from containerd; if containerd restarts, it can quickly re‑establish communication with the shim via its socket, enabling fast recovery.
CRI‑O
CRI‑O implements high‑level runtime functionality using several Go libraries rather than RPC protocols. Its core libraries include containers/image for image downloading and containers/storage for image and container metadata management.
LayerStore handles both image layers and fs layers, creating fs layer directories via a driver (e.g., overlay) and applying diffs.
ImageStore manages image metadata (manifest, config, signature) and indexes layers.
ContainerStore manages container metadata, similar to ImageStore but read‑only.
CRI‑O stores namespace information as JSON in a metadata field, allowing it to serialize and deserialize namespace data.
When creating a container, CRI‑O ensures the image exists, creates a UnionFS from the top layer, generates an OCI config.json, and then uses the specified low‑level runtime (via RuntimeHandler). For non‑VM runtimes it launches conmon to manage the low‑level runtime and provide a socket for communication and logging. For VM runtimes it delegates to a containerd‑shim process.
Networking is provided by CNI plugins; CRI‑O passes the container's network namespace path to CNI, which configures interfaces and IP addresses. Hostname and DNS are configured via file mounts rather than CNI.
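The hand‑off to CNI can be sketched as follows. Per the CNI spec, a runtime executes a plugin binary with a set of `CNI_*` environment variables and passes the network configuration as JSON on stdin. The sketch only builds those inputs and does not run a plugin; the network name, subnet, plugin directory, and container ID are assumed example values, and this is not CRI‑O's actual code path.

```python
import json


def cni_invocation(container_id, netns_path, ifname="eth0", plugin_dir="/opt/cni/bin"):
    """Build the environment and stdin payload for a CNI ADD call (not executed)."""
    env = {
        "CNI_COMMAND": "ADD",
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,   # the network namespace path the runtime hands over
        "CNI_IFNAME": ifname,      # interface name to create inside the namespace
        "CNI_PATH": plugin_dir,    # where plugin binaries are looked up
    }
    network_config = {
        "cniVersion": "1.0.0",
        "name": "demo-net",        # hypothetical network name
        "type": "bridge",          # name of the plugin binary to execute
        "bridge": "cni0",
        "isGateway": True,
        "ipMasq": True,
        "ipam": {"type": "host-local", "subnet": "10.88.0.0/16"},  # example subnet
    }
    return env, json.dumps(network_config)


env, stdin_payload = cni_invocation("abc123", "/var/run/netns/demo")
print(env["CNI_NETNS"])
print(stdin_payload)
```

A runtime would then execute the `bridge` binary found under `CNI_PATH` with this environment and payload, and read the resulting interfaces and IP addresses from the plugin's stdout.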
Docker
Docker Engine is a comprehensive high‑level runtime consisting of the dockerd daemon, a REST API, and the Docker CLI. It handles image building, distribution, storage, and networking.
Image building converts a directory and Dockerfile into image layers and a config, producing a tag reference. Each Dockerfile instruction may generate intermediate images or containers, which are cached for reuse. The docker build command orchestrates this process.
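The caching behaviour can be illustrated with a deliberately simplified model in which each step's cache key is derived from its parent layer and the instruction text. This is an assumption made for illustration: the function names are made up, and a real builder also takes other inputs into account, such as the contents of copied files.

```python
import hashlib


def cache_key(parent_id, instruction):
    """Derive a key from the parent layer ID and the instruction text."""
    payload = f"{parent_id}\n{instruction}".encode()
    return hashlib.sha256(payload).hexdigest()


def build(instructions, cache):
    """Return (final image id, number of cache hits) for a list of instructions."""
    parent, hits = "scratch", 0
    for instruction in instructions:
        key = cache_key(parent, instruction)
        if key in cache:
            hits += 1
        else:
            cache[key] = key  # stand-in for running the step and committing a layer
        parent = cache[key]
    return parent, hits


cache = {}
dockerfile = ["FROM alpine", "RUN apk add curl", "COPY app /app"]
_, first_hits = build(dockerfile, cache)                             # cold cache
_, second_hits = build(dockerfile[:2] + ["COPY app2 /app"], cache)   # last step changed
print(first_hits, second_hits)
```

Because every key includes the parent ID, changing an early instruction invalidates all later steps, while changing only the last instruction lets the earlier steps be reused.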
Docker stores only extracted fs layers and the necessary indexes; it does not keep raw tarballs or manifests. An image download fetches the manifest, then downloads and extracts the layers in parallel, indexing them by layer chain ID, diff ID, and reference.
Container creation generates a writable layer, merges it with fs layers via UnionFS to form the rootfs, translates image config into an OCI runtime spec, and creates auxiliary files (hosts, hostname, resolv.conf, shim socket). The resulting spec and low‑level runtime info are sent to containerd, which then runs the container similarly to other runtimes.
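The translation from image config to runtime spec can be sketched for the process section. The field names on the input side (`Entrypoint`, `Cmd`, `Env`, `WorkingDir`) come from the OCI image config, and those on the output side (`args`, `env`, `cwd`) from the OCI runtime spec. The function name and sample values are hypothetical, and the sketch leaves out `User`, which needs a lookup in the container's passwd file to become a uid and gid.

```python
def runtime_process_from_image(image_config, override_cmd=None):
    """Map an OCI image config onto the process section of a runtime spec."""
    cfg = image_config.get("config", {})
    entrypoint = cfg.get("Entrypoint") or []
    # A command supplied at container creation replaces the image's Cmd.
    cmd = override_cmd if override_cmd is not None else (cfg.get("Cmd") or [])
    args = list(entrypoint) + list(cmd)
    if not args:
        raise ValueError("image defines no Entrypoint or Cmd")
    return {
        "args": args,
        "env": list(cfg.get("Env") or []),
        "cwd": cfg.get("WorkingDir") or "/",
    }


image_config = {
    "config": {
        "Entrypoint": ["/bin/server"],
        "Cmd": ["--port", "80"],
        "Env": ["PATH=/usr/bin:/bin"],
        "WorkingDir": "/srv",
    }
}
process = runtime_process_from_image(image_config)
print(process["args"])
```

The resulting dict corresponds to the `process` object of config.json; the rootfs path, mounts, namespaces, and hooks would be filled in separately.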
Docker provides rich storage abstractions (bind mounts, volumes, tmpfs) and network drivers. On the default bridge network, endpoints are allocated IP addresses from the docker0 bridge's subnet, and port publishing adds iptables rules and may start a userland proxy. Network setup is triggered by a pre‑start hook in the OCI spec: the low‑level runtime (e.g., runC) invokes the hook once the container's network namespace exists, so that Docker can attach that namespace to the configured network.
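Address allocation and port publishing on a bridge network can be sketched with a toy allocator. The class and function names are made up, the 172.17.0.0/16 subnet is assumed as the bridge's range, and the iptables arguments only approximate the kind of DNAT rule involved; nothing is executed and this is not Docker's implementation.

```python
import ipaddress


class BridgeIPAM:
    """Toy allocator handing out addresses from a bridge subnet in order."""

    def __init__(self, subnet="172.17.0.0/16"):
        self.network = ipaddress.ip_network(subnet)
        hosts = iter(self.network.hosts())
        self.gateway = next(hosts)  # first usable address is kept for the bridge
        self._free = hosts          # remaining addresses, handed out sequentially
        self._assigned = {}

    def allocate(self, endpoint_id):
        """Return the endpoint's address, assigning a new one on first use."""
        if endpoint_id not in self._assigned:
            self._assigned[endpoint_id] = next(self._free)
        return self._assigned[endpoint_id]


def dnat_rule(host_port, container_ip, container_port, proto="tcp"):
    """Approximate iptables arguments for publishing a port (not executed)."""
    return [
        "iptables", "-t", "nat", "-A", "DOCKER",
        "-p", proto, "--dport", str(host_port),
        "-j", "DNAT", "--to-destination", f"{container_ip}:{container_port}",
    ]


ipam = BridgeIPAM()
web_ip = ipam.allocate("web")
rule = dnat_rule(8080, web_ip, 80)
print(ipam.gateway, web_ip)
print(" ".join(rule))
```

Traffic arriving on the published host port is rewritten to the endpoint's bridge address, which is why the address must be allocated before the rule can be added.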
Summary
The article presented the fundamentals of container runtimes, comparing low‑level and high‑level implementations and describing containerd, CRI‑O, and Docker in detail. Docker, as the most feature‑rich runtime, sits atop containerd, which itself evolved from Docker’s low‑level components. The ecosystem increasingly favors modular, reusable components adhering to OCI standards, enabling developers to integrate custom isolation, networking, or storage solutions with minimal duplication.
containerd is designed to be embedded into a larger system, rather than being used directly by developers or end‑users.
Source: https://www.zeng.dev/post/2020-container-runtimes/
