Why We Dropped Docker: A Full Production Migration to Containerd
This article recounts how our team, after repeated Docker daemon failures on a 500‑node Kubernetes cluster, performed a zero‑downtime migration to Containerd, detailing architectural differences, preparation steps, migration procedures, performance benchmarks, post‑migration adjustments, common pitfalls, and best practices for large‑scale production environments.
From Docker to Containerd: A Complete Production Runtime Migration Review
Introduction: Why We Abandoned Docker
In early 2025, the on‑call team was woken by the third Docker daemon crash, which took down every container on the affected node. Running a Kubernetes cluster of more than 500 nodes, we decided it was time to replace Docker, for four reasons:
The Docker daemon is a single point of failure; when it crashes, every container on the node is affected.
The architecture is bloated and adds unnecessary performance overhead.
Long call chains make troubleshooting harder.
High resource consumption becomes critical in resource‑constrained environments.
This article walks through a complete, zero‑downtime migration from Docker to Containerd and shares the benefits and lessons learned.
1. Understanding the Essence: Docker vs Containerd Architecture Comparison
1.1 Docker’s "Luxury Package" Model
Docker offers a full‑stack container platform. Its architecture resembles a layered stack:
user command (docker run)
↓
Docker CLI
↓
Docker Daemon (dockerd)
↓
Containerd
↓
containerd-shim
↓
runC
↓
Linux Kernel (namespaces, cgroups)
The Docker daemon handles image management, networking, storage drivers, API requests, logging, and more. While this "all‑in‑one" design lowers the entry barrier, it becomes a performance bottleneck and a stability risk in production.
1.2 Containerd’s "Minimalist" Philosophy
Containerd adopts a lean design:
Container Runtime Interface (CRI)
↓
Containerd
↓
containerd-shim
↓
runC
↓
Linux Kernel
This stack removes the heavyweight Docker daemon (dockerd) layer; Containerd focuses only on:
Container lifecycle management
Image distribution
Storage management
Network interface
The benefits are immediate:
Lower memory usage: our tests showed ~40% lower memory consumption than with Docker.
Faster container start: average start time dropped by about 30%.
Higher stability: the Docker daemon is no longer a single point of failure.
2. Preparation Before Migration: Assessment and Planning
2.1 Current State Assessment Checklist
We spent two weeks evaluating the environment. The checklist includes:
Assessment Items:
- Kubernetes version >= 1.20
- Node OS: Ubuntu 20.04 / CentOS 8+
- Kernel version >= 4.19
- Record existing Docker version for rollback
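The version floors in the list above can be checked mechanically. The following is a minimal sketch, assuming GNU `sort -V` is available on the node; `version_ge` and `check_min` are hypothetical helper names introduced here for illustration, not part of any existing tool.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME ACTUAL MINIMUM: print a PASS/FAIL line for one checklist item.
check_min() {
  local name=$1 actual=$2 minimum=$3
  # Strip a leading "v" and any "-suffix" (e.g. v1.26.3 or 5.4.0-150-generic).
  actual=${actual#v}
  actual=${actual%%-*}
  if version_ge "$actual" "$minimum"; then
    echo "PASS $name $actual >= $minimum"
  else
    echo "FAIL $name $actual < $minimum"
  fi
}
```

On a node, one might call `check_min kernel "$(uname -r)" 4.19` and `check_min kubelet "$(kubelet --version | awk '{print $2}')" 1.20`; each call prints a single PASS or FAIL line that can be collected across the fleet.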
Dependency Checks:
- Scripts that call docker directly
- CI/CD pipelines using docker build
- Monitoring systems depending on Docker metrics
- Log collection depending on Docker log driver
2.2 Compatibility Test Matrix
A matrix was built to verify that key functionalities work with Containerd:
Test Item               | Docker Result  | Containerd Result | Compatibility | Solution
------------------------|----------------|-------------------|---------------|----------------------------
Container create/delete | Normal         | Normal            | ✅            | -
Public image pull       | Normal         | Normal            | ✅            | -
Private image pull      | Normal         | Needs config      | ⚠️            | Configure registry auth
Network                 | Docker network | CNI               | ⚠️            | Install CNI plugin
Container logs          | json-file      | stdout/stderr     | ⚠️            | Adjust log collector
GPU support             | nvidia-docker  | Needs config      | ⚠️            | Add nvidia‑container‑runtime
...
2.3 Performance Benchmark
Before migration, baseline performance was measured:
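# Container startup time test
# bash's `time` keyword reports on the shell's stderr, so group the command to capture it
for i in {1..100}; do
  { time docker run --rm alpine echo "test" > /dev/null; } 2>> docker_startup.log
done
# Concurrent creation test ({} keeps parallel from appending the number as a container argument)
parallel -j 50 docker run -d --name bench-{} nginx ::: {1..500}
# Memory usage test
systemctl status docker | grep Memory
ps -o rss= -C dockerd    # resident memory in KB, without matching the grep process itself
# CPU usage test
top -b -n 10 | grep dockerd > docker_cpu.log
Average container start time: 1.2 s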
Docker daemon memory usage: ~800 MB
CPU peak under high concurrency: up to 200%
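An average such as the 1.2 s figure above can be derived from the timing log. The sketch below assumes the log holds the default output of bash's `time` keyword (lines such as `real 0m1.234s`); `avg_real_seconds` is a hypothetical helper name, and the log format is an assumption rather than something the benchmark guarantees.

```shell
#!/usr/bin/env bash
# avg_real_seconds FILE: average the "real" lines written by bash's `time` keyword.
# Assumes the default format, e.g. "real    0m1.234s".
avg_real_seconds() {
  awk '
    $1 == "real" {
      split($2, parts, "m")          # parts[1] = minutes, parts[2] = "1.234s"
      sub(/s$/, "", parts[2])        # drop the trailing "s"
      total += parts[1] * 60 + parts[2]
      runs++
    }
    END {
      if (runs == 0) { print "no samples"; exit 1 }
      printf "%.2f\n", total / runs
    }
  ' "$1"
}
```

Running `avg_real_seconds docker_startup.log` before the migration and again on a Containerd node (with the equivalent command timed) gives two numbers that can be compared directly.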
3. Migration Implementation: Step‑by‑Step Guide and Pitfalls
3.1 Test Environment First
Never experiment directly in production. All steps were first validated in a staging cluster.
Step 1: Install Containerd
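# Ubuntu (containerd.io is published in Docker's apt repository, which must already be configured)
apt-get update
apt-get install -y containerd.io
# CentOS
yum install -y containerd.io
# Generate default config
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
Step 2: Critical Configuration Adjustments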
# /etc/containerd/config.toml
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
# Disable unnecessary services
disable_tcp_service = true
stream_server_address = "127.0.0.1"
stream_server_port = "0"
# Sandbox image (pause container)
sandbox_image = "registry.k8s.io/pause:3.9"
# Log configuration
max_container_log_line_size = 16384
max_concurrent_downloads = 5
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
Step 3: Private Registry Authentication
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.example.com".auth]
username = "admin"
password = "your-password"
# Or, instead of username/password, supply the base64 encoding of "username:password"
# auth = "base64(username:password)"
3.2 Kubernetes Cluster Adaptation
Modify kubelet to use Containerd:
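# /var/lib/kubelet/kubeadm-flags.env
# Note: --container-runtime=remote applies to kubelet versions before 1.27, where the flag was removed; newer kubelets need only the endpoint
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock"
Configure crictl: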
# /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
Verify the setup:
# Check Containerd status
crictl version
crictl info
crictl pull nginx:latest
crictl run container-config.json pod-config.json
3.3 Gray‑Scale Migration Strategy
#!/bin/bash
# Run on the node being migrated; assumes kubectl on this node can reach the cluster
set -euo pipefail
NODE=${1:?usage: $0 <node-name>}
# Drain node
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# Stop kubelet
systemctl stop kubelet
# Stop and uninstall Docker
systemctl stop docker
apt-get remove -y docker-ce docker-ce-cli
# Start Containerd
systemctl restart containerd
systemctl enable containerd
# Update kubelet config
sed -i 's/--container-runtime=docker/--container-runtime=remote/g' /var/lib/kubelet/kubeadm-flags.env
echo 'KUBELET_EXTRA_ARGS="--container-runtime-endpoint=unix:///run/containerd/containerd.sock"' > /etc/default/kubelet
# Restart kubelet
systemctl restart kubelet
# Wait for node ready
kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
# Uncordon node
kubectl uncordon "$NODE"
echo "Node $NODE migration completed"
3.4 Common Issues and Solutions
Issue 1: Image pull failure
# Check registry config
cat /etc/containerd/config.toml | grep -A 10 registry
# Test connectivity
curl -v https://your-registry.com/v2/
# View logs
journalctl -u containerd -f
Issue 2: Network connectivity
# Verify CNI plugin
ls /opt/cni/bin/
cat /etc/cni/net.d/*.conf
# Reinstall CNI
wget https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
mkdir -p /opt/cni/bin
tar -xvf cni-plugins-linux-amd64-v1.3.0.tgz -C /opt/cni/bin/
Issue 3: GPU container failure
# Add nvidia runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
4. Post‑Migration Operations Adjustments
4.1 Monitoring System Refactor
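The Prometheus job switched from the Docker exporter to the Containerd metrics endpoint:
# Prometheus job for Containerd
- job_name: 'containerd'
  metrics_path: /v1/metrics   # Containerd serves metrics here, not at Prometheus's default /metrics
  static_configs:
    - targets: ['localhost:1338']
Enable metrics in Containerd: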
# /etc/containerd/config.toml
[metrics]
address = "127.0.0.1:1338"
grpc_histogram = false
4.2 Log Collection Scheme Adjustment
Filebeat now reads logs from the pod directory instead of Docker’s path:
# Filebeat inputs
- type: log
  paths:
    - '/var/log/pods/*/*/*.log'
  symlinks: true
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
          - logs_path:
              logs_path: "/var/log/pods/"
4.3 CI/CD Process Refactor
Replace Docker commands with nerdctl (Docker‑compatible CLI for Containerd):
# Install nerdctl
wget https://github.com/containerd/nerdctl/releases/download/v1.5.0/nerdctl-1.5.0-linux-amd64.tar.gz
tar -xvf nerdctl-1.5.0-linux-amd64.tar.gz -C /usr/local/bin/
# Alias docker to nerdctl
alias docker='nerdctl'
# Common command mapping
# docker build → nerdctl build
# docker run → nerdctl run
# docker push → nerdctl push
# docker images → nerdctl images
For image builds, use BuildKit directly:
# Run BuildKit daemon
docker run --detach --rm --privileged \
  --name buildkitd \
  --publish 1234:1234 \
  moby/buildkit:latest \
  --addr tcp://0.0.0.0:1234
# Build image with buildctl
buildctl --addr tcp://localhost:1234 build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/myapp:latest,push=true
5. Performance Optimization and Tuning
5.1 Containerd Performance Tuning Parameters
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri"]
max_concurrent_downloads = 10
disable_snapshot_unpack = false
[plugins."io.containerd.grpc.v1.cri".containerd]
# overlayfs, consistent with Step 2 and section 5.3; the "native" snapshotter copies whole layers and is much slower
snapshotter = "overlayfs"
# discard_unpacked_layers belongs to the containerd table, not the top-level CRI table
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
NoPivotRoot = false
NoNewKeyring = false
[plugins."io.containerd.gc.v1.scheduler"]
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "0s"
startup_delay = "100ms"
5.2 System‑Level Optimization
# /etc/sysctl.d/99-containerd.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
fs.file-max = 2097152
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 8192
vm.max_map_count = 262144
vm.swappiness = 10
vm.overcommit_memory = 1
kernel.pid_max = 4194304
# /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
LimitRSS=infinity
Restart=always
RestartSec=5
5.3 Storage Driver Optimization
# Use overlayfs (recommended)
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
# The overlayfs snapshotter is configured under its own plugin ID
[plugins."io.containerd.snapshotter.v1.overlayfs"]
root_path = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
upperdir_label = false
6. Production Case Study
6.1 Large‑Scale Cluster Migration Case
Cluster size: 500+ nodes, >15,000 Pods, 50,000+ daily image pulls, SLA 99.99%.
# Migration timeline (simplified)
Phase       | Weeks | Work
------------|-------|-----------------------------------
Preparation | 1‑2   | Environment assessment, design
Testing     | 3‑4   | Verify in test cluster
Gray‑scale  | 5‑8   | Migrate 10% of nodes
Rollout     | 9‑12  | Migrate 50% of nodes
Finalizing  | 13‑14 | 100% migration completed
6.2 Migration Effect Data
Resource usage comparison:
Docker period:
- Docker daemon memory: 800 MB‑1.2 GB
- System CPU idle: ~65%
- Disk I/O wait: ~8%
Containerd period:
- Containerd memory: 200 MB‑400 MB
- System CPU idle: ~78%
- Disk I/O wait: ~3%
Container operation performance:
Operation | Docker (s) | Containerd (s) | Improvement
------------|-----------|----------------|------------
Create | 1.2 | 0.8 | 33%
Start | 0.8 | 0.5 | 37%
Pull image | 45 | 32 | 29%
Delete | 0.5 | 0.3 | 40%
Fault recovery improvement:
Under Docker, a daemon crash took an average of 15 minutes to recover from and affected 30‑50 Pods per node.
With Containerd, the crash rate fell by 90%; because containers keep running while Containerd restarts, recovery takes under 1 minute.
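The Improvement column in the operation table of this section is the relative reduction in time, (before - after) / before. A minimal sketch of that arithmetic follows; `improvement` is a hypothetical helper name, and the inputs are simply the table's rounded values.

```shell
#!/usr/bin/env bash
# improvement BEFORE AFTER: relative reduction in percent, to one decimal place.
improvement() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Recompute the table rows from their rounded timings.
improvement 1.2 0.8   # Create     -> 33.3
improvement 0.8 0.5   # Start      -> 37.5
improvement 45 32     # Pull image -> 28.9
improvement 0.5 0.3   # Delete     -> 40.0
```

Because the published timings are already rounded, the recomputed values can differ from the table by up to a percentage point.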
7. Pitfalls Summary and Best Practices
7.1 Ten Most Common Pitfalls
Image registry authentication errors: copying Docker’s config.json does not work; use the Containerd‑specific auth sections.
CNI plugin missing: Containerd does not install CNI automatically; install and configure it manually.
Log path changes: Docker's json-file logs live under /var/lib/docker/containers; with Containerd, container stdout/stderr is written under /var/log/pods/.
crictl command confusion: crictl is a debugging tool, not a Docker replacement; use nerdctl for Docker‑style commands.
cgroup driver mismatch: ensure both Containerd and kubelet use the systemd cgroup driver.
Private image pull failures: configure registry.mirrors and auth correctly.
Pause container image missing: use a compatible pause image from a trusted registry.
Missing monitoring metrics: replace docker_exporter with the Containerd metrics endpoint.
GPU support: configure nvidia‑container‑runtime manually; it is not auto‑detected.
Rollback plan absent: keep Docker data for at least a month to allow rollback.
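Two of the pitfalls above, the cgroup driver mismatch and the missing CNI plugins, lend themselves to a pre-flight check before a node is uncordoned. The sketch below is illustrative only: `cgroup_drivers_match` and `cni_ready` are hypothetical helper names, and the kubelet config path used in the usage note is the kubeadm default, which may differ in other setups.

```shell
#!/usr/bin/env bash
# cgroup_drivers_match CONTAINERD_TOML KUBELET_YAML
# Succeeds only when containerd sets SystemdCgroup = true and the kubelet
# config sets cgroupDriver: systemd.
cgroup_drivers_match() {
  local containerd_toml=$1 kubelet_yaml=$2
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$containerd_toml" || {
    echo "containerd: SystemdCgroup is not true in $containerd_toml"; return 1; }
  grep -Eq '^[[:space:]]*cgroupDriver:[[:space:]]*"?systemd"?' "$kubelet_yaml" || {
    echo "kubelet: cgroupDriver is not systemd in $kubelet_yaml"; return 1; }
  echo "cgroup drivers match (systemd)"
}

# cni_ready BIN_DIR CONF_DIR: both directories must exist and be non-empty.
cni_ready() {
  local dir
  for dir in "$1" "$2"; do
    [ -n "$(ls -A "$dir" 2>/dev/null)" ] || {
      echo "CNI: $dir is missing or empty"; return 1; }
  done
  echo "CNI plugins and config present"
}
```

A node-level check might then be `cgroup_drivers_match /etc/containerd/config.toml /var/lib/kubelet/config.yaml && cni_ready /opt/cni/bin /etc/cni/net.d`, run between restarting the kubelet and uncordoning the node.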
7.2 Best Practice Recommendations
1. Define a detailed rollback plan:
#!/bin/bash
set -euo pipefail
NODE=${1:?usage: $0 <node-name>}
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
systemctl stop kubelet
systemctl stop containerd
apt-get install -y docker-ce docker-ce-cli
systemctl start docker
sed -i 's/--container-runtime=remote/--container-runtime=docker/g' /var/lib/kubelet/kubeadm-flags.env
rm -f /etc/default/kubelet
systemctl restart kubelet
kubectl uncordon "$NODE"
2. Build comprehensive monitoring alerts:
# Prometheus alert rules
groups:
  - name: containerd
    rules:
      - alert: ContainerdDown
        expr: up{job="containerd"} == 0
        for: 5m
        annotations:
          summary: "Containerd is down on {{ $labels.instance }}"
      - alert: ContainerdHighMemory
        expr: process_resident_memory_bytes{job="containerd"} > 1073741824
        for: 10m
        annotations:
          summary: "Containerd memory usage is too high"
3. Standardize operational procedures (SOPs): daily health checks, fault‑troubleshooting flow, performance tuning guide, upgrade manual.
4. Continuous performance monitoring:
#!/bin/bash
LOG=/var/log/containerd-perf.log
while true; do
  echo "=== $(date) ===" >> "$LOG"
  ps aux | grep containerd | grep -v grep >> "$LOG"
  crictl ps | wc -l >> "$LOG"    # line count includes the header row
  # bash's `time` keyword reports on the shell's stderr, so group the command to capture it
  { time crictl ps > /dev/null; } 2>> "$LOG"
  sleep 60
done
8. Future Outlook: Evolution of Cloud‑Native Runtimes
8.1 Industry Trend Analysis
Standardization: OCI and CRI standards make runtimes interchangeable.
Lightweight: runtimes shed unnecessary components and focus on core capabilities.
Cloud‑Native: tighter integration with Kubernetes and other orchestration systems.
Security: a smaller attack surface reduces risk.
8.2 Emerging Technologies
1. Kata Containers – secure container runtime:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
2. gVisor – application‑level kernel isolation:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"
3. Wasm containers – next‑generation lightweight runtimes (WasmEdge, Wasmtime, Spin).
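Registering a runtime such as Kata or gVisor in config.toml does not by itself route any Pod to it; Kubernetes selects a runtime through a RuntimeClass object whose handler matches the runtime name in the Containerd config. The sketch below prints such a manifest; `runtime_class` is a hypothetical helper name, and the class names passed to it are illustrative.

```shell
#!/usr/bin/env bash
# runtime_class NAME HANDLER: print a RuntimeClass manifest whose handler
# matches a runtime name from config.toml (e.g. "kata" or "runsc").
runtime_class() {
  cat <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: $1
handler: $2
EOF
}
```

For example, `runtime_class gvisor runsc | kubectl apply -f -` would create the class, after which a Pod opts in by setting `runtimeClassName: gvisor` in its spec.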
8.3 Advice for Operations Engineers
Continuous learning: container technology evolves rapidly, so keep up to date.
Practice‑first: build test environments and experiment hands‑on.
Community involvement: join CNCF and related projects for first‑hand information.
Ops Community
A leading IT operations community where professionals share and grow together.
