
Why We Dropped Docker: A Full Production Migration to Containerd

This article recounts how our team, after repeated Docker daemon failures on a 500‑node Kubernetes cluster, performed a zero‑downtime migration to Containerd, detailing architectural differences, preparation steps, migration procedures, performance benchmarks, post‑migration adjustments, common pitfalls, and best practices for large‑scale production environments.

Ops Community

From Docker to Containerd: A Complete Production Runtime Migration Review

Introduction: Why We Abandoned Docker

In early 2025, the on‑call team was woken by the third Docker daemon crash, which took every container on the affected node down with it. With a Kubernetes cluster of more than 500 nodes to run, we decided it was time to replace Docker. Four problems drove the decision:

The Docker daemon is a single point of failure; when it crashes, every container on the node is affected.

The architecture is bloated, causing unnecessary performance overhead.

Complex call chains increase troubleshooting difficulty.

High resource consumption becomes critical on resource‑constrained nodes.

This article walks through the complete zero‑downtime migration from Docker to Containerd and shares the benefits and lessons learned.

1. Understanding the Essence: Docker vs Containerd Architecture Comparison

1.1 Docker’s "Luxury Package" Model

Docker offers a full‑stack container platform. Its architecture resembles a layered stack:

user command (docker run)
   ↓
Docker CLI
   ↓
Docker Daemon (dockerd)
   ↓
Containerd
   ↓
containerd-shim
   ↓
runC
   ↓
Linux Kernel (namespaces, cgroups)

The Docker daemon handles image management, networking, storage drivers, API requests, logging, etc. While this "all‑in‑one" design lowers the entry barrier, it becomes a performance bottleneck and stability risk in production.

1.2 Containerd’s "Minimalist" Philosophy

Containerd adopts a lean design:

kubelet (via the Container Runtime Interface, CRI)
   ↓
Containerd
   ↓
containerd-shim
   ↓
runC
   ↓
Linux Kernel

With Containerd, the kubelet talks to the runtime directly through CRI, so the dockerd layer disappears. Containerd itself focuses only on:

Container lifecycle management

Image distribution

Storage management

Network namespace setup (pod networking itself is delegated to CNI plugins)

The benefits are immediate:

Lower memory usage: in our tests, memory consumption was roughly 40% lower than with Docker.

Faster container start: average start time dropped by about 30%.

Higher stability: there is no Docker daemon acting as a single point of failure.

2. Preparation Before Migration: Assessment and Planning

2.1 Current State Assessment Checklist

We spent two weeks evaluating the environment. The checklist includes:

Assessment Items:
- Kubernetes version >= 1.20
- Node OS: Ubuntu 20.04 / CentOS 8+
- Kernel version >= 4.19
- Record existing Docker version for rollback
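
The version thresholds in this checklist can be checked mechanically. The sketch below is one possible approach, not the script we ran: `version_ge` and `kernel_version` are hypothetical helper names, and the comparison relies on `sort -V` being available on the node.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort) to order the two strings.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical helper: kernel release with distro suffixes such as
# "-generic" stripped, e.g. "5.15.0-91-generic" becomes "5.15.0".
kernel_version() {
    uname -r | sed 's/[^0-9.].*$//'
}

# Example usage on a node:
# if version_ge "$(kernel_version)" "4.19"; then
#     echo "kernel OK"
# else
#     echo "kernel older than 4.19"
# fi
```

The same `version_ge` call works for the Kubernetes threshold, for example `version_ge "1.24.3" "1.20"`.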

Dependency Checks:
- Scripts that call docker directly
- CI/CD pipelines using docker build
- Monitoring systems depending on Docker metrics
- Log collection depending on Docker log driver
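
Finding scripts that call docker directly is mostly a search problem. A minimal sketch, assuming the scripts live under a single directory tree; `find_docker_callers` is a hypothetical name and the subcommand list is only a starting point to tune for your own repositories.

```shell
# Hypothetical helper: list files under a directory that invoke the docker CLI.
# Matches the word "docker" followed by a common subcommand.
find_docker_callers() {
    dir=$1
    grep -rlE '(^|[^[:alnum:]_-])docker[[:space:]]+(run|build|exec|ps|pull|push|logs|inspect|stop|rm)([[:space:]]|$)' "$dir" 2>/dev/null | sort
}

# Example usage:
# find_docker_callers /opt/ops-scripts
```

Each hit still needs a manual look, since some calls (for example `docker build` in CI) are replaced by a different tool than runtime calls on the nodes.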

2.2 Compatibility Test Matrix

A matrix was built to verify that key functionalities work with Containerd:

Test Item               | Docker Result  | Containerd Result | Compatibility | Solution
------------------------|----------------|-------------------|---------------|-----------------------------
Container create/delete | Normal         | Normal            | ✅            | -
Public image pull       | Normal         | Normal            | ✅            | -
Private image pull      | Normal         | Needs config      | ⚠️            | Configure registry auth
Network                 | Docker network | CNI               | ⚠️            | Install CNI plugin
Container logs          | json-file      | stdout/stderr     | ⚠️            | Adjust log collector
GPU support             | nvidia-docker  | Needs config      | ⚠️            | Add nvidia‑container‑runtime
...

2.3 Performance Benchmark

Before migration, baseline performance was measured:

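# Container startup time test
# `time` reports on the shell's stderr, so redirect stderr of the whole group
for i in {1..100}; do
  { time docker run --rm alpine echo "test" > /dev/null; } 2>> docker_startup.log
done

# Concurrent creation test (containers are named bench-1 ... bench-500)
parallel -j 50 docker run -d --name bench-{} nginx ::: {1..500}

# Memory usage test
systemctl status docker | grep Memory
# Resident set size of dockerd in KiB (avoids matching the grep process itself)
ps -o rss= -C dockerd

# CPU usage test
top -b -n 10 | grep dockerd > docker_cpu.log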

Average container start time: 1.2 s

Docker daemon memory usage: ~800 MB

CPU peak under high concurrency: up to 200%
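
An average like the one above can be derived from the startup log. The sketch below assumes the log holds the default output of bash's `time` keyword (lines such as `real 0m1.234s`); `avg_real_seconds` is a hypothetical helper name.

```shell
# Hypothetical helper: average the "real" wall-clock times in a log that
# contains bash `time` output, and print the result in seconds.
avg_real_seconds() {
    awk '$1 == "real" {
             split($2, parts, "m")      # "0m1.234s" -> parts[1]="0", parts[2]="1.234s"
             sub(/s$/, "", parts[2])    # drop the trailing "s"
             total += parts[1] * 60 + parts[2]
             n++
         }
         END { if (n > 0) printf "%.3f\n", total / n }' "$1"
}

# Example usage:
# avg_real_seconds docker_startup.log
```

Running the same helper against a Containerd log produced after migration gives a like‑for‑like comparison.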

3. Migration Implementation: Step‑by‑Step Guide and Pitfalls

3.1 Test Environment First

Never experiment directly in production. All steps were first validated in a staging cluster.

Step 1: Install Containerd

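# Ubuntu (containerd.io is published in Docker's apt repository, which must be configured)
apt-get update
apt-get install -y containerd.io

# CentOS (containerd.io comes from Docker's yum repository)
yum install -y containerd.io

# Generate default config
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml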

Step 2: Critical Configuration Adjustments

# /etc/containerd/config.toml
version = 2

[plugins]
[plugins."io.containerd.grpc.v1.cri"]
# Disable unnecessary services
disable_tcp_service = true
stream_server_address = "127.0.0.1"
stream_server_port = "0"

# Sandbox image (pause container)
sandbox_image = "registry.k8s.io/pause:3.9"

# Maximum size of a single container log line, in bytes
max_container_log_line_size = 16384
# Parallel layer downloads per image pull
max_concurrent_downloads = 5

[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
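
Because `SystemdCgroup = true` only helps if the kubelet uses the same driver, a quick consistency check is worth running on each node. This is a sketch under assumptions: the helper names are hypothetical, and the kubelet path shown in the usage comment is the kubeadm default, which may differ on your nodes.

```shell
# Hypothetical check: succeeds when a containerd config enables the
# systemd cgroup driver for runc.
containerd_systemd_cgroup_enabled() {
    grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"
}

# Hypothetical check: succeeds when a kubelet config file sets
# cgroupDriver: systemd.
kubelet_systemd_cgroup_enabled() {
    grep -Eq '^[[:space:]]*cgroupDriver:[[:space:]]*"?systemd"?' "$1"
}

# Example usage on a node:
# containerd_systemd_cgroup_enabled /etc/containerd/config.toml \
#     && kubelet_systemd_cgroup_enabled /var/lib/kubelet/config.yaml \
#     || echo "cgroup driver mismatch"
```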

Step 3: Private Registry Authentication

# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.example.com".auth]
# Option A: username and password
username = "admin"
password = "your-password"

# Option B (instead of A): one token, the base64 encoding of "username:password"
# auth = "dXNlcjpwYXNz"   # output of: printf 'user:pass' | base64
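
The token form expects the actual base64 encoding of `username:password`, not the literal text. A minimal sketch of generating it; `registry_auth_token` is a hypothetical helper, and passing a real password as an argument exposes it to shell history, so treat this as an illustration only.

```shell
# Hypothetical helper: print the base64 encoding of "username:password"
# on a single line, suitable for the `auth` field.
registry_auth_token() {
    printf '%s:%s' "$1" "$2" | base64 | tr -d '\n'
}

# Example usage:
# registry_auth_token admin your-password
```

`printf` is used instead of `echo` so that no trailing newline ends up inside the encoded value.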

3.2 Kubernetes Cluster Adaptation

Modify kubelet to use Containerd:

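# /var/lib/kubelet/kubeadm-flags.env
# Note: --container-runtime was deprecated in Kubernetes 1.24 and removed in 1.27;
# on those versions set only --container-runtime-endpoint.
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock"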

Configure crictl:

# /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false

Verify the setup:

# Check Containerd status
crictl version
crictl info
crictl pull nginx:latest
crictl run container-config.json pod-config.json

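3.3 Gray‑Scale (Phased Rollout) Migration Strategy

Each node was migrated with the following script. It mixes kubectl and systemctl calls, so it must run on the node itself with a working kubeconfig: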

#!/bin/bash
# Abort on the first failed step so a half-migrated node is never uncordoned
set -euo pipefail

NODE=${1:?usage: $0 NODE}
# Drain node
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# Stop kubelet
systemctl stop kubelet
# Stop and uninstall Docker (the containerd.io package stays installed)
systemctl stop docker
apt-get remove -y docker-ce docker-ce-cli
# Start Containerd
systemctl restart containerd
systemctl enable containerd
# Update kubelet config
sed -i 's/--container-runtime=docker/--container-runtime=remote/g' /var/lib/kubelet/kubeadm-flags.env
echo 'KUBELET_EXTRA_ARGS="--container-runtime-endpoint=unix:///run/containerd/containerd.sock"' > /etc/default/kubelet
# Restart kubelet
systemctl restart kubelet
# Wait for node ready
kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
# Uncordon node
kubectl uncordon "$NODE"
echo "Node $NODE migration completed"
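
A node becoming Ready does not by itself prove that it switched runtimes. One way to confirm, sketched here with a hypothetical helper `is_containerd_runtime`, is to read the runtime version the node reports in its status.

```shell
# Hypothetical helper: succeeds when a containerRuntimeVersion string
# (for example "containerd://1.6.24") reports containerd.
is_containerd_runtime() {
    case "$1" in
        containerd://*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage against a live cluster (requires kubectl access):
# runtime=$(kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}')
# is_containerd_runtime "$runtime" || echo "Node $NODE still reports: $runtime"
```

A check like this fits between the wait and the uncordon step, so a node that is still on Docker is not returned to service unnoticed.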

3.4 Common Issues and Solutions

Issue 1: Image pull failure

# Check registry config
cat /etc/containerd/config.toml | grep -A 10 registry
# Test connectivity
curl -v https://your-registry.com/v2/
# View logs
journalctl -u containerd -f

Issue 2: Network connectivity

# Verify CNI plugin
ls /opt/cni/bin/
cat /etc/cni/net.d/*.conf
# Reinstall CNI
wget https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
# The target directory may not exist on a freshly migrated node
mkdir -p /opt/cni/bin
tar -xvf cni-plugins-linux-amd64-v1.3.0.tgz -C /opt/cni/bin/
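
Rather than eyeballing the `ls` output, the plugin directory can be compared against the binaries your network setup needs. A small sketch; `missing_cni_plugins` is a hypothetical helper, and the plugin list in the usage comment is an assumption that depends on your CNI configuration.

```shell
# Hypothetical helper: print the names of expected CNI plugin binaries
# that are missing or not executable in a directory.
missing_cni_plugins() {
    dir=$1
    shift
    for plugin in "$@"; do
        [ -x "$dir/$plugin" ] || printf '%s\n' "$plugin"
    done
}

# Example usage:
# missing_cni_plugins /opt/cni/bin bridge host-local loopback portmap
```

Empty output means every listed plugin is present.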

Issue 3: GPU container failure

# Add nvidia runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"

4. Post‑Migration Operations Adjustments

4.1 Monitoring System Refactor

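The Prometheus job was switched from the Docker exporter to Containerd's metrics endpoint:

# Prometheus job for Containerd
- job_name: 'containerd'
  # Containerd serves metrics under /v1/metrics, not the default /metrics
  metrics_path: /v1/metrics
  static_configs:
  - targets: ['localhost:1338']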

Enable metrics in Containerd:

# /etc/containerd/config.toml
[metrics]
address = "127.0.0.1:1338"
grpc_histogram = false

4.2 Log Collection Scheme Adjustment

Filebeat now reads logs from the pod directory instead of Docker’s path:

# Filebeat inputs
- type: log
  paths:
    - '/var/log/pods/*/*/*.log'
  symlinks: true
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
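          - logs_path:
              logs_path: "/var/log/pods/"
              # Required when reading from /var/log/pods/ rather than the container log path
              resource_type: "pod"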
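
The collector change is about more than the path: files under /var/log/pods/ use the CRI log format, where each line reads `TIMESTAMP STREAM TAG MESSAGE` (the tag is `F` for a full line, `P` for a partial one), instead of Docker's json-file objects. The sketch below splits such a line into its fields for illustration; `parse_cri_log` is a hypothetical helper, not part of Filebeat.

```shell
# Hypothetical helper: split CRI-format log lines read from stdin into
# "timestamp|stream|tag|message".
parse_cri_log() {
    awk '{
        ts = $1; stream = $2; tag = $3
        msg = ""
        if (NF >= 4) {
            msg = $0
            # Remove the three leading fields, keeping the message spacing intact.
            for (i = 0; i < 3; i++) sub(/^[^ ]* /, "", msg)
        }
        printf "%s|%s|%s|%s\n", ts, stream, tag, msg
    }'
}

# Example usage (the path pattern mirrors the Filebeat input):
# tail -n 5 /var/log/pods/*/*/*.log | parse_cri_log
```

Any downstream parser that previously decoded JSON per line has to be adjusted to this plain‑text layout.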

4.3 CI/CD Process Refactor

Replace Docker commands with nerdctl (Docker‑compatible CLI for Containerd):

# Install nerdctl
wget https://github.com/containerd/nerdctl/releases/download/v1.5.0/nerdctl-1.5.0-linux-amd64.tar.gz
tar -xvf nerdctl-1.5.0-linux-amd64.tar.gz -C /usr/local/bin/
# Alias docker to nerdctl
alias docker='nerdctl'
# Common command mapping
# docker build   → nerdctl build
# docker run    → nerdctl run
# docker push   → nerdctl push
# docker images → nerdctl images

For image builds, use BuildKit directly:

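# Run BuildKit daemon
# (uses the Docker CLI, so run it on a build host that still has Docker;
#  the daemon listens on plain TCP without TLS, so keep port 1234 off untrusted networks)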
docker run --detach --rm --privileged \
  --name buildkitd \
  --publish 1234:1234 \
  moby/buildkit:latest \
  --addr tcp://0.0.0.0:1234

# Build image with buildctl
buildctl --addr tcp://localhost:1234 build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/myapp:latest,push=true

5. Performance Optimization and Tuning

5.1 Containerd Performance Tuning Parameters

# /etc/containerd/config.toml
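[plugins."io.containerd.grpc.v1.cri"]
max_concurrent_downloads = 10

# NOTE: check this key against `containerd config default` for your version
disable_snapshot_unpack = false

[plugins."io.containerd.grpc.v1.cri".containerd]
# overlayfs is the recommended snapshotter (see 5.3); "native" copies every layer and is slower
snapshotter = "overlayfs"
# Drop compressed layer blobs once they are unpacked, to save disk space
discard_unpacked_layers = true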

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
NoPivotRoot = false
NoNewKeyring = false

[plugins."io.containerd.gc.v1.scheduler"]
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "0s"
startup_delay = "100ms"

5.2 System‑Level Optimization

# /etc/sysctl.d/99-containerd.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
fs.file-max = 2097152
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 8192
vm.max_map_count = 262144
vm.swappiness = 10
vm.overcommit_memory = 1
kernel.pid_max = 4194304

# /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
LimitRSS=infinity
Restart=always
RestartSec=5

5.3 Storage Driver Optimization

# Use overlayfs (recommended)
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"

# The overlayfs snapshotter is its own plugin, configured outside the CRI section
[plugins."io.containerd.snapshotter.v1.overlayfs"]
root_path = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
upperdir_label = false

6. Production Case Study

6.1 Large‑Scale Cluster Migration Case

Cluster size: 500+ nodes, >15,000 Pods, 50,000+ daily image pulls, SLA 99.99%.

# Migration timeline (simplified)
Phase       | Weeks | Work
------------|-------|-----------------------------------
Preparation | 1‑2   | Environment assessment, design
Testing     | 3‑4   | Verify in test cluster
Gray‑scale  | 5‑8   | Migrate 10% of nodes
Rollout     | 9‑12  | Migrate 50% of nodes
Finalizing  | 13‑14 | 100% migration completed
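
Turning a percentage target into a concrete node list can be scripted. The sketch below is an illustration rather than the tooling we used: `select_batch` is a hypothetical helper that takes the first N percent of whatever node list it is given, rounding up, and it assumes an integer percentage.

```shell
# Hypothetical helper: read node names on stdin (one per line) and print
# the first PERCENT of them, rounding up so a small list still yields one node.
select_batch() {
    percent=$1
    awk -v pct="$percent" '
        { nodes[NR] = $0 }
        END {
            count = int((NR * pct + 99) / 100)   # ceiling of NR * pct / 100
            for (i = 1; i <= count; i++) print nodes[i]
        }'
}

# Example usage (requires kubectl access):
# kubectl get nodes -o name | select_batch 10
```

Sorting or filtering the input first (for example by node pool) controls which nodes land in the early batches.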

6.2 Migration Effect Data

Resource usage comparison:

Docker period:
- Docker daemon memory: 800 MB‑1.2 GB
- System CPU idle: ~65%
- Disk I/O wait: ~8%

Containerd period:
- Containerd memory: 200 MB‑400 MB
- System CPU idle: ~78%
- Disk I/O wait: ~3%

Container operation performance:

Operation  | Docker (s) | Containerd (s) | Improvement
-----------|------------|----------------|------------
Create     | 1.2        | 0.8            | 33%
Start      | 0.8        | 0.5            | 38%
Pull image | 45         | 32             | 29%
Delete     | 0.5        | 0.3            | 40%
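
The Improvement column is the relative reduction in time, (Docker − Containerd) / Docker × 100. A minimal sketch for reproducing it from raw timings; `improvement_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: percentage reduction from BEFORE to AFTER,
# printed with one decimal place.
improvement_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Example usage with the "Pull image" row:
# improvement_pct 45 32
```

For instance, the Start row (0.8 s down to 0.5 s) works out to 37.5%.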

Fault recovery improvement:

With Docker, each daemon crash took about 15 minutes to recover from and affected 30‑50 Pods on the node.

After the move to Containerd, the runtime crash rate fell by 90%. Even when Containerd restarts, running containers keep running, and recovery takes less than 1 minute.

7. Pitfalls Summary and Best Practices

7.1 Ten Most Common Pitfalls

Image registry authentication errors: copying Docker’s config.json does not work; use the Containerd‑specific auth sections.

CNI plugin missing: Containerd does not install CNI plugins automatically; install and configure them manually.

Log path changes: Docker’s json-file logs live under /var/lib/docker/containers/; with Containerd, container stdout/stderr is written under /var/log/pods/.

crictl command confusion: crictl is a debugging tool, not a Docker replacement; use nerdctl for Docker‑style commands.

cgroup driver mismatch: make sure both Containerd and the kubelet use the systemd cgroup driver.

Private image pull failures: configure registry.mirrors and auth correctly.

Pause container image missing: use a compatible pause image from a trusted registry.

Missing monitoring metrics: replace docker_exporter with the Containerd metrics endpoint.

GPU support: configure nvidia‑container‑runtime manually; it is not auto‑detected.

Rollback plan absent: keep Docker data for at least a month so that a rollback remains possible.

7.2 Best Practice Recommendations

1. Define a detailed rollback plan:

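#!/bin/bash
# Abort on the first failed step so a broken node is never uncordoned
set -euo pipefail

NODE=${1:?usage: $0 NODE}
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
systemctl stop kubelet
systemctl stop containerd
apt-get install -y docker-ce docker-ce-cli
systemctl start docker
sed -i 's/--container-runtime=remote/--container-runtime=docker/g' /var/lib/kubelet/kubeadm-flags.env
# -f: do not fail if the override file was never created
rm -f /etc/default/kubelet
systemctl restart kubelet
kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
kubectl uncordon "$NODE"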

2. Build comprehensive monitoring alerts:

# Prometheus alert rules
groups:
- name: containerd
  rules:
  - alert: ContainerdDown
    expr: up{job="containerd"} == 0
    for: 5m
    annotations:
      summary: "Containerd is down on {{ $labels.instance }}"
  - alert: ContainerdHighMemory
    expr: process_resident_memory_bytes{job="containerd"} > 1073741824
    for: 10m
    annotations:
      summary: "Containerd memory usage is too high"

3. Standardize operational procedures (SOPs): daily health checks, fault‑troubleshooting flow, performance tuning guide, upgrade manual.

4. Continuous performance monitoring:

#!/bin/bash
LOG=/var/log/containerd-perf.log
while true; do
  echo "=== $(date) ===" >> "$LOG"
  ps aux | grep containerd | grep -v grep >> "$LOG"
  crictl ps | wc -l >> "$LOG"
  # `time` writes to the shell's stderr, so redirect stderr of the whole group
  { time crictl ps > /dev/null; } 2>> "$LOG"
  sleep 60
done

8. Future Outlook: Evolution of Cloud‑Native Runtimes

8.1 Industry Trend Analysis

Standardization: the OCI and CRI standards make runtimes interchangeable.

Lightweight: runtimes shed non‑essential components and concentrate on core capabilities.

Cloud‑Native: tighter integration with Kubernetes and other orchestration systems.

Security: a smaller attack surface reduces risk.

8.2 Emerging Technologies

1. Kata Containers – secure container runtime:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"

2. gVisor – application‑level kernel isolation:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"

3. Wasm containers – next‑generation lightweight runtimes (WasmEdge, Wasmtime, Spin).

8.3 Advice for Operations Engineers

Continuous learning: container technology evolves rapidly, so keep up to date.

Practice‑first: build test environments and experiment hands‑on.

Community involvement: join CNCF and related projects for first‑hand information.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
