
Why Does Containerd’s PLEG Relisting Stall at Node Startup and How to Fix It

After replacing dockershim with containerd, we observed that pods took over a minute to start because the GenericPLEG Relisting operation stalled for more than 30 seconds during node boot. The stall occurred when containerd’s UpdateContainerResources was blocked on a bbolt lock while intensive image pulls were in progress. This article explains the root cause and describes a fix that uses the overlay volatile mount option.

Efficient Ops

Technical Background

In recent internal tests of replacing dockershim with containerd, we noticed that business containers take a long time to become runnable after the pod starts. The init container finishes within a second, but the main containers sometimes need more than a minute before they start executing.

Examining kubelet logs revealed that, when a node first boots, the PLEG (Pod Lifecycle Event Generator) Relisting method, which normally executes once per second, takes over 30 seconds to complete. After a few minutes the issue disappears and Relisting runs at the expected one‑second interval.

dockershim and CRI

Kubernetes 1.24 removed the dockershim component from kubelet, allowing users to choose containerd or CRI‑O as the container runtime. Containerd’s architecture evolved accordingly.

PLEG

PLEG (Pod Lifecycle Event Generator) runs on each node to keep the actual state of pods and containers in sync with the desired spec. It reduces unnecessary work during idle periods and lowers the number of concurrent requests to the container runtime.

Pod spec state

Container runtime state
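As a rough illustration of the relisting idea, the sketch below compares two snapshots of container states and emits an event only where something changed. It is a simplified model, not kubelet's implementation: the `Event` class, the `relist` function, and the state and event labels are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    container_id: str
    kind: str  # illustrative label, not kubelet's exact event type name


def relist(old: dict[str, str], new: dict[str, str]) -> list[Event]:
    """Compare two container-state snapshots and emit lifecycle events."""
    events = []
    for cid in sorted(old.keys() | new.keys()):
        before, after = old.get(cid), new.get(cid)
        if before == after:
            continue  # unchanged containers generate no work
        if after == "running":
            events.append(Event(cid, "started"))
        elif after == "exited":
            events.append(Event(cid, "died"))
        elif after is None:
            events.append(Event(cid, "removed"))
        else:
            events.append(Event(cid, "changed"))
    return events
```

Because each relist needs a fresh snapshot from the container runtime, any delay in the runtime's list call delays every event derived from it.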

ImagePull Process

The steps performed by ctr image pull are:

Resolve the image to be downloaded.

Pull the image from the registry, storing layers and config in the content service and metadata in the images service.

Unpack the layers into the snapshot service.

Note: the content and images services are gRPC services provided by containerd; during layer unpacking containerd temporarily mounts and unmounts all parent snapshots.
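The three steps can be modelled with plain dictionaries to show where the temporary mounts come from. This is a toy sketch under stated assumptions: `pull_image`, the dictionary layout, and the one-mount-cycle-per-unpacked-layer accounting are all illustrative simplifications and do not reflect containerd's real data structures or APIs.

```python
from __future__ import annotations


def pull_image(ref: str, registry: dict, content: dict,
               images: dict, snapshots: dict) -> int:
    """Toy pull; returns how many temporary mount/unmount cycles it performed."""
    # Step 1: resolve the reference to a manifest (a plain lookup here).
    manifest = registry[ref]

    # Step 2: layers and config go to the content store, metadata to the image store.
    for digest, blob in manifest["layers"]:
        content[digest] = blob
    content[manifest["config_digest"]] = manifest["config"]
    images[ref] = {
        "config": manifest["config_digest"],
        "layers": [digest for digest, _ in manifest["layers"]],
    }

    # Step 3: unpack each layer into a snapshot stacked on its parent.
    umounts = 0
    parent = None
    for digest, blob in manifest["layers"]:
        if digest not in snapshots:
            snapshots[digest] = {"parent": parent, "data": blob}
            umounts += 1  # assumed: one temporary mount, then one umount
        parent = digest
    return umounts
```

In this model a node that pulls many multi-layer images at boot performs many unmounts in a short window, which is the condition the rest of the article investigates.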

Problem Diagnosis

As described in the background, the GenericPLEG: Relisting call queries containerd through CRI to obtain the list of running containers, so containerd was the next place to look. Its logs show errors such as:

<code>containerd[2206]: {"error":"failed to stop container: failed to delete task: context deadline exceeded: unknown","level":"error","msg":"failed to handle container TaskExit event &TaskExit{ContainerID:...}"}</code>
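To find such entries in a large log, a small filter can help. The sketch below assumes each line has the shape shown above (a process prefix followed by a JSON object); the function names are hypothetical, and real logs may be formatted differently depending on how containerd logging is configured.

```python
import json


def parse_containerd_line(line: str):
    """Return the decoded JSON payload of a log line, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def is_deadline_error(entry) -> bool:
    """True for error-level entries whose error text mentions a deadline timeout."""
    return (
        isinstance(entry, dict)
        and entry.get("level") == "error"
        and "context deadline exceeded" in entry.get("error", "")
    )
```

Feeding each log line through `parse_containerd_line` and keeping those where `is_deadline_error` is true gives a quick view of how often task deletion timed out during boot.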

Goroutine dumps reveal one goroutine waiting on a Delete call and another stuck in an umount system call.

<code>goroutine 1654 [select]:
github.com/containerd/ttrpc.(*Client).dispatch(...)
... (stack trace omitted for brevity)</code>

Further investigation of containerd.log shows that UpdateContainerResources requests are blocked waiting for a bbolt lock:

<code>goroutine 1723 [semacquire]:
sync.runtime_SemacquireMutex(...)
... (stack trace omitted)</code>

The relevant source code resides in containerd/pkg/cri/server/container_update_resources.go, which holds the container status lock while updating resources. The ListContainers operation also needs this lock, so it blocks behind the pending update and PLEG stalls.
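The contention pattern can be reproduced in miniature. The sketch below is an analogy in Python rather than containerd's Go code: `status_lock` stands in for the container status lock, the `time.sleep` call stands in for the slow storage sync, and all function names are hypothetical.

```python
import threading
import time

status_lock = threading.Lock()  # stands in for the container status lock


def list_containers() -> float:
    """Acquire the shared lock as a list call would; return seconds spent waiting."""
    start = time.monotonic()
    with status_lock:
        pass
    return time.monotonic() - start


def measure_list_delay(hold_seconds: float) -> float:
    """Run a slow update in the background and time a concurrent list call."""
    holding = threading.Event()

    def update_container_resources() -> None:
        with status_lock:
            holding.set()             # signal that the lock is now held
            time.sleep(hold_seconds)  # stands in for the slow sync to storage

    updater = threading.Thread(target=update_container_resources)
    updater.start()
    holding.wait()
    waited = list_containers()
    updater.join()
    return waited
```

Calling `measure_list_delay(0.2)` returns roughly the hold time, while an uncontended `list_containers()` returns almost immediately. The list call does no slow work itself; it simply inherits the latency of whoever holds the lock.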

Because the lock is held while the bbolt database syncs data to storage, I/O pressure on the host can exacerbate the delay. Monitoring tools such as PSI or iostat can surface the pressure.
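For PSI, the kernel exposes I/O pressure as text in /proc/pressure/io on systems where PSI is enabled. A minimal parser, assuming the usual `some`/`full` line format, might look like this; the function names are hypothetical and the sample values used to test it are made up.

```python
from __future__ import annotations


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_io_pressure(path: str = "/proc/pressure/io") -> dict[str, dict[str, float]]:
    """Read and parse the I/O pressure file; requires a Linux kernel with PSI enabled."""
    with open(path, encoding="ascii") as handle:
        return parse_psi(handle.read())
```

Sampling `read_io_pressure()["full"]["avg10"]` during node boot and comparing it with the Relisting latency in the kubelet logs is one way to check whether the two rise together.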

Problem Fix

The community provided a fix in PR #8676: add the volatile mount option to the overlay filesystem. This option skips the sync call during umount, preventing the long pause.

Note: the volatile mount option allows overlayfs to avoid forced disk sync on unmount, reducing latency.

Applying the overlay volatile option mitigates the startup delay even when many images are pulled.
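To make the option concrete, the sketch below builds the option string an overlay mount would carry, with volatile appended as a bare flag. The helper name and the directory paths are hypothetical; how the option is actually enabled in containerd depends on the version and configuration, which this example does not cover.

```python
def overlay_options(lowerdirs, upperdir, workdir, volatile=False) -> str:
    """Build an overlayfs mount option string (paths are supplied by the caller)."""
    options = [
        "lowerdir=" + ":".join(lowerdirs),
        f"upperdir={upperdir}",
        f"workdir={workdir}",
    ]
    if volatile:
        options.append("volatile")  # flag-style option, takes no value
    return ",".join(options)
```

For example, `overlay_options(["/l2", "/l1"], "/u", "/w", volatile=True)` yields `lowerdir=/l2:/l1,upperdir=/u,workdir=/w,volatile`. The kernel's overlayfs documentation notes that volatile mounts are not guaranteed to survive a crash, so the option suits data that can be recreated, such as a snapshot that can be unpacked again.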

Disclaimer: the author’s time and perspective are limited; readers are encouraged to provide feedback and corrections.

Written by

Efficient Ops

