
Why Docker exec Fails: Diagnosing runc Errors and Resource Limits

This guide walks through a real‑world Docker exec failure, explains the relationship between kubelet, docker‑shim, containerd, and runc, shows step‑by‑step commands to isolate the faulty component, and reveals that a resource‑limit (pids) exhaustion in the container caused the runc exec error.

Liangxu Linux

Learning Focus

Relationship among kubelet, docker‑shim, dockerd, containerd, containerd‑shim, and runc

Investigation method: using docker, containerd‑ctr, and docker‑runc to connect to containers

runc execution workflow

Problem Description

During on‑call troubleshooting, the system log repeatedly showed Docker exec errors:

May 12 09:08:40 HOSTNAME dockerd[4085]: time="2021-05-12T09:08:40.642410594+08:00" level=error msg="stream copy error: reading from a closed fifo"
May 12 09:08:40 HOSTNAME dockerd[4085]: time="2021-05-12T09:08:40.642418571+08:00" level=error msg="stream copy error: reading from a closed fifo"
May 12 09:08:40 HOSTNAME dockerd[4085]: time="2021-05-12T09:08:40.663754355+08:00" level=error msg="Error running exec ...: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused \"read init-p: connection reset by peer\": unknown"

The logs indicate a failure in the exec path that must be investigated for potential business impact.

Investigation Steps

We only know that dockerd reported an exec failure. Before diving deeper, we review Docker's call chain.

[Figure: Docker call chain: kubelet → docker‑shim → dockerd → containerd → containerd‑shim → runc]

The chain is long and involves many components, so we split the investigation into two parts:

Identify the component that triggers the failure

Determine why that component failed

Identifying the Faulty Component

Experienced Docker users may spot the culprit right away, but we follow a systematic approach and enter the call chain one layer lower at each step:

# 1. Locate the problematic container
$ sudo docker ps | grep -v pause | grep -v NAMES | awk '{print $1}' | xargs -ti sudo docker exec {} sleep 1
sudo docker exec aa1e331ec24f sleep 1
OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "read init-p: connection reset by peer": unknown

# 2. Rule out dockerd by calling containerd directly
$ docker-containerd-ctr -a /var/run/docker/containerd/docker-containerd.sock -n moby t exec --exec-id stupig1 aa1e331ec24f621ab3152ebe94f1e533734164af86c9df0f551eab2b1967ec4e sleep 1
ctr: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "read init-p: connection reset by peer": unknown

# 3. Rule out containerd and containerd‑shim by calling runc directly
$ docker-runc --root /var/run/docker/runtime-runc/moby/ exec aa1e331ec24f621ab3152ebe94f1e533734164af86c9df0f551eab2b1967ec4e sleep 1
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
... (stack trace omitted for brevity) ...
exec failed: container_linux.go:348: starting container process caused "read init-p: connection reset by peer"

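The same failure reproduces at every layer, including a direct docker-runc call, so dockerd, containerd, and containerd‑shim are ruled out. The error originates from runc.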
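The elimination logic behind the three commands above can be sketched in Go. This is a minimal sketch, not tooling from the original investigation: the probe type and the deepestFailing function are hypothetical names, and the probe results are hard-coded from the transcript rather than collected by running the commands.

```go
package main

import "fmt"

// probe records whether the same exec still failed when the call chain
// was entered at the given component, bypassing everything above it.
// (Hypothetical type, introduced only for this illustration.)
type probe struct {
	entry  string // component the command talked to directly
	failed bool
}

// deepestFailing returns the entry point furthest down the chain whose
// probe still failed. Probes must be ordered from the top of the chain
// (dockerd) to the bottom (runc). It returns "" if the first probe passed.
func deepestFailing(probes []probe) string {
	culprit := ""
	for _, p := range probes {
		if !p.failed {
			break // the failure disappears from this layer down
		}
		culprit = p.entry
	}
	return culprit
}

func main() {
	// Results taken from the three commands in the transcript.
	probes := []probe{
		{"dockerd", true},    // docker exec
		{"containerd", true}, // docker-containerd-ctr ... t exec
		{"runc", true},       // docker-runc ... exec
	}
	fmt.Println("fault is at or below:", deepestFailing(probes))
}
```

If the containerd probe had succeeded while docker exec failed, the same function would point at dockerd instead, which is why the probes are ordered from the top of the chain downward.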

Finding the Root Cause

runc's output contains a more specific message: pthread_create failed: Resource temporarily unavailable. This is the EAGAIN error, which typically means a resource limit has been reached. Common candidates (per-process limits can be listed with ulimit -a) are:

Thread count limit reached

File descriptor limit reached

Memory limit reached

We therefore inspect the container's resource usage.

[Figure: Business container thread count]

The monitoring chart shows that the business containers have all reached 10,000 threads, which is the default per-container limit on Elastic Cloud. The limit is set deliberately so that a single container cannot exhaust the host's thread resources.

$ cat /sys/fs/cgroup/pids/kubepods/burstable/pod64a6c0e7-830c-11eb-86d6-b8cef604db88/aa1e331ec24f621ab3152ebe94f1e533734164af86c9df0f551eab2b1967ec4e/pids.max
10000

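Thus, the exec failure is caused by the container hitting its pids limit. The pids cgroup counts every task, whether process or thread. Once pids.max is reached, the kernel refuses to create new tasks in the cgroup with EAGAIN, and that includes the threads runc needs to carry out the exec.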
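The check above can be made explicit by comparing pids.current with pids.max in the same cgroup directory. This Go sketch is an illustration under assumptions: parsePidsMax, exhausted, and readValue are hypothetical helper names, and the cgroup directory is passed on the command line rather than discovered automatically.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parsePidsMax parses the content of pids.max, which is either a number
// or the literal "max" (no limit). No limit is reported as -1.
func parsePidsMax(s string) (int64, error) {
	s = strings.TrimSpace(s)
	if s == "max" {
		return -1, nil
	}
	return strconv.ParseInt(s, 10, 64)
}

// exhausted reports whether a cgroup holding `current` tasks has used up
// its limit, so that no further task can be created in it.
func exhausted(current, limit int64) bool {
	return limit >= 0 && current >= limit
}

// readValue reads one pids controller file from the cgroup directory.
func readValue(dir, name string) (string, error) {
	data, err := os.ReadFile(filepath.Join(dir, name))
	return string(data), err
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: pidscheck <cgroup directory containing pids.max>")
		return
	}
	dir := os.Args[1]
	rawMax, err := readValue(dir, "pids.max")
	if err != nil {
		fmt.Println("read pids.max:", err)
		return
	}
	rawCur, err := readValue(dir, "pids.current")
	if err != nil {
		fmt.Println("read pids.current:", err)
		return
	}
	limit, err := parsePidsMax(rawMax)
	if err != nil {
		fmt.Println("parse pids.max:", err)
		return
	}
	current, err := strconv.ParseInt(strings.TrimSpace(rawCur), 10, 64)
	if err != nil {
		fmt.Println("parse pids.current:", err)
		return
	}
	// A limit of -1 means pids.max contained "max" (unlimited).
	fmt.Printf("pids.current=%d pids.max=%d exhausted=%v\n", current, limit, exhausted(current, limit))
}
```

Run against the container's pids cgroup directory shown above, a container in the state described here would be expected to report exhausted=true.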

runc Workflow Overview

Even after pinpointing the error, understanding runc's internal workflow helps prevent similar issues.

Using runc exec as an example, the process is:

1. runc exec starts a child process, runc init.

2. runc init prepares the container namespaces. The process that ends up running inside them is a grandchild of runc exec.

3. runc exec adds the grandchild to the container's cgroup.

4. runc exec sends the exec configuration (command, args, env) to the grandchild.

5. The grandchild calls system.Execv to run the user command.

Notes:

Step 3 runs concurrently with the last sub-step of step 2 (2.c), in which the grandchild finishes its own initialization.

Communication between runc exec and runc init uses a socket pair (init‑p / init‑c).

[Figure: runc execution flow]

Relevant runc Source Code

func (p *setnsProcess) start() (err error) {
    defer p.parentPipe.Close()
    // Start runc init; it inherits the child end of the socket pair.
    err = p.cmd.Start()
    p.childPipe.Close()
    if err != nil {
        return newSystemErrorWithCause(err, "starting setns process")
    }
    // Send the bootstrap data that runc init needs to enter the namespaces.
    if p.bootstrapData != nil {
        if _, err := io.Copy(p.parentPipe, p.bootstrapData); err != nil {
            return newSystemErrorWithCause(err, "copying bootstrap data to pipe")
        }
    }
    // Wait for the namespace setup and learn the pid of the final process.
    if err = p.execSetns(); err != nil {
        return newSystemErrorWithCause(err, "executing setns process")
    }
    // Move that process into the container's cgroups. From this point on it
    // is subject to the container's limits, including pids.max.
    if len(p.cgroupPaths) > 0 {
        if err := cgroups.EnterPid(p.cgroupPaths, p.pid()); err != nil {
            return newSystemErrorWithCausef(err, "adding pid %d to cgroups", p.pid())
        }
    }
    // Send the exec configuration to the process.
    if err := utils.WriteJSON(p.parentPipe, p.config); err != nil {
        return newSystemErrorWithCause(err, "writing config to pipe")
    }
    // Read sync messages from the process; a read error on the pipe surfaces here.
    ierr := parseSync(p.parentPipe, func(sync *syncT) error {
        switch sync.Type {
        case procReady:
            panic("unexpected procReady in setns")
        case procHooks:
            panic("unexpected procHooks in setns")
        default:
            return newSystemError(fmt.Errorf("invalid JSON payload from child"))
        }
    })
    if ierr != nil {
        p.wait()
        return ierr
    }
    return nil
}

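The workflow and the code together explain the failure. Once runc exec has moved runc init into the container's cgroup, runc init is subject to the exhausted pids limit. Its Go runtime then cannot create a thread (pthread_create failed) and aborts, which closes its end of the socket pair. This is consistent with runc exec's subsequent read on init‑p failing with "connection reset by peer", the error that dockerd logged.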

References

https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt

https://github.com/opencontainers/runc
