
Why Do Readiness Probe Failures Show “OCI runtime exec failed: EOF” in Kubernetes?

A Kubernetes pod kept reporting readiness probe warnings with an OCI runtime exec failure. Tracing the call path through kubelet, dockershim, Docker, containerd, and runc showed a race: runc read the container state file while an update triggered by cpu-manager was rewriting it. Disabling cpu-manager or upgrading runc resolves the problem.

Ops Development Stories

Introduction

This is a record of a troubleshooting investigation. The source code analysis was written up by a developer colleague and is published with consent.

Problem

A customer reported a large number of warning events like the following, although the service remained accessible:

Readiness probe failed: OCI runtime exec failed: exec failed: EOF: unknown

Environment

Note: the customer enabled cpu-manager on the k8s node running the workload.

Component    Version
k8s          1.14.x

Investigation

1. After receiving the feedback, check the kubelet logs on the node where the pod runs:

I0507 03:43:28.310630 57003 prober.go:112] Readiness probe for "adsfadofadfabdfhaodsfa(d1aab5f0-ae8f-11eb-a151-080027049c65):c0" failed (failure): OCI runtime exec failed: exec failed: EOF: unknown
I0507 07:08:49.834093 57003 prober.go:112] Readiness probe for "adsfadofadfabdfhaodsfa(a89a158e-ae8f-11eb-a151-080027049c65):c0" failed (failure): OCI runtime exec failed: exec failed: unexpected EOF: unknown
I0507 10:06:58.307881 57003 prober.go:112] Readiness probe for "adsfadofadfabdfhaodsfa(d1aab5f0-ae8f-11eb-a151-080027049c65):c0" failed (failure): OCI runtime exec failed: exec failed: EOF: unknown

The probe result type is failure. The corresponding kubelet code:

[Figure: probe error code]

2. Check Docker logs:

time="2021-05-06T16:51:40.009989451+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2021-05-06T16:51:40.010054596+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2021-05-06T16:51:40.170676532+08:00" level=error msg="Error running exec 8e34e8b910694abe95a467b2936b37635fdabd2f7b7c464dfef952fa5732aa4e in container: OCI runtime exec failed: exec failed: EOF: unknown"

Docker also logs a stream copy error, but the root error is the EOF returned by the underlying runc. Because the probe result is failure, e.CombinedOutput() must have returned an exit error with a non-zero exit status rather than some other kind of error. With the Docker runtime, that call ends up in ExecInContainer.

[Figure: ExecInContainer flow]

[Figure: ExecSync via dockershim]

[Figure: dockershim ExecInContainer]

ExecInContainer implementation (excerpt):

func (*NativeExecHandler) ExecInContainer(client libdocker.Interface, container *dockertypes.ContainerJSON, cmd []string, stdin io.Reader, stdout, stderr io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize, timeout time.Duration) error {
    // Step 1: create the exec session (createOpts is defined in a part elided from this excerpt).
    execObj, err := client.CreateExec(container.ID, createOpts)
    if err != nil { return err }
    // Step 2: start the exec in attached mode and wire up I/O (execStarted is also defined in the elided part).
    startOpts := dockertypes.ExecStartCheck{Detach: false, Tty: tty}
    streamOpts := libdocker.StreamOptions{InputStream: stdin, OutputStream: stdout, ErrorStream: stderr, RawTerminal: tty, ExecStarted: execStarted}
    err = client.StartExec(execObj.ID, startOpts, streamOpts)
    if err != nil { return err }
    // Step 3: poll every 2 seconds until the exec has stopped, then check its exit code.
    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()
    for {
        inspect, err2 := client.InspectExec(execObj.ID)
        if err2 != nil { return err2 }
        if !inspect.Running {
            // A non-zero exit code is reported as a dockerExitError.
            if inspect.ExitCode != 0 { err = &dockerExitError{inspect} }
            break
        }
        <-ticker.C
    }
    return err
}

ExecInContainer performs three main steps:

Call CreateExec to create an ExecID.

Call StartExec to run the exec and redirect I/O.

Call InspectExec to obtain the running status and exit code.

The error printed in the logs comes from dockerd's response stream: dockerd writes the exec error into the response it returns.

[Figure: dockerd error handling]
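To make the create, start, inspect sequence and the role of the response stream concrete, here is a minimal sketch against a fake in-memory client. Everything in it is an assumption for illustration: execClient, fakeClient, and runExec are hypothetical names rather than the libdocker interface, and the exit code 126 is an assumed value.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// execClient is a hypothetical, minimal stand-in for the client calls
// used in the three steps; it is not the libdocker interface.
type execClient interface {
	CreateExec(containerID string, cmd []string) (string, error)
	StartExec(execID string, stdout io.Writer) error
	InspectExec(execID string) (running bool, exitCode int, err error)
}

// fakeClient simulates a daemon whose runtime fails to start the exec:
// the failure text is written into the output stream, and the exec is
// recorded with a non-zero exit code.
type fakeClient struct{ exitCode int }

func (f *fakeClient) CreateExec(containerID string, cmd []string) (string, error) {
	return "exec-1", nil
}

func (f *fakeClient) StartExec(execID string, stdout io.Writer) error {
	f.exitCode = 126 // assumed value, for illustration only
	_, err := io.WriteString(stdout, "OCI runtime exec failed: exec failed: EOF: unknown\r\n")
	return err
}

func (f *fakeClient) InspectExec(execID string) (bool, int, error) {
	return false, f.exitCode, nil
}

// runExec walks the three steps and returns the captured output and exit code.
func runExec(c execClient, containerID string, cmd []string) (string, int, error) {
	id, err := c.CreateExec(containerID, cmd) // step 1: create the exec
	if err != nil {
		return "", 0, err
	}
	var out bytes.Buffer
	if err := c.StartExec(id, &out); err != nil { // step 2: run it, capturing output
		return "", 0, err
	}
	running, code, err := c.InspectExec(id) // step 3: read status and exit code
	if err != nil {
		return "", 0, err
	}
	if running {
		return out.String(), 0, fmt.Errorf("exec %s still running", id)
	}
	return out.String(), code, nil
}

func main() {
	out, code, err := runExec(&fakeClient{}, "c0", []string{"cat", "/tmp/healthy"})
	fmt.Printf("output=%q exit=%d err=%v\n", out, code, err)
}
```

The point of the sketch is that StartExec itself returns no error here: the caller only learns about the failure from the text in the stream and from the exit code seen at inspect time.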

Further tracing shows that ExecStart eventually calls containerd code, which invokes runc. The runc exec fails with exec failed: EOF: unknown.

[Figure: runc execution path]

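Running runc exec repeatedly reproduces the issue sporadically. The cause is that runc exec reads the container's state.json, while the kubelet cpu-manager periodically updates the container (every 10 s by default), and each update rewrites state.json. The original saveState truncates the file and writes it in place, so a concurrent reader can see an empty or partially written file. The JSON decoder then fails with EOF or unexpected EOF, which matches the two error variants in the kubelet logs.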

[Figure: state.json race condition]
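The two error strings can be reproduced with Go's standard JSON decoder alone. The sketch below assumes the state file is read with a streaming decoder; the state struct and its fields are a hypothetical, trimmed-down stand-in for the real container state.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// state is a hypothetical, trimmed-down stand-in for the container state.
type state struct {
	ID  string `json:"id"`
	Pid int    `json:"init_process_pid"`
}

// decodeState decodes one JSON document from content with a streaming decoder.
func decodeState(content string) error {
	var s state
	return json.NewDecoder(strings.NewReader(content)).Decode(&s)
}

func main() {
	full := `{"id":"c0","init_process_pid":4242}`

	// A complete file decodes cleanly.
	fmt.Println("complete: ", decodeState(full)) // prints "complete:  <nil>"

	// A file that was just truncated and is still empty yields io.EOF.
	fmt.Println("truncated:", decodeState("")) // prints "truncated: EOF"

	// A file caught in the middle of being written yields io.ErrUnexpectedEOF.
	fmt.Println("partial:  ", decodeState(full[:12])) // prints "partial:   unexpected EOF"
}
```

Which of the two errors a reader hits depends only on whether it opens the file before any bytes have been written back or part-way through the write.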

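A related runc PR fixes the problem by making saveState atomic: the state is written to a temporary file in the same directory and then renamed over state.json, so a reader sees either the old file or the new one, never a partial one.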

// original saveState
func (c *linuxContainer) saveState(s *State) error {
    f, err := os.Create(filepath.Join(c.root, stateFilename))
    if err != nil { return err }
    defer f.Close()
    return utils.WriteJSON(f, s)
}

// fixed saveState
func (c *linuxContainer) saveState(s *State) (retErr error) {
    tmpFile, err := ioutil.TempFile(c.root, "state-")
    if err != nil { return err }
    defer func() {
        if retErr != nil {
            tmpFile.Close()
            os.Remove(tmpFile.Name())
        }
    }()
    err = utils.WriteJSON(tmpFile, s)
    if err != nil { return err }
    err = tmpFile.Close()
    if err != nil { return err }
    stateFilePath := filepath.Join(c.root, stateFilename)
    return os.Rename(tmpFile.Name(), stateFilePath)
}
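The same write-to-temp-then-rename technique can be tried outside runc. The following is a minimal sketch, assuming a POSIX filesystem where a rename within one directory atomically replaces the target; writeJSONAtomic, readJSON, and countReadErrors are hypothetical names introduced for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeJSONAtomic writes v to a temp file next to path, then renames it into place.
func writeJSONAtomic(path string, v interface{}) (retErr error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), "state-")
	if err != nil {
		return err
	}
	defer func() {
		if retErr != nil { // clean up the temp file on any failure
			tmp.Close()
			os.Remove(tmp.Name())
		}
	}()
	if err := json.NewEncoder(tmp).Encode(v); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// readJSON opens path and decodes a single JSON object from it.
func readJSON(path string) (map[string]int, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var m map[string]int
	err = json.NewDecoder(f).Decode(&m)
	return m, err
}

// countReadErrors rewrites a file `rounds` times in one goroutine while the
// caller reads it in a tight loop, and returns how many reads failed.
func countReadErrors(rounds int) (int, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "state.json")
	if err := writeJSONAtomic(path, map[string]int{"gen": 0}); err != nil {
		return 0, err
	}
	done := make(chan error, 1)
	go func() {
		for i := 1; i <= rounds; i++ {
			if err := writeJSONAtomic(path, map[string]int{"gen": i}); err != nil {
				done <- err
				return
			}
		}
		done <- nil
	}()
	failures := 0
	for {
		select {
		case err := <-done:
			return failures, err
		default:
			if _, err := readJSON(path); err != nil {
				failures++
			}
		}
	}
}

func main() {
	failures, err := countReadErrors(500)
	fmt.Println("failed reads:", failures, "writer error:", err)
}
```

Under the stated assumption the reader should never observe a failed decode, because every open of state.json lands on a fully written file. Creating the temp file in the same directory matters: a rename across filesystems is not atomic.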

Solution

Either of the following resolves the problem:

Disable cpu-manager (set the kubelet flag --cpu-manager-policy to none).

Upgrade runc to a version containing the above fix.
