Cloud Native

Why Did My Kubernetes Pods Stop Communicating? Uncovering CoreDNS ReadinessProbe Failures

A sudden Kubernetes outage was traced to a CoreDNS ReadinessProbe failure on port 8181, and the article walks through symptom detection, code‑level root‑cause analysis, and a ConfigMap fix that restores inter‑pod communication.

Efficient Ops

1. Incident Background

A frantic early‑morning call reported that all pod‑to‑pod calls in a Kubernetes cluster had stopped, prompting an urgent investigation of CoreDNS logs that showed domain resolution errors.

2. Symptom Observation

Checking the CoreDNS pods revealed that ports 8080 and 9153 responded normally, while port 8181 was unreachable.

All CoreDNS pods were stuck in a 0/1 Running state, and <code>kubectl describe</code> showed a ReadinessProbe failure:

<code>Readiness probe failed: Get "http://x.x.x.x:8181/ready": dial tcp x.x.x.x:8181: connect: connection refused</code>

The ReadinessProbe determines whether a pod can be added to a Service endpoint; failure leaves the Service endpoint empty, breaking inter‑pod communication.

3. Root Cause Analysis

Reviewing the CoreDNS source showed that the <code>ready</code> plugin is what registers the HTTP server on port 8181: its <code>parse</code> function defaults the listen address to <code>:8181</code> and only validates an explicitly supplied address. If the plugin is never loaded from the Corefile, this setup code never runs, no listener is opened on 8181, and the kubelet's probe connection is refused.

<code>// plugin/ready: registration and address parsing (abridged)
func init() { plugin.Register("ready", setup) }

func setup(c *caddy.Controller) error {
    addr, err := parse(c)
    if err != nil {
        return plugin.Error("ready", err)
    }
    rd := &ready{Addr: addr}
    // Start the /ready HTTP server on addr (deduplicated per address).
    uniqAddr.Set(addr, rd.onStartup)
    c.OnStartup(func() error { uniqAddr.Set(addr, rd.onStartup); return nil })
    // ... omitted ...
    return nil
}

func parse(c *caddy.Controller) (string, error) {
    addr := ":8181" // default listen address when none is configured
    // ... omitted: read the plugin's arguments into args ...
    if len(args) == 1 {
        addr = args[0]
        // Reject addresses that are not in host:port form.
        if _, _, e := net.SplitHostPort(addr); e != nil {
            return "", e
        }
    }
    // ...
    return addr, nil
}
</code>

The ready plugin had never been added to the CoreDNS ConfigMap, so no listener was ever started on port 8181 and the probe's connections were refused.

4. Fix Method

Manually add the ready plugin with the correct port to the CoreDNS ConfigMap and restart the pods:

<code>apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        ready :8181   # added line
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        log
        cache 30
        loop
        reload
        loadbalance
    }
</code>

After applying the ConfigMap and restarting CoreDNS, the ReadinessProbe succeeded and services recovered.
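A quick post-fix sanity check is to resolve a name the way a pod would. The sketch below uses Go's resolver; in a cluster you would look up a real service name such as <code>kubernetes.default.svc.cluster.local</code> (a hypothetical target here — the runnable example resolves <code>localhost</code> so it works anywhere):

```go
package main

import (
	"fmt"
	"net"
)

// checkDNS resolves a hostname, returning its addresses or an error.
// If CoreDNS is still down inside a cluster, this is where lookups fail.
func checkDNS(name string) ([]string, error) {
	return net.LookupHost(name)
}

func main() {
	addrs, err := checkDNS("localhost")
	if err != nil {
		fmt.Println("resolution still failing:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```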

5. Result

CoreDNS pods now run normally, the ReadinessProbe passes, and the business workload is fully restored.

Tags: cloud native, Kubernetes, Troubleshooting, CoreDNS, ReadinessProbe
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
