Why Did My Kubernetes Pods Stop Communicating? Uncovering CoreDNS ReadinessProbe Failures
A sudden Kubernetes outage was traced to a CoreDNS ReadinessProbe failure on port 8181, and the article walks through symptom detection, code‑level root‑cause analysis, and a ConfigMap fix that restores inter‑pod communication.
1. Incident Background
A frantic early‑morning call reported that all pod‑to‑pod calls in a Kubernetes cluster had stopped, prompting an urgent investigation of CoreDNS logs that showed domain resolution errors.
2. Symptom Observation
Checking the CoreDNS pods revealed that ports 8080 and 9153 responded normally, while port 8181 was unreachable.
All CoreDNS pods were stuck in a 0/1 Running state, and kubectl describe showed a ReadinessProbe failure.
<code>Readiness probe failed: Get "http://x.x.x.x:8181/ready": dial tcp x.x.x.x:8181: connect: connection refused</code>
The ReadinessProbe determines whether a pod may be added to a Service's endpoints. When it fails, the pod is kept out of the endpoints list, so the kube-dns Service has no backends, name resolution stops, and inter-pod communication breaks.
3. Root Cause Analysis
Reviewing the CoreDNS source showed that the ready plugin registers an HTTP server, defaulting to port 8181, and that its parse function rejects any explicitly supplied address that is malformed. Crucially, this setup code only runs when ready appears in the Corefile; if the plugin is not configured at all, no listener is ever started on 8181 and the probe fails.
<code>// Register the ready plugin with CoreDNS.
func init() { plugin.Register("ready", setup) }

func setup(c *caddy.Controller) error {
	addr, err := parse(c)
	if err != nil {
		return plugin.Error("ready", err)
	}
	rd := &ready{Addr: addr}
	uniqAddr.Set(addr, rd.onStartup)
	c.OnStartup(func() error { uniqAddr.Set(addr, rd.onStartup); return nil })
	// ... omitted ...
	return nil
}

func parse(c *caddy.Controller) (string, error) {
	addr := ":8181" // default listen address
	// ...
	if len(args) == 1 {
		addr = args[0]
		if _, _, e := net.SplitHostPort(addr); e != nil {
			return "", e
		}
	}
	// ...
	return addr, nil
}
</code>
In this cluster, the ready plugin had never been added to the CoreDNS ConfigMap, so CoreDNS never started a listener on the default address :8181, and the probe's connection to that port was refused.
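The validation in parse above hinges on net.SplitHostPort: the default ":8181" is a valid host:port pair (empty host, port 8181), while a bare "8181" with no colon is rejected. A small standalone sketch of that check:

```go
package main

import (
	"fmt"
	"net"
)

// splitAddr mirrors the validation the ready plugin's parse function
// applies to its optional address argument: the value must be in
// "host:port" form or net.SplitHostPort returns an error.
func splitAddr(addr string) (host, port string, err error) {
	return net.SplitHostPort(addr)
}

func main() {
	// The plugin's default, ":8181", parses cleanly: empty host, port "8181".
	h, p, err := splitAddr(":8181")
	fmt.Printf("host=%q port=%q err=%v\n", h, p, err)

	// A bare port with no colon is rejected, so a Corefile line like
	// `ready 8181` (without the colon) would fail to parse.
	_, _, err = splitAddr("8181")
	fmt.Println(err)
}
```

This is why the fix below writes the directive as `ready :8181` rather than `ready 8181`.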
4. Fix Method
Manually add the ready plugin, listening on port 8181 to match the Deployment's ReadinessProbe, to the CoreDNS ConfigMap and restart the pods:
<code>apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        ready :8181   # added line
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        log
        cache 30
        loop
        reload
        loadbalance
    }
</code>
After applying the updated ConfigMap and restarting the CoreDNS pods, the ReadinessProbe succeeded and services recovered.
5. Result
CoreDNS pods now run normally, the ReadinessProbe passes, and the business workload is fully restored.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.