How to Observe and Diagnose DNS Failures in Kubernetes Clusters
This article explains how DNS operates inside Kubernetes, enumerates common failure causes, describes CoreDNS's built‑in observability plugins, introduces BPF‑based client‑side diagnostics, and provides a step‑by‑step troubleshooting workflow to identify and resolve DNS issues in cloud‑native environments.
01 Introduction
This article introduces methods for observing and diagnosing DNS failures in Kubernetes clusters. It covers DNS fundamentals, typical failure causes, server‑side and client‑side diagnostic techniques, and a practical case study.
02 How DNS Works in Kubernetes
DNS is pervasive in a Kubernetes cluster: service discovery for micro‑services, node discovery for distributed databases, and communication between logging/monitoring components and the API server all rely on DNS. When a web app connects to a database, the following steps occur:
The application is configured with database:6379 as the connection string.
On startup, the app queries the DNS server for the actual IP of database.
The DNS server returns the ClusterIP of the database Service.
The request reaches the Service IP, IPVS performs DNAT, and traffic is forwarded to the real database pod (e.g., 172.20.1.7).
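The pod’s /etc/resolv.conf points to the kube‑dns ClusterIP (e.g., 172.21.0.10), which is served by CoreDNS replicas behind an IPVS load balancer. The search path and ndots settings determine how short names are expanded to fully qualified domain names (FQDNs). With the Kubernetes default of ndots:5, a name containing fewer than five dots is tried against every search domain before it is queried as‑is, so a poorly chosen search path or ndots value multiplies the number of queries and can noticeably increase resolution latency.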
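The expansion described above can be sketched in a few lines of Python. This is an illustration of glibc-style behavior as documented in resolv.conf(5), not the resolver's actual code; other resolvers (musl, for example) differ in details. The search list and the function name candidate_names are assumptions chosen for the example.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a glibc-style stub resolver would try, in order.

    Illustrative sketch only: real resolvers also stop at the first
    successful answer and may issue A and AAAA queries for each name.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no expansion.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, the name as-is last.
    return expanded + [absolute]


# Hypothetical search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for candidate in candidate_names("www.example.com", SEARCH):
        print(candidate)
```

With this search list, an external name such as www.example.com produces three search-domain lookups before the name itself is queried. Writing the name with a trailing dot, or lowering ndots for the pod, avoids the extra round trips.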
03 Common DNS Failure Root Causes
Large blast radius: almost every data‑plane and control‑plane component depends on DNS.
Long request chain: DNS queries traverse kernel IPVS, iptables, and other modules.
High QPS: frequent application queries and search‑path expansion multiply request volume.
Operational experience shows that DNS failures arise from many different sources, including CPU bottlenecks, Conntrack table exhaustion, source‑port reuse races, IPVS backend changes, and CoreDNS software bugs.
CPU Limits
CoreDNS QPS scales with CPU. If a CoreDNS replica exceeds one CPU core during peak load, latency or failures appear. Recommended practice: keep each replica’s peak CPU below one core and scale replicas roughly at a ratio of one CoreDNS pod per eight cluster nodes.
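As a rough worked example of the one-pod-per-eight-nodes ratio, the calculation below rounds up to a whole replica. The floor of two replicas is an assumption added here for availability, not something stated in the recommendation above.

```python
import math


def recommended_replicas(node_count, nodes_per_replica=8, minimum=2):
    """Estimate a CoreDNS replica count from the cluster node count.

    nodes_per_replica=8 follows the ratio in the text; minimum=2 is an
    assumed floor so that a single pod restart does not take DNS down.
    """
    return max(minimum, math.ceil(node_count / nodes_per_replica))


if __name__ == "__main__":
    for nodes in (3, 8, 100):
        print(nodes, "nodes ->", recommended_replicas(nodes), "replicas")
```

The ratio is only a starting point; the per-replica CPU peak remains the signal that decides whether more replicas are needed.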
Conntrack Table Limits
Each tracked TCP or UDP flow consumes a Conntrack entry. A high rate of short‑lived requests can fill the table, at which point the kernel drops new packets and logs “nf_conntrack: table full, dropping packet”. Expanding the table or using longer‑lived connections mitigates the issue. Similar limits exist for ARP tables and socket counts.
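One way to watch for this on a node is to compare the current entry count with the configured maximum, both of which the kernel exposes under /proc/sys/net/netfilter. The sketch below assumes the nf_conntrack module is loaded; the 80% warning threshold is an arbitrary value chosen for illustration.

```python
from pathlib import Path

# Standard locations when the nf_conntrack module is loaded.
PROC_DIR = Path("/proc/sys/net/netfilter")


def parse_usage(count_text, max_text):
    """Turn the raw file contents into (count, limit, ratio)."""
    count, limit = int(count_text), int(max_text)
    return count, limit, count / limit


def read_usage(proc_dir=PROC_DIR):
    """Read the live counters from procfs on the current node."""
    count_text = (Path(proc_dir) / "nf_conntrack_count").read_text()
    max_text = (Path(proc_dir) / "nf_conntrack_max").read_text()
    return parse_usage(count_text, max_text)


def is_near_limit(ratio, threshold=0.8):
    """threshold is an assumed warning level, not a kernel constant."""
    return ratio >= threshold


if __name__ == "__main__":
    count, limit, ratio = read_usage()
    status = "WARNING" if is_near_limit(ratio) else "ok"
    print(f"conntrack entries: {count}/{limit} ({ratio:.0%}) {status}")
```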
Source‑Port Reuse Race Conditions
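Containers based on Alpine Linux use the musl libc, whose resolver may send the A and AAAA queries concurrently from the same source port. This can trigger a Conntrack race in the kernel that drops one of the DNS packets, which the application then sees as a resolution delay or timeout.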
IPVS Backend Changes
When IPVS reloads backends or a CoreDNS replica restarts, packets that reuse a recently used source port can be dropped, leading to resolution delays. Kernel upgrades can resolve these races.
CoreDNS Software Bugs
Older CoreDNS versions suffer from panics on API server disconnections, AutoPath plugin crashes, and other intermittent failures. Upgrading to the latest stable release is strongly advised.
04 CoreDNS Built‑in Observability Capabilities
CoreDNS’s plugin architecture provides rich observability. The most useful plugins are:
Log
Logs each request with domain, source IP, and response code, similar to Nginx access logs. Logs can be shipped to external systems for trend analysis and alerting.
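For a quick look at response-code trends without an external log system, the log lines can be tallied with a short script. The sketch below assumes the log plugin's default (common) format; a custom format in the Corefile would need a different pattern. The sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumes the default format of the log plugin, for example:
# [INFO] 172.20.1.7:48211 - 3912 "A IN database. udp 54 false 512" NOERROR qr,aa,rd 106 0.000187s
LINE_RE = re.compile(
    r'(?P<client>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) [^"]*" '
    r'(?P<rcode>[A-Z]+) \S+ (?P<rsize>\d+) (?P<duration>[\d.]+)s'
)


def count_rcodes(lines):
    """Count response codes, skipping lines that are not query logs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("rcode")] += 1
    return counts


# Fabricated sample lines, for illustration only.
SAMPLE_LINES = [
    '[INFO] 172.20.1.7:48211 - 3912 "A IN database.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000187s',
    '[INFO] 172.20.1.7:48211 - 3913 "AAAA IN www.example.com.default.svc.cluster.local. udp 66 false 512" NXDOMAIN qr,aa,rd 159 0.000142s',
    '[INFO] plugin/reload: Running configuration SHA512 = ...',
]

if __name__ == "__main__":
    for rcode, count in count_rcodes(SAMPLE_LINES).items():
        print(rcode, count)
```

A sudden rise in NXDOMAIN for names ending in a search domain usually points at search-path expansion rather than a genuine failure, so it helps to group by the name field as well.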
Dump
Prints a line when CoreDNS receives a client request, useful when the Log plugin shows no output.
Debug
Outputs full DNS packets in hexadecimal when network anomalies or upstream errors occur, facilitating Wireshark analysis.
DNSTap
Exports binary logs of DNS traffic to a remote dnstap server, enabling low‑overhead collection, storage, and anomaly detection. The server can parse RCODE fields and message types to classify failures.
Trace
Implements OpenTracing, recording the lifecycle of a request across plugins. This helps pinpoint latency hotspots.
Prometheus
Exposes CoreDNS metrics (RCODE trends, QPS, etc.) for scraping by Prometheus and visualization in Grafana. Thresholds can trigger real‑time alerts.
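To make the RCODE metric concrete, the sketch below parses a scrape of the metrics endpoint and computes the share of one response code. It assumes the counter is named coredns_dns_responses_total with an rcode label, as in recent CoreDNS releases; older versions use different names. The sample values are invented.

```python
import re
from collections import defaultdict

# Metric name assumed from recent CoreDNS releases.
SAMPLE_RE = re.compile(
    r'^coredns_dns_responses_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def responses_by_rcode(metrics_text):
    """Sum the response counter per rcode across servers and zones."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("rcode", "UNKNOWN")] += float(match.group("value"))
    return dict(totals)


def rcode_share(totals, rcode):
    """Fraction of all responses that carried the given rcode."""
    total = sum(totals.values())
    return totals.get(rcode, 0.0) / total if total else 0.0


# Invented scrape output, for illustration only.
SAMPLE = """
# TYPE coredns_dns_responses_total counter
coredns_dns_responses_total{plugin="",rcode="NOERROR",server="dns://:53",zone="."} 9500
coredns_dns_responses_total{plugin="",rcode="NXDOMAIN",server="dns://:53",zone="."} 400
coredns_dns_responses_total{plugin="",rcode="SERVFAIL",server="dns://:53",zone="."} 100
"""

if __name__ == "__main__":
    totals = responses_by_rcode(SAMPLE)
    print(f"SERVFAIL share: {rcode_share(totals, 'SERVFAIL'):.2%}")
```

The counters are cumulative since process start, so a real alert would be built on the rate over a time window in Prometheus rather than on a single scrape.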
05 BPF‑Based Client‑Side DNS Anomaly Detection
Many DNS failures occur on the client side, where CoreDNS never sees the request. Traditional packet captures are costly for low‑frequency issues. BPF tools like trace_dns_drops.bt monitor kernel functions that drop DNS packets, printing the source IP and port when a drop is detected. The tool can be extended to extract DNS query types and names. Other community tools such as BCC’s gethostlatency.py and Cilium’s pwru also aid in kernel‑level DNS diagnostics.
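Extracting the query name and type from a dropped packet comes down to decoding the question section of the DNS message. The Python sketch below shows that layout on a raw UDP payload; it is not taken from trace_dns_drops.bt, and a BPF tool would have to do the equivalent with bounded loops in the kernel or after copying the payload to user space.

```python
import struct

# A few common query types; anything else is reported by number.
QTYPES = {1: "A", 5: "CNAME", 12: "PTR", 28: "AAAA", 33: "SRV"}


def parse_question(payload):
    """Return (transaction_id, name, qtype) for the first DNS question."""
    if len(payload) < 12:
        raise ValueError("payload shorter than the 12-byte DNS header")
    txid, _flags, qdcount = struct.unpack("!HHH", payload[:6])
    if qdcount < 1:
        raise ValueError("message carries no question")
    labels = []
    offset = 12  # the question section starts right after the header
    while True:
        length = payload[offset]
        offset += 1
        if length == 0:
            break  # zero-length label terminates the name
        if length > 63:
            raise ValueError("compressed or invalid label in question")
        label = payload[offset:offset + length]
        labels.append(label.decode("ascii", errors="replace"))
        offset += length
    qtype, _qclass = struct.unpack("!HH", payload[offset:offset + 4])
    return txid, ".".join(labels) + ".", QTYPES.get(qtype, str(qtype))


# Hand-built AAAA query for "database", for illustration only.
SAMPLE_QUERY = (
    struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    + b"\x08database\x00"
    + struct.pack("!HH", 28, 1)
)

if __name__ == "__main__":
    print(parse_question(SAMPLE_QUERY))
```

Pairing the transaction ID and query type with the source port makes it easier to tell whether the dropped packet was the A or the AAAA half of a concurrent lookup.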
06 How to Respond to a DNS Failure
Clarify the Problem
Avoid premature conclusions; DNS resolution paths in Kubernetes are long.
Identify the exact error (e.g., connection refused, NXDOMAIN, timeout).
Verify domain spelling and that Pod DNS and CoreDNS configurations match expectations.
Check recent changes (node expansions, security‑group updates, etc.).
Collect Information
Gather logs from the client pod, the node it runs on, CoreDNS pods, and the API server.
Inspect Kubernetes events, node metrics, and any traffic spikes.
Validate Hypotheses
Run dig from the client pod, the node, and other pods to test DNS resolution paths.
Enable CoreDNS observability plugins or run BPF/tcpdump to confirm packet loss or latency.
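When dig is not available in the client image, a small probe can exercise the same path the application uses. The sketch below resolves a name repeatedly through the system resolver and records latency per attempt; the function names and the attempt count are illustrative choices. Because socket.getaddrinfo goes through the libc resolver, it honors the search and ndots settings in /etc/resolv.conf, whereas dig ignores the search list unless +search is given.

```python
import socket
import time


def probe(name, attempts=5, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Resolve name repeatedly; return a list of (outcome, seconds)."""
    results = []
    for _ in range(attempts):
        start = clock()
        try:
            resolver(name, None)
            outcome = "ok"
        except socket.gaierror as exc:
            outcome = f"error: {exc}"
        results.append((outcome, clock() - start))
    return results


def summarize(results):
    """Return (failure_count, slowest_attempt_in_seconds)."""
    failures = sum(1 for outcome, _ in results if outcome != "ok")
    slowest = max((elapsed for _, elapsed in results), default=0.0)
    return failures, slowest


if __name__ == "__main__":
    # "database" is the example Service name used earlier in this article.
    failures, slowest = summarize(probe("database"))
    print(f"failures={failures} slowest={slowest * 1000:.1f} ms")
```

Running the probe from the affected pod, from another pod on the same node, and from a pod on a different node helps separate client, node, and server problems. A slowest attempt close to a multiple of the resolver timeout suggests dropped packets rather than a slow server.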
Fix the Issue
Apply a solution targeting the identified root cause.
Roll out changes incrementally with rollback plans.
Verify recovery through comprehensive testing and dashboard monitoring.
During investigation, ask about failure frequency (continuous, peak‑time, intermittent) and scope (whole cluster, specific pods, random nodes). These answers narrow down potential causes.
07 Summary
This article covered the server‑side and client‑side causes of DNS failures, CoreDNS observability plugins, BPF‑based detection, and a systematic troubleshooting workflow. DNS reliability is critical for Kubernetes, so designing for stability (e.g., local caching) and investing in proactive observability are both essential.
08 References
DNS Best Practices: https://help.aliyun.com/document_detail/172339.html
PPT Download: https://kccncosschn21.sched.com/event/pcag/huan-daepkubernetes-zhong-shi-dns-dun-zha-qi-tizhong-best-practice-dns-failure-observability-and-diagnosis-in-kubernetes-yuning-xie-alibaba
/etc/resolv.conf Documentation: https://man7.org/linux/man-pages/man5/resolv.conf.5.html
Log Dashboard Example: https://help.aliyun.com/document_detail/213461.html
Problem Bounding Document: https://help.aliyun.com/document_detail/268638.html
BPF Example Tool (trace_dns_drops.bt): https://gist.github.com/xh4n3/61d8081b834d7e21bff723614e07777c
DNS Issue Troubleshooting Guide: https://help.aliyun.com/document_detail/404754.html
Alibaba Cloud Native
