
How to Observe and Diagnose DNS Failures in Kubernetes Clusters

This article explains how DNS operates inside Kubernetes, enumerates common failure causes, describes CoreDNS's built‑in observability plugins, introduces BPF‑based client‑side diagnostics, and provides a step‑by‑step troubleshooting workflow to identify and resolve DNS issues in cloud‑native environments.

Alibaba Cloud Native

01 Introduction

This topic introduces methods for achieving DNS failure observability and diagnosing problems in Kubernetes clusters. It covers DNS fundamentals, typical failure reasons, server‑side and client‑side diagnostic techniques, and a practical case study.

02 How DNS Works in Kubernetes

DNS is pervasive in a Kubernetes cluster: service discovery for micro‑services, node discovery for distributed databases, and communication between logging/monitoring components and the API server all rely on DNS. When a web app connects to a database, the following steps occur:

1. The application is configured with database:6379 as the connection string.

2. On startup, the app queries the DNS server for the actual IP of database.

3. The DNS server returns the ClusterIP of the database Service.

4. The request reaches the Service IP, IPVS performs DNAT, and traffic is forwarded to the real database pod (e.g., 172.20.1.7).

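The pod’s /etc/resolv.conf points to the kube‑dns ClusterIP (e.g., 172.21.0.10), which is served by CoreDNS replicas behind an IPVS load balancer. The search path and ndots settings control how short names are expanded to fully qualified domain names (FQDNs): a name with fewer dots than ndots is tried against each search domain before it is queried as written. A misconfigured search path or ndots value therefore multiplies the number of queries per lookup and can dramatically increase resolution latency.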
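To make the effect of the search path and ndots concrete, the sketch below lists the names a glibc‑style stub resolver would try, in order. It is an illustration rather than the resolver's actual code; the function name candidate_names and the search list (a pod in the default namespace of a cluster whose domain is cluster.local) are assumptions made for the example. The ndots value of 5 is the Kubernetes default for pods using the cluster DNS policy.

```python
def candidate_names(name, search, ndots=5):
    """Return the FQDNs a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as written first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, try the name as written last.
    return expanded + [absolute]


# Assumed search list for a pod in the "default" namespace.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for fqdn in candidate_names("database", search):
    print(fqdn)

# An external name with only two dots still walks the whole search list first.
print(len(candidate_names("www.example.com", search)))
```

With these settings an external name such as www.example.com produces four candidates, and a client that asks for both A and AAAA records sends a query of each type per candidate. Writing the name with a trailing dot skips the expansion entirely.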

03 Common DNS Failure Root Causes

Three characteristics make DNS failures in Kubernetes both high‑impact and hard to localize:

Large blast radius: almost every data‑plane and control‑plane component depends on DNS.

Long request chain: DNS queries traverse kernel IPVS, iptables, and other modules before they reach CoreDNS.

High QPS: frequent application queries and search‑path expansion multiply request volume.

Operational experience shows that DNS failures arise from many different sources, including CPU bottlenecks, Conntrack table exhaustion, source‑port reuse races, IPVS backend changes, and CoreDNS software bugs.

CPU Limits

CoreDNS QPS scales with CPU. If a CoreDNS replica exceeds one CPU core during peak load, latency or failures appear. Recommended practice: keep each replica’s peak CPU below one core and scale replicas roughly at a ratio of one CoreDNS pod per eight cluster nodes.
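The one‑pod‑per‑eight‑nodes ratio can be turned into a quick sizing helper. This is only a sketch of the arithmetic: the function name recommended_replicas and the floor of two replicas (so that a single pod restart does not take down resolution) are assumptions, not part of the recommendation above.

```python
import math


def recommended_replicas(node_count, nodes_per_replica=8, minimum=2):
    """Estimate a CoreDNS replica count from the cluster node count."""
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    # Round up so that a partially filled group of nodes still gets a replica.
    by_ratio = math.ceil(node_count / nodes_per_replica)
    # The floor of two replicas is an assumption made for availability.
    return max(minimum, by_ratio)


for nodes in (5, 17, 100):
    print(nodes, "nodes ->", recommended_replicas(nodes), "replicas")
```

The ratio is a starting point; the per‑replica CPU peak remains the signal that decides whether more replicas are needed.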

Conntrack Table Limits

Each TCP/UDP connection consumes a Conntrack entry. High short‑lived request rates can fill the table, causing “Conntrack table full” errors. Expanding the table or using longer‑lived connections mitigates the issue. Similar limits exist for ARP tables and socket counts.
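A node's Conntrack utilization can be checked before the table fills up. The sketch below reads the two counters that the kernel exposes under /proc/sys/net/netfilter when the nf_conntrack module is loaded. The helper names and the 80% warning threshold are assumptions chosen for illustration.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(count, maximum):
    """Return table utilization as a fraction between 0 and 1."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def read_conntrack(proc_dir=PROC_DIR):
    """Read the current and maximum number of Conntrack entries."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return count, maximum


if __name__ == "__main__":
    try:
        count, maximum = read_conntrack()
        usage = conntrack_usage(count, maximum)
    except (OSError, ValueError):
        print("conntrack counters are not available on this host")
    else:
        # The 0.8 threshold is an assumed warning level, not a kernel limit.
        flag = "WARN" if usage >= 0.8 else "ok"
        print(f"{flag}: {count}/{maximum} entries ({usage:.0%})")
```

Run on the node itself (or in a pod that shares the host network namespace), the script shows how close the table is to the point where new DNS flows start being dropped.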

Source‑Port Reuse Race Conditions

Containers based on Alpine Linux, which uses the musl libc, may send concurrent A and AAAA queries from the same source port, triggering a kernel Conntrack race that drops DNS packets.

IPVS Backend Changes

When IPVS reloads backends or a CoreDNS replica restarts, packets that reuse a recently used source port can be dropped, leading to resolution delays. Kernel upgrades can resolve these races.

CoreDNS Software Bugs

Older CoreDNS versions suffer from panics on API server disconnections, AutoPath plugin crashes, and other intermittent failures. Upgrading to the latest stable release is strongly advised.

04 CoreDNS Built‑in Observability Capabilities

CoreDNS’s plugin architecture provides rich observability. The most useful plugins are:

Log

Logs each request with domain, source IP, and response code, similar to Nginx access logs. Logs can be shipped to external systems for trend analysis and alerting.
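As a small example of the trend analysis mentioned above, the sketch below tallies response codes from Log plugin output. It assumes the plugin's default log format; a custom format directive would need a different pattern. The two log lines are synthetic samples written for this example, not captured output.

```python
import re
from collections import Counter

# Pattern for the default Log plugin format (an assumption about the setup):
# remote:port - id "type class name proto size do bufsize" rcode flags rsize duration
LOG_LINE = re.compile(
    r'(?P<remote>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) [^"]*" '
    r'(?P<rcode>\S+) (?P<flags>\S+) (?P<rsize>\d+) (?P<duration>[\d.]+)s'
)


def rcode_counts(lines):
    """Count responses per RCODE, skipping lines that are not query logs."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[match["rcode"]] += 1
    return counts


# Synthetic sample lines for illustration only.
sample = [
    '[INFO] 172.20.1.7:40212 - 48979 "A IN database.default.svc.cluster.local. '
    'udp 54 false 512" NOERROR qr,aa,rd 106 0.000091s',
    '[INFO] 172.20.1.7:40212 - 48980 "A IN database.svc.cluster.local. '
    'udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000120s',
    '[INFO] plugin/reload: Running configuration SHA512 = ...',
]

print(rcode_counts(sample))
```

A rising share of NXDOMAIN responses for names ending in a search domain usually points at search‑path expansion rather than at a genuine failure, which is why the queried name is worth keeping alongside the RCODE.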

Dump

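Prints a line as soon as CoreDNS receives a client request, before any response is produced. This is useful when the Log plugin shows no output for a query. Note that dump is distributed as an external CoreDNS plugin, so it may not be present in a default build.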

Debug

Outputs full DNS packets in hexadecimal when network anomalies or upstream errors occur, facilitating Wireshark analysis.

DNSTap

Exports binary logs of DNS traffic to a remote dnstap server, enabling low‑overhead collection, storage, and anomaly detection. The server can parse RCODE fields and message types to classify failures.

Trace

Implements OpenTracing, recording the lifecycle of a request across plugins. This helps pinpoint latency hotspots.

Prometheus

Exposes CoreDNS metrics (RCODE trends, QPS, etc.) for scraping by Prometheus and visualization in Grafana. Thresholds can trigger real‑time alerts.
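The RCODE trend can also be inspected without a full Prometheus stack by parsing the text that the metrics endpoint returns. The sketch below assumes the counter is named coredns_dns_responses_total and carries an rcode label, which is the case in recent CoreDNS releases; older releases used a different metric name. The metrics text is a synthetic sample.

```python
import re
from collections import defaultdict

# Metric name assumed from recent CoreDNS releases; adjust for older versions.
SAMPLE = re.compile(
    r'^coredns_dns_responses_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
RCODE = re.compile(r'rcode="(?P<rcode>[^"]*)"')


def responses_by_rcode(metrics_text):
    """Sum the response counter per RCODE across servers and zones."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        sample = SAMPLE.match(line.strip())
        if not sample:
            continue  # comments, other metrics, blank lines
        rcode = RCODE.search(sample["labels"])
        if rcode:
            totals[rcode["rcode"]] += float(sample["value"])
    return dict(totals)


def error_share(totals, rcode="SERVFAIL"):
    """Return the fraction of all responses that carry the given RCODE."""
    total = sum(totals.values())
    return totals.get(rcode, 0.0) / total if total else 0.0


# Synthetic exposition text for illustration only.
text = """\
# TYPE coredns_dns_responses_total counter
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 9500
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 400
coredns_dns_responses_total{rcode="SERVFAIL",server="dns://:53",zone="."} 100
"""

totals = responses_by_rcode(text)
print(totals, f"SERVFAIL share: {error_share(totals):.2%}")
```

These counters are cumulative since process start, so an alert should compare two scrapes (or use a rate in Prometheus) rather than the raw totals shown here.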

05 BPF‑Based Client‑Side DNS Anomaly Detection

Many DNS failures occur on the client side, where CoreDNS never sees the request. Traditional packet captures are costly for low‑frequency issues. BPF tools like trace_dns_drops.bt monitor kernel functions that drop DNS packets, printing the source IP and port when a drop is detected. The tool can be extended to extract DNS query types and names. Other community tools such as BCC’s gethostlatency.py and Cilium’s pwru also aid in kernel‑level DNS diagnostics.
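Where loading BPF programs is not an option, a rough user‑space approximation of what gethostlatency reports is to time the resolver call from inside the affected pod. The sketch below is not a BPF tool and cannot see kernel packet drops; it only measures how long the lookup takes from the application's point of view. The function name timed_lookup and the injectable resolver parameter are assumptions made for the example.

```python
import socket
import time


def timed_lookup(host, resolver=socket.getaddrinfo):
    """Resolve host once and return (elapsed_ms, error).

    error is None on success and the raised socket.gaierror otherwise.
    """
    start = time.perf_counter()
    try:
        resolver(host, None)
        error = None
    except socket.gaierror as exc:
        error = exc
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms, error


def report(hosts):
    """Print one latency line per host name."""
    for host in hosts:
        elapsed_ms, error = timed_lookup(host)
        status = "ok" if error is None else f"failed: {error}"
        print(f"{host}: {elapsed_ms:.1f} ms ({status})")
```

Calling report(["database", "database.default.svc.cluster.local."]) from the client pod compares a short name with its fully qualified form. Lookups that take whole multiples of the resolver timeout suggest that packets are being dropped and retried somewhere along the path.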

06 How to Respond to a DNS Failure

Clarify the Problem

Avoid premature conclusions; DNS resolution paths in Kubernetes are long.

Identify the exact error (e.g., connection refused, NXDOMAIN, timeout).

Verify domain spelling and that Pod DNS and CoreDNS configurations match expectations.

Check recent changes (node expansions, security‑group updates, etc.).

Collect Information

Gather logs from the client pod, the node it runs on, CoreDNS pods, and the API server.

Inspect Kubernetes events, node metrics, and any traffic spikes.

Validate Hypotheses

Run dig from the client pod, the node, and other pods to test DNS resolution paths.

Enable CoreDNS observability plugins or run BPF/tcpdump to confirm packet loss or latency.
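When dig is not installed in the client image, a minimal probe can stand in for it. The sketch below builds a single A query by hand, sends it over UDP, and reports the RCODE or a timeout. It is a simplified illustration under stated assumptions: it does not check that the response ID matches the query, does not retry, and does not fall back to TCP. The function names are hypothetical.

```python
import socket
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Encode a single-question DNS query (qtype 1 = A, 28 = AAAA)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Question: QNAME ends with a zero byte, then QTYPE and QCLASS=IN.
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def response_rcode(packet):
    """Return the RCODE name from a raw DNS response."""
    if len(packet) < 12:
        raise ValueError("truncated DNS header")
    flags = struct.unpack("!H", packet[2:4])[0]
    code = flags & 0x000F  # RCODE is the low four bits of the flags word
    return RCODES.get(code, f"RCODE{code}")


def probe(server, name, timeout=2.0):
    """Send one A query over UDP; return the RCODE name or 'TIMEOUT'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        try:
            packet, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    return response_rcode(packet)
```

Running probe("172.21.0.10", "database.default.svc.cluster.local") against the kube‑dns ClusterIP and then against an individual CoreDNS pod IP helps separate a failing replica from a problem on the Service path in front of it.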

Fix the Issue

Apply a solution targeting the identified root cause.

Roll out changes incrementally with rollback plans.

Verify recovery through comprehensive testing and dashboard monitoring.

During investigation, ask about failure frequency (continuous, peak‑time, intermittent) and scope (whole cluster, specific pods, random nodes). These answers narrow down potential causes.

07 Summary

This article covered the causes of DNS failures on both the CoreDNS server side and the client side, CoreDNS's observability plugins, BPF‑based detection, and a systematic troubleshooting workflow. DNS reliability is critical for Kubernetes; designing for stability (e.g., local caching) and investing in proactive observability are both essential.

08 References

DNS Best Practices: https://help.aliyun.com/document_detail/172339.html

PPT Download: https://kccncosschn21.sched.com/event/pcag/huan-daepkubernetes-zhong-shi-dns-dun-zha-qi-tizhong-best-practice-dns-failure-observability-and-diagnosis-in-kubernetes-yuning-xie-alibaba

/etc/resolv.conf Documentation: https://man7.org/linux/man-pages/man5/resolv.conf.5.html

Log Dashboard Example: https://help.aliyun.com/document_detail/213461.html

Problem Bounding Document: https://help.aliyun.com/document_detail/268638.html

BPF Example Tool (trace_dns_drops.bt): https://gist.github.com/xh4n3/61d8081b834d7e21bff723614e07777c

DNS Issue Troubleshooting Guide: https://help.aliyun.com/document_detail/404754.html
