
Redesigning Kubernetes DNS Architecture with q-dnsmasq for Improved Reliability and Performance

This article details the motivation, design, implementation, testing, and rollout of a refactored Kubernetes DNS solution that replaces the default kube-dns → CoreDNS chain with a node‑local q‑dnsmasq cache and parallel upstream queries to achieve higher availability, faster resolution, and better cache hit rates in large‑scale clusters.

Qunar Tech Salon

Jun 21, 2024


Problem Background

DNS is a critical service in a Kubernetes cluster. Misconfiguration or sheer cluster size often leads to resolution timeouts and failures that impact business services, and a failed or heavily loaded node hosting CoreDNS makes the problem worse. These issues prompted a redesign of the native DNS architecture.

Solution Comparison

The native approach routes each query from the pod to the kube-dns Service and on to a CoreDNS pod. It suffers from a large fault domain, a limited cache hit rate, and slow fallback when the primary DNS server is unavailable. The refactored approach adds a node-local q-dnsmasq service that pods query first; if it is unavailable, pods fall back to the kube-dns Service. q-dnsmasq runs in all-servers mode: it forwards internal queries to multiple CoreDNS instances and external queries to multiple LocalDNS servers in parallel, and returns whichever reply arrives first.
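The "query every upstream, keep the first reply" behaviour described above can be illustrated with a small simulation. This is only a sketch of the idea, not dnsmasq's implementation: the upstream labels, delays, and the `fake_upstream` helper are hypothetical, and no real DNS traffic is sent.

```python
import concurrent.futures
import time


def query_all_servers(upstreams, name):
    """Send the same query to every upstream at once; return (label, answer)
    for the first upstream that replies successfully.

    `upstreams` maps a label to a callable taking a name and returning an
    answer, or raising on failure/timeout.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(upstreams))
    futures = {pool.submit(fn, name): label for label, fn in upstreams.items()}
    try:
        # as_completed yields futures in the order they finish.
        for fut in concurrent.futures.as_completed(futures):
            if fut.exception() is None:
                return futures[fut], fut.result()
        raise RuntimeError("all upstreams failed")
    finally:
        # Do not wait for the slower upstreams once a winner is known.
        pool.shutdown(wait=False)


def fake_upstream(delay_s, answer=None):
    """Hypothetical stand-in for a DNS server: sleeps, then answers or fails."""
    def resolve(name):
        time.sleep(delay_s)
        if answer is None:
            raise TimeoutError(name)
        return answer
    return resolve


upstreams = {
    "coredns-a": fake_upstream(0.20, "10.0.0.1"),    # healthy but slow
    "coredns-b": fake_upstream(0.01, "10.0.0.1"),    # healthy and fast
    "coredns-dead": fake_upstream(0.05, None),       # never answers
}
winner, answer = query_all_servers(upstreams, "kubernetes.default.svc")
print(winner, answer)
```

The caller's latency tracks the fastest healthy upstream, and a dead upstream costs nothing as long as one other server answers.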

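Native Solution Drawbacks

1) Large fault domain: a single failure affects many pods at once.
2) Long retry delays: pods use the ClusterFirst DNS policy with a single kube-dns IP as their only nameserver, so every retry goes to the same address.
3) Low cache hit rate: pod queries are spread across the CoreDNS instances behind the Service, so each instance caches only part of the traffic.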
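To see why a single nameserver makes retries slow, consider a deliberately simplified model of a stub resolver that walks its nameserver list in order. The defaults `timeout_s=5` and `attempts=2` match the documented resolv.conf defaults; the assumption that every try against an unreachable server costs one full timeout is a simplification, and the server labels are hypothetical.

```python
def time_to_answer(nameservers, timeout_s=5, attempts=2):
    """Return (answered, seconds_waited) for an ordered nameserver list.

    `nameservers` is a list of (label, is_up) pairs in resolv.conf order.
    Simplified model: a try against a down server costs one full timeout,
    and a healthy server answers instantly.
    """
    waited = 0
    for _ in range(attempts):
        for _label, is_up in nameservers:
            if is_up:
                return True, waited
            waited += timeout_s
    return False, waited


# Native layout: one nameserver, currently unreachable.
native = time_to_answer([("kube-dns", False)])

# Refactored layout: node-local cache is down, kube-dns is the backup.
refactored = time_to_answer([("q-dnsmasq", False), ("kube-dns", True)])

print(native)      # (False, 10)
print(refactored)  # (True, 5)
```

Under this model the second nameserver turns a failed lookup into a delayed one. Real resolvers scale timeouts differently, so the absolute numbers are illustrative only.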

Refactored Architecture Diagram

(Image omitted.) The diagram shows IDC pods and cloud pods, the node-local q-dnsmasq, host-networked CoreDNS, and the corporate LocalDNS servers.

Resolution Flow

Internal Kubernetes domains: Pod → q-dnsmasq (cache) → CoreDNS:xx53 and kube-dns:53.
External domains: Pod → q-dnsmasq (cache) → all LocalDNS:53 servers.
Fallback when q-dnsmasq fails: Pod → kube-dns:53 → CoreDNS:xx53 (cache) → LocalDNS.
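The three paths above can be written as a small decision function. This is a sketch under two assumptions not stated in the source: internal names are recognised by the default cluster domain `cluster.local`, and in the fallback path only external names continue on to LocalDNS. The port `xx53` is kept exactly as the source elides it.

```python
def resolution_path(name, q_dnsmasq_up=True, cluster_domain="cluster.local"):
    """Return the list of hops a query is expected to take.

    `cluster_domain` is an assumption (the Kubernetes default); hop labels
    are copied from the article, including the elided port "xx53".
    """
    fqdn = name.rstrip(".").lower()
    internal = fqdn == cluster_domain or fqdn.endswith("." + cluster_domain)

    if not q_dnsmasq_up:
        # Fallback: the pod's secondary nameserver is the kube-dns Service.
        path = ["pod", "kube-dns:53", "CoreDNS:xx53"]
        if not internal:
            path.append("LocalDNS")
        return path

    upstream = "CoreDNS:xx53 + kube-dns:53" if internal else "all LocalDNS:53"
    return ["pod", "q-dnsmasq", upstream]


print(resolution_path("kubernetes.default.svc.cluster.local"))
print(resolution_path("www.qunar.com"))
print(resolution_path("www.qunar.com", q_dnsmasq_up=False))
```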

Advantages of the Refactored Scheme

1) Isolated fault domains: a failure affects only the pods on that node, and those pods still have a secondary nameserver as backup.
2) Faster resolution: queries go to multiple CoreDNS or LocalDNS servers in parallel.
3) Higher cache hit rate: each node's DNS cache serves all pods on that node, reducing duplicate upstream lookups.
4) Simpler chain: two hops, with built-in fallback.
5) Consistent first-choice nameserver with concurrent upstream queries.
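The cache-hit advantage can be illustrated with a minimal TTL cache placed in front of an upstream resolver. This is a sketch of the caching idea only, not how dnsmasq stores records; the `TtlCache` class, the fake clock, the 30-second TTL, and the documentation address `192.0.2.10` are all hypothetical.

```python
class TtlCache:
    """Tiny per-node DNS cache: answers repeat lookups until the TTL expires."""

    def __init__(self, resolve, clock):
        self._resolve = resolve   # callable: name -> (answer, ttl_seconds)
        self._clock = clock       # callable: () -> current time in seconds
        self._entries = {}        # name -> (answer, expires_at)
        self.hits = 0
        self.misses = 0

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Missing or expired: ask the upstream and remember the reply.
        self.misses += 1
        answer, ttl = self._resolve(name)
        self._entries[name] = (answer, now + ttl)
        return answer


now = [0.0]            # fake clock so the example is deterministic
upstream_calls = []


def upstream(name):
    upstream_calls.append(name)
    return "192.0.2.10", 30


cache = TtlCache(upstream, clock=lambda: now[0])
for _ in range(5):                 # five pods on the node ask for the same name
    cache.lookup("www.qunar.com")
now[0] = 31.0                      # the record has expired
cache.lookup("www.qunar.com")

print(cache.hits, cache.misses, len(upstream_calls))  # 4 2 2
```

Six lookups from pods on the same node produce only two upstream queries, which is the effect item 3 describes.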

Why dnsmasq?

q-dnsmasq instances already run on the physical servers, where they offload DNS traffic. dnsmasq's --all-servers option sends each query to every configured upstream and returns the first reply, which improves both performance and reliability and meets the business requirement for fast, dependable responses.

Testing Records

The configuration files (/etc/q-dnsmasq.conf, /etc/dnsmasq.d/q-ns.conf, and /etc/q-kubedns.servers) and the kubelet configuration were verified first. The tests then scheduled pods onto the modified nodes, checked each pod's /etc/resolv.conf, resolved internal and external domains with dig, validated caching, and exercised failure scenarios in which q-dnsmasq is stopped.
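When checking a pod's /etc/resolv.conf, it helps to know what to expect. The sketch below renders the file a ClusterFirst pod would typically receive from a kubelet `clusterDNS` list with two entries. The IP addresses are hypothetical placeholders (the article does not give the real ones), `cluster.local` is assumed as the cluster domain, and search domains inherited from the host are left out for brevity.

```python
def render_pod_resolv_conf(cluster_dns, namespace, cluster_domain="cluster.local"):
    """Render the expected resolv.conf text for a ClusterFirst pod.

    `cluster_dns` is the ordered list of nameserver IPs from the kubelet
    configuration; the first entry is the one pods try first.
    """
    lines = [f"nameserver {ip}" for ip in cluster_dns]
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    lines.append("search " + " ".join(search))
    lines.append("options ndots:5")
    return "\n".join(lines) + "\n"


# Hypothetical addresses: node-local q-dnsmasq first, kube-dns Service second.
expected = render_pod_resolv_conf(["169.254.20.10", "10.96.0.10"], "default")
print(expected)
```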

Test Scenarios

1) Verify that pod DNS points to the node-local q-dnsmasq.
2) Confirm correct resolution of internal (kubernetes.default.svc) and external (www.qunar.com) domains.
3) Observe cache behavior with rapid successive queries.
4) Validate node-level resolution and fallback to LocalDNS when q-dnsmasq is down.
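Scenario 1 lends itself to an automated check. The following sketch parses the text of a pod's resolv.conf and reports whether the node-local cache is listed first and kube-dns is present as a backup. The function names and both IP addresses are hypothetical; in practice the text would come from the pod's /etc/resolv.conf.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver IPs in the order they appear."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":   # skip blanks and comment lines
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def check_pod_dns(resolv_conf_text, node_local_ip, kube_dns_ip):
    """Return a list of problems; an empty list means the layout is as expected."""
    servers = parse_nameservers(resolv_conf_text)
    problems = []
    if not servers or servers[0] != node_local_ip:
        problems.append("node-local q-dnsmasq is not the first nameserver")
    if kube_dns_ip not in servers[1:]:
        problems.append("kube-dns is not listed as a fallback nameserver")
    return problems


sample = """\
# written for the pod by kubelet
nameserver 169.254.20.10
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""
print(check_pod_dns(sample, "169.254.20.10", "10.96.0.10"))  # []
```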

Rollout Plan

The rollout consists of the following steps:

1) Configure the q-dnsmasq service scripts.
2) Update resolv.conf on nodes and pods.
3) Restart the affected services.
4) Deploy a CoreDNS DaemonSet in hostNetwork mode, listening on port xx53.
5) Scale down the existing CoreDNS Deployment.
6) Edit the kube-dns Service to point to the new port.

After migration, the architecture relies entirely on node-local DNS caching and the DaemonSet-based CoreDNS.

Summary and Future Work

The refactored DNS design improves reliability, latency, and cache efficiency in large-scale clusters. Future work targets cloud environments where DaemonSets cannot run. Options under consideration include packaging q-dnsmasq as a sidecar and using annotation-driven configuration for hybrid on-premises and cloud deployments.
