Why DNS Lookups Fail and How to Fix Them: A Complete Troubleshooting Guide
This guide explains the DNS resolution process, categorises common failure types, provides step‑by‑step troubleshooting procedures, essential commands, configuration examples for systemd‑resolved, BIND9, Unbound and CoreDNS, and offers best‑practice recommendations for reliable DNS operation in Linux and Kubernetes environments.
Overview
DNS resolves a name through a layered chain: application cache → OS cache → /etc/hosts → local resolver (as defined in /etc/resolv.conf) → root servers → TLD servers → authoritative servers. Any link can fail, causing connection time‑outs, service‑unavailable errors, or TLS verification failures.
Typical failure categories
Complete resolution failure : NXDOMAIN or SERVFAIL (expired domain, down authoritative server, network outage).
Incorrect results : Wrong IP due to cache poisoning, hijacking, or mis‑configured records.
High latency : Queries take seconds (overloaded resolver, long recursive chain, DNSSEC delays).
Intermittent failures : Sporadic time‑outs caused by partial resolver failures, UDP loss, or conntrack race conditions in Kubernetes.
Local configuration errors : Broken /etc/resolv.conf, /etc/nsswitch.conf, or conflicts between systemd‑resolved and traditional resolvers.
Core concepts
Record types : A, AAAA, CNAME, NS, SOA, PTR, MX, TXT, SRV.
TTL : Controls cache lifetime; too long delays propagation, too short increases load.
EDNS0 : Extends UDP payload up to 4096 bytes, required for DNSSEC and large responses.
Supported environment
Ubuntu 24.04 LTS or Rocky Linux 9.5 with kernel 6.12+
Tools: dnsutils (dig, nslookup), mtr‑tiny, tcpdump, whois, iputils‑ping Optional services: systemd‑resolved, BIND9, Unbound, CoreDNS, dnsmasq, Prometheus, Grafana
Diagnostic workflow
Install the toolkit
# Ubuntu
sudo apt update && sudo apt install -y dnsutils mtr-tiny tcpdump whois iputils-ping
# Rocky Linux
sudo dnf install -y bind-utils mtr tcpdump whois iputilsVerify current DNS configuration
# Show resolv.conf
cat /etc/resolv.conf
# Show name resolution order
grep ^hosts /etc/nsswitch.confKey fields in resolv.conf are nameserver, search, options ndots, options timeout, options attempts, options rotate, and options single-request-reopen.
Baseline performance test
for i in $(seq 1 10); do dig +noall +stats example.com | grep "Query time"; doneCached queries should be <5 ms; uncached <20‑100 ms.
Tool‑specific usage dig
# Basic query
dig example.com
# Specific types
dig example.com AAAA
dig example.com MX
# Query a particular server
dig @8.8.8.8 example.com
# Full trace
dig +trace example.com
# Short output
dig +short example.com
# DNSSEC validation
dig +dnssec example.com
# Disable DNSSEC (testing)
dig +cd example.com
# Force TCP
dig +tcp example.comnslookup / host
# Interactive nslookup
nslookup example.com
nslookup example.com 8.8.8.8
# Concise host output
host example.com
host -t MX example.com
host -a example.com
host -x 93.184.216.34tcpdump
# Capture all DNS traffic (UDP 53)
sudo tcpdump -i any port 53 -nn -l
# Save to file for later analysis
sudo tcpdump -i any port 53 -w /tmp/dns_capture.pcap -c 1000
# Capture only responses
sudo tcpdump -i any 'src port 53' -nn -lLayered troubleshooting
Local files : Check /etc/hosts for stray entries and correct permissions.
resolv.conf : Ensure it is not overwritten (NetworkManager, systemd‑resolved, DHCP). Use options timeout:2 attempts:2 rotate single-request-reopen for production.
systemd‑resolved :
systemctl status systemd-resolved
resolvectl status
resolvectl statistics
resolvectl flush-caches
journalctl -u systemd-resolved -fEnable debug logging via /etc/systemd/resolved.conf.d/debug.conf if needed.
BIND9 :
systemctl status named # or bind9
named-checkconf /etc/named.conf
named-checkzone example.com /var/named/example.com.zone
rndc flush
rndc stats
rndc recursingCoreDNS (Kubernetes) :
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100
kubectl get configmap coredns -o yaml
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup example.comDNS hijacking detection : Compare results from multiple public resolvers, use dig +tcp, or query a DoH endpoint with curl.
DNSSEC validation :
dig +dnssec example.com
dig example.com DNSKEY
dig example.com DS
delv example.com # BIND9 helperCommon failures are missing DS records, expired RRSIG, or unsynchronized system clock.
Common case studies and fixes
resolv.conf overwritten by NetworkManager : Disable NM DNS management with /etc/NetworkManager/conf.d/dns-none.conf (set dns=none) or set static DNS via
nmcli connection modify "eth0" ipv4.dns "10.0.0.2 10.0.0.3" ipv4.ignore-auto-dns yes.
5 s start‑up delay caused by AAAA time‑outs : Add options single-request-reopen to resolv.conf, disable IPv6 in the JVM ( -Djava.net.preferIPv4Stack=true) or globally via /etc/sysctl.d/99-disable-ipv6.conf, or ensure the authoritative server returns proper empty AAAA responses.
Split‑DNS conflict (internal vs. public) : Use systemd‑resolved split‑DNS configuration or dnsmasq to route corp.example.com to internal nameservers while keeping public fallback DNS for other domains.
Health‑check script (bash)
#!/bin/bash
# dns_health_check.sh – periodic health check
DNS_SERVERS=(10.0.0.2 10.0.0.3 8.8.8.8)
declare -A TEST_DOMAINS=(
[internal-api.corp.example.com]=10.0.1.50
[db-master.corp.example.com]=10.0.2.10
[example.com]=""
[google.com]=""
)
LATENCY_THRESHOLD=500
WEBHOOK_URL="https://example.com/webhook"
STATE_FILE="/tmp/dns_health_state"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
send_alert(){
local msg="$1"
curl -s -X POST "$WEBHOOK_URL" -H "Content-Type: application/json" \
-d "{\"msgtype\":\"text\",\"text\":{\"content\":\"[DNS Alert] $TIMESTAMP
$msg\"}}" >/dev/null 2>&1
}
check_dns(){
local server="$1" domain="$2" expected="$3"
local out=$(dig @${server} +time=2 +tries=1 +noall +answer +stats ${domain} 2>/dev/null)
local ip=$(echo "$out" | grep -E "^${domain}" | grep -oP "\d+\.\d+\.\d+\.\d+" | head -1)
local time=$(echo "$out" | grep "Query time" | grep -oP "\d+")
local err=""
[[ -z $ip ]] && err="Server $server cannot resolve $domain"
[[ -n $expected && $ip != $expected ]] && err="$err; expected $expected got $ip"
[[ -n $time && $time -gt $LATENCY_THRESHOLD ]] && err="$err; latency ${time}ms > $LATENCY_THRESHOLD"
[[ -n $err ]] && echo "$err" && return 1 || return 0
}
all_errors=""
total=0 failed=0
for s in "${DNS_SERVERS[@]}"; do
for d in "${!TEST_DOMAINS[@]}"; do
exp=${TEST_DOMAINS[$d]}
((total++))
if msg=$(check_dns $s $d $exp); then
echo "[$TIMESTAMP] OK: $s -> $d"
else
((failed++))
all_errors+="$msg
"
echo "[$TIMESTAMP] FAIL: $msg"
fi
done
done
echo "[$TIMESTAMP] Completed: $total checks, $failed failures"
if (( failed > 0 )); then
cur_hash=$(echo -e "$all_errors" | md5sum | awk '{print $1}')
last_hash=$(cat "$STATE_FILE" 2>/dev/null || echo "")
if [[ $cur_hash != $last_hash ]]; then
send_alert "Detected $failed DNS issues:
$all_errors"
echo "$cur_hash" > "$STATE_FILE"
fi
else
[[ -f "$STATE_FILE" ]] && { send_alert "DNS health restored – all $total checks passed"; rm -f "$STATE_FILE"; }
fiBest practices
High‑availability resolvers : Deploy at least two local recursive servers in different AZs, configure nameserver entries with options rotate timeout:2 attempts:2. Use Keepalived + Unbound/BIND9 for virtual IP failover.
Local cache optimization : Systemd‑resolved cache is limited; for larger caches use Unbound or dnsmasq. Warm the cache after a restart with a simple script that queries frequently used names.
Security hardening : Restrict recursion to internal networks, enable response‑rate limiting, hide server version, and prefer DNS‑over‑TLS/HTTPS for external queries.
Compatibility notes : glibc respects options single-request-reopen while musl does not; Docker containers inherit the host /etc/resolv.conf – avoid the 127.0.0.53 stub unless the container runs systemd‑resolved. Adjust ndots in Kubernetes to avoid extra internal lookups.
Monitoring and alerting
Key Prometheus metrics: query latency (< 50 ms normal), failure rate (< 0.1 % normal), cache hit ratio (> 80 %), TCP query ratio (< 5 %). Example Blackbox Exporter module for internal DNS:
modules:
dns_internal:
prober: dns
timeout: 5s
dns:
query_name: "internal-api.corp.example.com"
query_type: "A"
valid_rcodes: [NOERROR]
validate_answer_rrs:
fail_if_none_matches_regexp: [".*10\\.0\\.1\\.50.*"]
preferred_ip_protocol: "ip4"Prometheus scrape config (excerpt):
scrape_configs:
- job_name: 'dns_probe'
metrics_path: /probe
params:
module: [dns_internal]
static_configs:
- targets: ['10.0.0.2', '10.0.0.3', '8.8.8.8']
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 'blackbox-exporter:9115'Typical alert rules (simplified):
- alert: DNSResolutionSlow
expr: probe_dns_lookup_time_seconds > 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "DNS resolution latency high"
description: "Resolver {{ $labels.instance }} latency {{ $value }}s exceeds 200 ms"
- alert: DNSResolutionFailed
expr: probe_success{job="dns_probe"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "DNS resolution failure"
description: "Resolver {{ $labels.instance }} failed to resolve target for 2 minutes"Backup & restore
Backup script copies /etc/resolv.conf, /etc/nsswitch.conf, /etc/hosts, resolver configuration directories, and server configs (BIND9, Unbound, dnsmasq) into a timestamped archive and stores a baseline dig output for verification. Restoration simply extracts the archive and overwrites the original files followed by service restarts.
Key take‑aways
Diagnose DNS step‑by‑step: local files → resolv.conf → resolver service → recursive chain → authoritative server.
Use dig +trace to pinpoint the failing hop.
Enable options single-request-reopen to avoid the classic 5 s IPv6 timeout.
In Kubernetes, lower ndots or use FQDNs to reduce unnecessary internal lookups.
Deploy HA resolvers, tune cache TTLs, and monitor latency, failure rate, and cache hit ratio.
Ops Community
A leading IT operations community where professionals share and grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
