
Why DNS Lookups Fail and How to Fix Them: A Complete Troubleshooting Guide

This guide explains the DNS resolution process, categorises common failure types, provides step‑by‑step troubleshooting procedures, essential commands, configuration examples for systemd‑resolved, BIND9, Unbound and CoreDNS, and offers best‑practice recommendations for reliable DNS operation in Linux and Kubernetes environments.


Overview

DNS resolves a name through a layered chain: application cache → OS cache → /etc/hosts → local resolver (as defined in /etc/resolv.conf) → root servers → TLD servers → authoritative servers. Any link can fail, causing connection time‑outs, service‑unavailable errors, or TLS verification failures.
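The split between the NSS chain and a direct resolver query is easy to see on any Linux host; a quick illustration (the dig call is commented out because it needs network access):

```shell
# getent walks the full NSS chain from /etc/nsswitch.conf
# (/etc/hosts first, then DNS), so this answer never leaves the host:
getent hosts localhost
# dig skips /etc/hosts and nsswitch.conf entirely and asks the
# resolver listed in /etc/resolv.conf:
# dig +short example.com
```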

Typical failure categories

Complete resolution failure : NXDOMAIN or SERVFAIL (expired domain, down authoritative server, network outage).

Incorrect results : Wrong IP due to cache poisoning, hijacking, or mis‑configured records.

High latency : Queries take seconds (overloaded resolver, long recursive chain, DNSSEC delays).

Intermittent failures : Sporadic time‑outs caused by partial resolver failures, UDP loss, or conntrack race conditions in Kubernetes.

Local configuration errors : Broken /etc/resolv.conf, /etc/nsswitch.conf, or conflicts between systemd‑resolved and traditional resolvers.
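The `status:` field of a dig reply header usually tells you which category you are in. A small illustrative helper (not part of any standard tooling) mapping common rcodes to likely causes:

```shell
# classify_rcode: map the status token from a dig reply header to a
# likely cause; purely illustrative.
classify_rcode() {
  case "$1" in
    *NXDOMAIN*) echo "name does not exist (expired domain, typo, missing record)" ;;
    *SERVFAIL*) echo "resolver-side failure (upstream down, DNSSEC validation error)" ;;
    *REFUSED*)  echo "server refused the query (ACL or recursion disabled)" ;;
    *NOERROR*)  echo "resolution succeeded" ;;
    *)          echo "unrecognised status" ;;
  esac
}

# Typical use (needs network):
# classify_rcode "$(dig example.invalid | grep 'status:')"
```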

Core concepts

Record types : A, AAAA, CNAME, NS, SOA, PTR, MX, TXT, SRV.

TTL : Controls cache lifetime; too long delays propagation, too short increases load.

EDNS0 : Extends UDP payload up to 4096 bytes, required for DNSSEC and large responses.
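The TTL appears as the second field of every dig answer line, and on a caching resolver it counts down between repeated queries. Extracting it from a canned answer line:

```shell
# A dig answer line: name, remaining TTL, class, type, value
answer="example.com.  3600  IN  A  93.184.216.34"
ttl=$(echo "$answer" | awk '{print $2}')
echo "TTL: ${ttl}s"   # → TTL: 3600s
```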

Supported environment

Ubuntu 24.04 LTS or Rocky Linux 9.5 with kernel 6.12+

Tools: dnsutils (dig, nslookup), mtr-tiny, tcpdump, whois, iputils-ping

Optional services: systemd-resolved, BIND9, Unbound, CoreDNS, dnsmasq, Prometheus, Grafana

Diagnostic workflow

Install the toolkit

# Ubuntu
sudo apt update && sudo apt install -y dnsutils mtr-tiny tcpdump whois iputils-ping
# Rocky Linux
sudo dnf install -y bind-utils mtr tcpdump whois iputils

Verify current DNS configuration

# Show resolv.conf
cat /etc/resolv.conf
# Show name resolution order
grep ^hosts /etc/nsswitch.conf

Key fields in resolv.conf are nameserver, search, options ndots, options timeout, options attempts, options rotate, and options single-request-reopen.
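Put together, a production-oriented resolv.conf using these fields might look like the following; the addresses and search domain are placeholders:

```
# /etc/resolv.conf – illustrative values
nameserver 10.0.0.2            # primary local recursive resolver
nameserver 10.0.0.3            # secondary
search corp.example.com        # suffix appended to short names
options ndots:2                # dot threshold before trying the name as absolute
options timeout:2 attempts:2   # fail over to the next server quickly
options rotate                 # spread queries across the nameservers
options single-request-reopen  # work around parallel A/AAAA query issues
```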

Baseline performance test

for i in $(seq 1 10); do dig +noall +stats example.com | grep "Query time"; done

Cached queries should return in under 5 ms; uncached queries typically take 20–100 ms.

Tool‑specific usage

dig

# Basic query
dig example.com
# Specific record types
dig example.com AAAA
dig example.com MX
# Query a particular server
dig @8.8.8.8 example.com
# Full trace from the root
dig +trace example.com
# Short output
dig +short example.com
# DNSSEC validation
dig +dnssec example.com
# Disable DNSSEC validation (checking-disabled flag, for testing)
dig +cd example.com
# Force TCP
dig +tcp example.com

nslookup / host

# Interactive nslookup
nslookup example.com
nslookup example.com 8.8.8.8
# Concise host output
host example.com
host -t MX example.com
host -a example.com
# Reverse (PTR) lookup – host detects an IP argument automatically
host 93.184.216.34

tcpdump

# Capture all DNS traffic on port 53 (UDP and TCP)
sudo tcpdump -i any port 53 -nn -l
# Save to file for later analysis
sudo tcpdump -i any port 53 -w /tmp/dns_capture.pcap -c 1000
# Capture only responses
sudo tcpdump -i any 'src port 53' -nn -l

Layered troubleshooting

Local files : Check /etc/hosts for stray entries and correct permissions.

resolv.conf : Ensure it is not overwritten (NetworkManager, systemd‑resolved, DHCP). Use options timeout:2 attempts:2 rotate single-request-reopen for production.

systemd‑resolved :

systemctl status systemd-resolved
resolvectl status
resolvectl statistics
resolvectl flush-caches
journalctl -u systemd-resolved -f

Enable debug logging with resolvectl log-level debug if needed (revert with resolvectl log-level info).

BIND9 :

systemctl status named   # or bind9
named-checkconf /etc/named.conf
named-checkzone example.com /var/named/example.com.zone
rndc flush
rndc stats
rndc recursing

CoreDNS (Kubernetes) :

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100
kubectl -n kube-system get configmap coredns -o yaml
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup example.com

DNS hijacking detection : Compare results from multiple public resolvers, use dig +tcp, or query a DoH endpoint with curl.
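The comparison step can be sketched as a tiny helper; the resolver addresses and DoH endpoint in the comments are common public examples, not recommendations:

```shell
# compare_answers: disagreement between two resolvers suggests hijacking
# or split-horizon DNS rather than a plain outage (illustrative helper).
compare_answers() {
  [ "$1" = "$2" ] && echo "consistent" || echo "MISMATCH: [$1] vs [$2]"
}

# Gather answers to compare (needs network):
# a=$(dig @8.8.8.8 +short example.com | sort)
# b=$(dig @1.1.1.1 +short example.com | sort)
# compare_answers "$a" "$b"
# Cross-check over DoH with curl:
# curl -s -H 'accept: application/dns-json' \
#   'https://cloudflare-dns.com/dns-query?name=example.com&type=A'
```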

DNSSEC validation :

dig +dnssec example.com
dig example.com DNSKEY
dig example.com DS
delv example.com   # BIND9 helper

Common failures are missing DS records, expired RRSIG, or unsynchronized system clock.

Common case studies and fixes

resolv.conf overwritten by NetworkManager : Disable NM DNS management with /etc/NetworkManager/conf.d/dns-none.conf (set dns=none) or set static DNS via:

nmcli connection modify "eth0" ipv4.dns "10.0.0.2 10.0.0.3" ipv4.ignore-auto-dns yes

5 s start‑up delay caused by AAAA time‑outs : Add options single-request-reopen to resolv.conf, disable IPv6 in the JVM ( -Djava.net.preferIPv4Stack=true) or globally via /etc/sysctl.d/99-disable-ipv6.conf, or ensure the authoritative server returns proper empty AAAA responses.
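If disabling IPv6 system-wide is genuinely acceptable in your environment, the sysctl drop-in mentioned above could contain (apply with sudo sysctl --system):

```
# /etc/sysctl.d/99-disable-ipv6.conf – only where IPv6 is truly unused
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
```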

Split‑DNS conflict (internal vs. public) : Use systemd‑resolved split‑DNS configuration or dnsmasq to route corp.example.com to internal nameservers while keeping public fallback DNS for other domains.
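With systemd-resolved, one way to express this split is a route-only domain (the ~ prefix) in a drop-in; the path and addresses below are illustrative:

```
# /etc/systemd/resolved.conf.d/split-dns.conf (illustrative)
[Resolve]
DNS=10.0.0.2 10.0.0.3
# "~" marks a route-only domain: queries under corp.example.com go to
# the servers above; everything else uses per-link or fallback DNS.
Domains=~corp.example.com
```

Restart with systemctl restart systemd-resolved and verify the routing with resolvectl status.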

Health‑check script (bash)

#!/bin/bash
# dns_health_check.sh – periodic health check
DNS_SERVERS=(10.0.0.2 10.0.0.3 8.8.8.8)
declare -A TEST_DOMAINS=(
  [internal-api.corp.example.com]=10.0.1.50
  [db-master.corp.example.com]=10.0.2.10
  [example.com]=""
  [google.com]=""
)
LATENCY_THRESHOLD=500
WEBHOOK_URL="https://example.com/webhook"
STATE_FILE="/tmp/dns_health_state"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

send_alert(){
  local msg="$1"
  # Escape literal newlines so the JSON payload stays valid
  msg=${msg//$'\n'/\\n}
  curl -s -X POST "$WEBHOOK_URL" -H "Content-Type: application/json" \
    -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"[DNS Alert] $TIMESTAMP\\n$msg\"}}" >/dev/null 2>&1
}

check_dns(){
  local server="$1" domain="$2" expected="$3"
  local out=$(dig @${server} +time=2 +tries=1 +noall +answer +stats ${domain} 2>/dev/null)
  local ip=$(echo "$out" | grep -E "^${domain}" | grep -oP "\d+\.\d+\.\d+\.\d+" | head -1)
  local time=$(echo "$out" | grep "Query time" | grep -oP "\d+")
  local err=""
  [[ -z $ip ]] && err="Server $server cannot resolve $domain"
  [[ -n $expected && $ip != $expected ]] && err="$err; expected $expected got $ip"
  [[ -n $time && $time -gt $LATENCY_THRESHOLD ]] && err="$err; latency ${time}ms > $LATENCY_THRESHOLD"
  [[ -n $err ]] && echo "$err" && return 1 || return 0
}

all_errors=""
total=0 failed=0
for s in "${DNS_SERVERS[@]}"; do
  for d in "${!TEST_DOMAINS[@]}"; do
    exp=${TEST_DOMAINS[$d]}
    ((total++))
    if msg=$(check_dns "$s" "$d" "$exp"); then
      echo "[$TIMESTAMP] OK: $s -> $d"
    else
      ((failed++))
      all_errors+="$msg
"
      echo "[$TIMESTAMP] FAIL: $msg"
    fi
  done
done

echo "[$TIMESTAMP] Completed: $total checks, $failed failures"
if (( failed > 0 )); then
  cur_hash=$(echo -e "$all_errors" | md5sum | awk '{print $1}')
  last_hash=$(cat "$STATE_FILE" 2>/dev/null || echo "")
  if [[ $cur_hash != $last_hash ]]; then
    send_alert "Detected $failed DNS issues:
$all_errors"
    echo "$cur_hash" > "$STATE_FILE"
  fi
else
  [[ -f "$STATE_FILE" ]] && { send_alert "DNS health restored – all $total checks passed"; rm -f "$STATE_FILE"; }
fi

Best practices

High‑availability resolvers : Deploy at least two local recursive servers in different AZs, configure nameserver entries with options rotate timeout:2 attempts:2. Use Keepalived + Unbound/BIND9 for virtual IP failover.

Local cache optimization : Systemd‑resolved cache is limited; for larger caches use Unbound or dnsmasq. Warm the cache after a restart with a simple script that queries frequently used names.
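A warm-up pass can be as simple as resolving a list of frequently used names through the NSS chain after a restart; a sketch, where the list path is an assumption:

```shell
#!/bin/bash
# warm_dns_cache.sh – hypothetical cache-warming sketch
# Resolving through getent goes via NSS, so any local cache
# (systemd-resolved, nscd, dnsmasq) is populated along the way.
warm_cache() {
  local list="$1"            # file with one hostname per line
  local name
  while IFS= read -r name; do
    [ -z "$name" ] && continue               # skip blank lines
    getent hosts "$name" >/dev/null || true  # ignore names that fail
  done < "$list"
}

# Example: warm_cache /etc/dns-warm-list
```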

Security hardening : Restrict recursion to internal networks, enable response‑rate limiting, hide server version, and prefer DNS‑over‑TLS/HTTPS for external queries.

Compatibility notes : glibc respects options single-request-reopen while musl does not; Docker containers inherit the host /etc/resolv.conf – avoid the 127.0.0.53 stub unless the container runs systemd‑resolved. Adjust ndots in Kubernetes to avoid extra internal lookups.
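In Kubernetes, the ndots adjustment can be made per pod through dnsConfig; a hypothetical Pod spec excerpt:

```yaml
# Lower ndots so external FQDNs are tried as absolute names first,
# instead of cycling through every search-domain suffix (default ndots:5).
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```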

Monitoring and alerting

Key Prometheus metrics: query latency (< 50 ms normal), failure rate (< 0.1 % normal), cache hit ratio (> 80 %), TCP query ratio (< 5 %). Example Blackbox Exporter module for internal DNS:

modules:
  dns_internal:
    prober: dns
    timeout: 5s
    dns:
      query_name: "internal-api.corp.example.com"
      query_type: "A"
      valid_rcodes: [NOERROR]
      validate_answer_rrs:
        fail_if_none_matches_regexp: [".*10\\.0\\.1\\.50.*"]
      preferred_ip_protocol: "ip4"

Prometheus scrape config (excerpt):

scrape_configs:
  - job_name: 'dns_probe'
    metrics_path: /probe
    params:
      module: [dns_internal]
    static_configs:
      - targets: ['10.0.0.2', '10.0.0.3', '8.8.8.8']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'

Typical alert rules (simplified):

- alert: DNSResolutionSlow
  expr: probe_dns_lookup_time_seconds > 0.2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DNS resolution latency high"
    description: "Resolver {{ $labels.instance }} latency {{ $value }}s exceeds 200 ms"
- alert: DNSResolutionFailed
  expr: probe_success{job="dns_probe"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "DNS resolution failure"
    description: "Resolver {{ $labels.instance }} failed to resolve target for 2 minutes"

Backup & restore

Backup script copies /etc/resolv.conf, /etc/nsswitch.conf, /etc/hosts, resolver configuration directories, and server configs (BIND9, Unbound, dnsmasq) into a timestamped archive and stores a baseline dig output for verification. Restoration simply extracts the archive and overwrites the original files followed by service restarts.
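A minimal sketch of such a backup, assuming standard glibc paths (server config locations are archived only if present, and the dig baseline is skipped when dig is unavailable):

```shell
#!/bin/bash
# dns_backup.sh – sketch of the backup flow described above
backup_dns_config() {
  local dest="${1:-/tmp}"
  local stamp; stamp=$(date +%Y%m%d-%H%M%S)
  local archive="$dest/dns-config-$stamp.tar.gz"
  # Core resolver files plus server config locations, if they exist
  local candidates=(/etc/resolv.conf /etc/nsswitch.conf /etc/hosts
                    /etc/systemd/resolved.conf /etc/bind /etc/unbound
                    /etc/dnsmasq.conf)
  local existing=() f
  for f in "${candidates[@]}"; do [ -e "$f" ] && existing+=("$f"); done
  tar -czf "$archive" "${existing[@]}" 2>/dev/null
  # Baseline dig output for post-restore verification
  command -v dig >/dev/null &&
    dig +time=1 +tries=1 +noall +answer example.com \
      > "$dest/dns-baseline-$stamp.txt" 2>/dev/null
  echo "$archive"   # restore = extract over / and restart services
}
```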

Key take‑aways

Diagnose DNS step‑by‑step: local files → resolv.conf → resolver service → recursive chain → authoritative server.

Use dig +trace to pinpoint the failing hop.

Enable options single-request-reopen to avoid the classic 5 s IPv6 timeout.

In Kubernetes, lower ndots or use FQDNs to reduce unnecessary internal lookups.

Deploy HA resolvers, tune cache TTLs, and monitor latency, failure rate, and cache hit ratio.

Tags: Kubernetes, Linux, Troubleshooting, DNS, systemd-resolved
Written by

Ops Community

A leading IT operations community where professionals share and grow together.
