Troubleshooting DNS Latency After Machine Replacement in a Go Service
This article walks through a step-by-step investigation into why HTTP request latency increased after a Go-based service was moved to new hardware. It covers DNS resolution delays, the role of dnsmasq, Go's resolver implementation, and the experiments run while fixing the issue.
Recently I encountered a DNS‑related issue where a Go service, after being migrated to new physical machines, showed a significant increase in HTTP request latency (about 1.5× slower). This post shares the entire troubleshooting process.
Background
The service runs on physical machines because it is a foundational component with specific deployment constraints. The old machine was out of warranty, prompting a hardware replacement.
After migration, the HTTP request time increased noticeably, as shown by monitoring graphs comparing the old (yellow) and new (blue) machines.
Problem Investigation
Initial checks of CPU, network, and other system metrics showed the new machine performed better than the old one. Conversations with the hardware provider revealed no obvious hardware issues.
To pinpoint the slowdown, I used the following curl command to measure DNS lookup, connection, and total request times on both machines:
curl -o /dev/null -s -w '%{time_namelookup}::%{time_connect}::%{time_total}' http://www.baidu.com
New machine: 0.001484::0.001743::0.007489
Old machine: 0.000681::0.000912::0.002475
DNS lookup was slower on the new machine, but only by about 0.8 ms, which does not by itself account for the roughly 5 ms gap in total request time. To check whether DNS was still involved, I ran dig www.baidu.com, which exhibited noticeable latency and pointed to a DNS problem.
Problem Resolution
Research pointed to missing DNS caching. The network team suggested installing dnsmasq or adjusting /etc/resolv.conf. The original /etc/resolv.conf pointed to 127.0.0.1, but no DNS service was running locally.
Removing the 127.0.0.1 entry alone did not improve latency. After installing dnsmasq and keeping the local entry, the request time dropped significantly, indicating that the lack of a local DNS cache was the root cause.
Reflection
Open questions remain: why was the local DNS server configured without the service running, and why did removing the entry appear ineffective?
Go's DNS Resolution Process
Go supports two resolution methods: cgo (using the system's C library) and a pure‑Go implementation. The resolver chooses the method based on configuration and platform.
Key parts of the Go source (lookup_unix.go) show the decision flow:
func (r *Resolver) lookupIP(ctx context.Context, network, host string) (addrs []IPAddr, err error) {
	// ① Force pure‑Go resolver if needed
	if r.preferGo() {
		return r.goLookupIP(ctx, host)
	}
	// ② Determine lookup order
	order := systemConf().hostLookupOrder(r, host)
	if order == hostLookupCgo {
		if addrs, err, ok := cgoLookupIP(ctx, network, host); ok {
			return addrs, err
		}
		// Fallback to Go resolver
		order = hostLookupFilesDNS
	}
	ips, _, err := r.goLookupIPCNAMEOrder(ctx, host, order)
	return ips, err
}
The resolver re-reads /etc/resolv.conf lazily: a lookup triggers a re-check only if at least 5 seconds have passed since the previous check, as shown in tryUpdate (simplified here):
func (conf *resolverConfig) tryUpdate(name string) {
	conf.initOnce.Do(conf.init)
	now := time.Now()
	// Skip the check if the config was checked less than 5 seconds ago.
	if conf.lastChecked.After(now.Add(-5 * time.Second)) {
		return
	}
	conf.lastChecked = now
	// Re-read the file and swap in the new config under the lock.
	dnsConf := dnsReadConfig(name)
	conf.mu.Lock()
	conf.dnsConfig = dnsConf
	conf.mu.Unlock()
}
The resolver reads /etc/hosts, parses /etc/resolv.conf, constructs DNS queries, and retries across the configured name servers, with round-robin selection when the rotate option is enabled.
Hypotheses Tested
Hypothesis 1: Go reads /etc/resolv.conf only at startup. Experiments showed the resolver updates the config lazily, disproving this hypothesis.
Hypothesis 2: Remote DNS queries are inherently slower than local cache. Benchmarks with and without dnsmasq showed comparable times, indicating no significant difference.
Hypothesis 3: High concurrency causes lock contention. A 100‑goroutine test across three environments (no local DNS, local DNS with dnsmasq, local DNS without dnsmasq) yielded similar latencies, suggesting concurrency is not the issue.
Conclusion
The slowdown was caused by the absence of a local DNS cache (dnsmasq) on the new machine. Installing dnsmasq resolved the latency problem, though the exact reason why the DNS lookup appeared slow initially remains unclear. Readers are invited to discuss further troubleshooting methods.
IT Services Circle
