Troubleshooting DNS Latency After Machine Replacement in a Go Service

This article walks step by step through an investigation of why HTTP request latency increased after a Go‑based service was moved to new hardware. It covers DNS resolution delays, the role of dnsmasq, Go's resolver implementation, and the experiments that led to the fix.

IT Services Circle

Recently I encountered a DNS‑related issue where a Go service, after being migrated to new physical machines, showed a significant increase in HTTP request latency (about 1.5× slower). This post shares the entire troubleshooting process.

Background

The service runs on physical machines because it is a foundational component with specific deployment constraints. The old machine was out of warranty, prompting a hardware replacement.

After migration, the HTTP request time increased noticeably, as shown by monitoring graphs comparing the old (yellow) and new (blue) machines.

Problem Investigation

Initial checks of CPU, network, and other system metrics showed the new machine performed better than the old one. Conversations with the hardware provider revealed no obvious hardware issues.

To pinpoint the slowdown, I used

curl -o /dev/null -s -w '%{time_namelookup}::%{time_connect}::%{time_total}' http://www.baidu.com

to measure DNS lookup, connection, and total request times on both machines.

New machine: 0.001484::0.001743::0.007489
Old machine: 0.000681::0.000912::0.002475

The values are in seconds, in the order DNS lookup, connect, total. DNS lookup was about twice as slow on the new machine, but the absolute difference (under a millisecond) was too small to account for the gap in total request time. To check DNS more directly, I ran dig www.baidu.com, which exhibited noticeable latency and pointed to a DNS problem.

Problem Resolution

Research pointed to missing DNS caching. The network team suggested installing dnsmasq or adjusting /etc/resolv.conf. The original /etc/resolv.conf pointed to 127.0.0.1, but no DNS service was running locally.

Removing the 127.0.0.1 entry alone did not improve latency. After installing dnsmasq and keeping the local entry, the request time dropped significantly, indicating that the lack of a local DNS cache was the root cause.

Reflection

Open questions remain: why was the local DNS server configured without the service running, and why did removing the entry appear ineffective?

Go's DNS Resolution Process

Go supports two resolution methods: cgo (using the system's C library) and a pure‑Go implementation. The resolver chooses the method based on configuration and platform.

Key parts of the Go source (lookup_unix.go) show the decision flow:

func (r *Resolver) lookupIP(ctx context.Context, network, host string) (addrs []IPAddr, err error) {
    // ① Force pure‑Go resolver if needed
    if r.preferGo() {
        return r.goLookupIP(ctx, host)
    }
    // ② Determine lookup order
    order := systemConf().hostLookupOrder(r, host)
    if order == hostLookupCgo {
        if addrs, err, ok := cgoLookupIP(ctx, network, host); ok {
            return addrs, err
        }
        // Fallback to Go resolver
        order = hostLookupFilesDNS
    }
    ips, _, err := r.goLookupIPCNAMEOrder(ctx, host, order)
    return ips, err
}

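The resolver does not poll /etc/resolv.conf on a timer. Instead, each lookup calls tryUpdate, which returns immediately if the file was checked within the last 5 seconds and otherwise reloads it, so configuration changes are picked up lazily: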

func (conf *resolverConfig) tryUpdate(name string) {
    conf.initOnce.Do(conf.init)
    now := time.Now()
    if conf.lastChecked.After(now.Add(-5 * time.Second)) {
        return
    }
    conf.lastChecked = now
    dnsConf := dnsReadConfig(name)
    conf.mu.Lock()
    conf.dnsConfig = dnsConf
    conf.mu.Unlock()
}

The resolver reads /etc/hosts, parses /etc/resolv.conf, constructs DNS queries, and performs retries with round‑robin server selection.

Hypotheses Tested

Hypothesis 1: Go reads /etc/resolv.conf only at startup. Experiments showed the resolver updates the config lazily, disproving this hypothesis.

Hypothesis 2: Remote DNS queries are inherently slower than queries to a local cache. Benchmarks with and without dnsmasq showed comparable times, indicating no significant difference.

Hypothesis 3: High concurrency causes lock contention. A 100‑goroutine test across three environments (no local DNS, local DNS with dnsmasq, local DNS without dnsmasq) yielded similar latencies, suggesting concurrency is not the issue.

Conclusion

The slowdown was caused by the absence of a local DNS cache (dnsmasq) on the new machine. Installing dnsmasq resolved the latency problem, though the exact reason why the DNS lookup appeared slow initially remains unclear. Readers are invited to discuss further troubleshooting methods.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
