Why a Missed DNS Renewal Shut Down Our Site—and How We Fixed It

A detailed post‑mortem recounts how a forgotten domain renewal caused a DNS outage, the frantic troubleshooting steps across teams, temporary work‑arounds like switching to Google DNS, and the lessons learned for future incident management.

Efficient Ops

1. Event Background

Time: A Saturday in 2015, around 5 AM.

Incident: Some users reported that the company website was inaccessible while others could still reach it; customer service initially blamed the users' own network connections.

Impact: By 8 AM more users were affected and the mobile app also failed to load, prompting an emergency response.

2. Analysis and Positioning

After being woken up, the on‑call engineer reviewed the most recent deployments: a new module, a bug fix, and an HTTPS configuration change. None of them should have affected availability.

Initial checks of web and database servers, logs, and monitoring showed everything normal. A ping to the domain failed, suggesting a network problem.

External testing confirmed the site was reachable from other networks, indicating the service itself was fine but DNS resolution was failing for many users.
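The distinction drawn above, between a name that will not resolve and a server that will not answer, can be checked directly from an affected machine. The following is a minimal sketch, not the team's actual tooling: `classify` is an illustrative helper, and the hostname and known-good IP are supplied by the caller. The `resolve` and `connect` parameters are there only so the logic can be exercised without a network.

```python
import socket


def classify(host, known_ip, port=443, timeout=3.0,
             resolve=socket.gethostbyname,
             connect=socket.create_connection):
    """Return 'ok', 'dns_failure', or 'server_unreachable'.

    `host` and `known_ip` are placeholders chosen by the caller.
    """
    try:
        resolve(host)
        dns_ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        dns_ok = False

    try:
        # Connecting by IP bypasses DNS entirely.
        conn = connect((known_ip, port), timeout)
        conn.close()
        server_ok = True
    except OSError:
        server_ok = False

    if not server_ok:
        return "server_unreachable"
    return "ok" if dns_ok else "dns_failure"
```

A result of "dns_failure" matches the pattern in this incident: the server answers on its IP address, but the name cannot be resolved from the affected network.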

3. Tackling the Problem

By 10 AM the team suspected DNS issues, especially after discovering that only China Unicom users could not resolve the domain.

The team opened multiple tickets with the domain registrar (Wanwang) and made dozens of calls to Unicom. They also ran ipconfig /flushdns, which clears only the local Windows resolver cache.

Testing tools like 17ce and 360 QiYunCe were employed to monitor accessibility across ISPs and regions.

It was later found that the domain had fallen into a delinquent status due to an unpaid renewal, causing the registrar to suspend DNS resolution.
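A lapsed renewal like this can be caught ahead of time by checking the domain's expiry date on a schedule. The sketch below assumes the registry's WHOIS reply contains a line of the form "Registry Expiry Date: 2015-06-30T04:00:00Z"; that format varies between registries, so the pattern is an assumption to adjust. The WHOIS server name is left to the caller, and both function names are illustrative.

```python
import re
import socket
from datetime import datetime, timezone

# Assumed line format; other registries label and format this differently.
EXPIRY_RE = re.compile(r"^\s*Registry Expiry Date:\s*(\S+)", re.I | re.M)


def fetch_whois(domain, server, timeout=10.0):
    """Send a plain WHOIS query (TCP port 43) and return the raw reply."""
    with socket.create_connection((server, 43), timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def days_until_expiry(whois_text, now=None):
    """Return whole days until expiry, or None if no expiry line is found."""
    match = EXPIRY_RE.search(whois_text)
    if match is None:
        return None
    expiry = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    expiry = expiry.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days
```

A scheduled job could call `days_until_expiry(fetch_whois(domain, server))` and raise an alert when the result drops below a chosen threshold, for example 30 days, or when it returns None because the reply could not be parsed.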

4. Temporary Solutions

Switching local DNS to Google’s 8.8.8.8 restored access for most users. The team also released a temporary client version that used the server’s IP address instead of the domain.

For iOS users, manual DNS configuration was suggested; Android users could simply reinstall the app.
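The temporary client described above replaced the domain with the server's IP address. A gentler variant keeps normal DNS as the first choice and uses a pinned address only when resolution fails. This is a sketch under stated assumptions: the hostname, the IP address, and the `resolve_with_fallback` helper are all hypothetical and not taken from the incident.

```python
import socket

# Hypothetical values for illustration; not taken from the incident.
FALLBACK_IPS = {"www.example.com": "203.0.113.10"}


def resolve_with_fallback(host, resolve=socket.gethostbyname):
    """Return (ip, source): normal DNS first, then the pinned address."""
    try:
        return resolve(host), "dns"
    except OSError:
        if host in FALLBACK_IPS:
            return FALLBACK_IPS[host], "fallback"
        raise  # no pinned address for this host; surface the DNS error
```

Two caveats apply to any IP-based fallback. Over HTTPS the client still has to present the original hostname for certificate validation, and a pinned address goes stale as soon as the server moves, so it suits a short-lived emergency build better than a permanent design.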

5. DNS Resolution Process

DNS (Domain Name System) translates human‑readable domain names to IP addresses through a hierarchical lookup involving browser cache, hosts file, local resolver cache, local DNS server, root servers, TLD servers, and authoritative name servers.

Step 1: Browser cache check.

Step 2: Hosts file lookup.

Step 3: Local DNS resolver cache.

Step 4: Query the configured local DNS server.

Step 5: If the local DNS server already has the record cached, it returns a non‑authoritative answer.

Step 6: If not cached, the local server queries root servers, then TLD servers, then authoritative servers.

Step 7: The answer propagates back to the client.
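The lookup order in the steps above can be modeled in a few lines. This is a toy simulation rather than a resolver: plain dicts stand in for each cache, one dict stands in for the whole root, TLD, and authoritative chain, and TTL expiry is ignored. All names are illustrative.

```python
def lookup(name, browser_cache, hosts_file, resolver_cache, local_dns):
    """Walk the client-side lookup order (steps 1 to 4).

    Returns (ip, source). `local_dns` is any callable that returns an IP
    address or None.
    """
    for source, table in (("browser cache", browser_cache),
                          ("hosts file", hosts_file),
                          ("resolver cache", resolver_cache)):
        if name in table:
            return table[name], source
    ip = local_dns(name)
    if ip is None:
        return None, "unresolved"
    resolver_cache[name] = ip  # remember the answer for next time
    return ip, "local DNS server"


def make_local_dns(cache, authoritative):
    """Model steps 5 and 6: answer from cache, else walk the hierarchy."""
    def query(name):
        if name in cache:
            return cache[name]        # non-authoritative answer
        ip = authoritative.get(name)  # stands in for root -> TLD -> authoritative
        if ip is not None:
            cache[name] = ip
        return ip
    return query
```

In this model, removing a name from the `authoritative` dict does not break clients or resolvers that already hold the answer in a cache. Real caches behave the same way until the record's TTL expires, which is one reason a DNS failure can appear gradually and unevenly.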

6. Lessons Learned

1. Process management gaps – inadequate handover when staff left.

2. Immature crisis handling – damage to company reputation.

3. Incomplete monitoring – lack of alerts for DNS resolution failures.
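The third gap, missing alerts for DNS resolution failures, can be narrowed with a probe that asks several resolvers for the site's A record and reports the ones that fail. The sketch below uses only the standard library and builds the query by hand in the RFC 1035 wire format. The function names are illustrative, the query ID is fixed for simplicity, and a production probe would also verify that the reply ID matches the query.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)  # RD flag set
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def parse_header(reply):
    """Return (rcode, answer_count) from a DNS reply header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return flags & 0x000F, ancount


def check_resolver(name, resolver_ip, timeout=3.0):
    """Return True if `resolver_ip` answers with at least one record."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (resolver_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts
            return False
    if len(reply) < 12:
        return False
    rcode, ancount = parse_header(reply)
    return rcode == 0 and ancount > 0


def failing_resolvers(name, resolver_ips, check=check_resolver):
    """List the resolvers that cannot resolve `name`; alert if non-empty."""
    return [ip for ip in resolver_ips if not check(name, ip)]
```

A call such as `failing_resolvers("www.example.com", ["8.8.8.8"])` covers public resolvers only. Catching an ISP-specific failure like the one in this incident would require running the probe from inside each ISP's network, or against resolvers that those ISPs expose.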

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
