Why a Missed DNS Renewal Shut Down Our Site—and How We Fixed It
A detailed post‑mortem recounts how a forgotten domain renewal caused a DNS outage, the frantic troubleshooting steps across teams, temporary work‑arounds like switching to Google DNS, and the lessons learned for future incident management.
1. Event Background
Time: Early Saturday morning, 5 AM, 2015.
Incident: Users reported the company website was inaccessible, while some could still reach it; customer service initially blamed network issues.
Impact: By 8 AM, more users and the mobile app also failed to load, prompting an emergency response.
2. Analysis and Positioning
After being awakened, the on‑call engineer reviewed recent deployments: a new module, a bug fix, and an HTTPS configuration change, none of which should affect availability.
Initial checks of web and database servers, logs, and monitoring showed everything normal. A ping to the domain failed, suggesting a network problem.
External testing confirmed the site was reachable from other networks, indicating the service itself was fine but DNS resolution was failing for many users.
3. Tackling the Problem
By 10 AM the team suspected DNS issues, especially after discovering that only China Unicom users could not resolve the domain.
Multiple tickets were opened with the domain registrar (Wanwang) and dozens of calls were made to Unicom. The team also used ipconfig /flushdns to clear local caches.
Testing tools like 17ce and 360 QiYunCe were employed to monitor accessibility across ISPs and regions.
It was later found that the domain had fallen into a delinquent status due to an unpaid renewal, causing the registrar to suspend DNS resolution.
4. Temporary Solutions
Switching local DNS to Google’s 8.8.8.8 restored access for most users. The team also released a temporary client version that used the server’s IP address instead of the domain.
For iOS users, manual DNS configuration was suggested; Android users could simply reinstall the app.
5. DNS Resolution Process
DNS (Domain Name System) translates human‑readable domain names to IP addresses through a hierarchical lookup involving browser cache, hosts file, local resolver cache, local DNS server, root servers, TLD servers, and authoritative name servers.
Step 1: Browser cache check.
Step 2: Hosts file lookup.
Step 3: Local DNS resolver cache.
Step 4: Query the configured local DNS server.
Step 5: If cached, return non‑authoritative answer.
Step 6: If not cached, the local server queries root servers, then TLD servers, then authoritative servers.
Step 7: The answer propagates back to the client.
6. Lessons Learned
1. Process management gaps – inadequate handover when staff left.
2. Immature crisis handling – damage to company reputation.
3. Incomplete monitoring – lack of alerts for DNS resolution failures.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
