How We Resolved a Sudden DNS Outage That Took Down Our Website and App
When a Saturday early-morning outage left the company’s website and mobile app inaccessible for many users, the team traced the issue to an unpaid domain causing DNS resolution failures, detailed the investigation steps, temporary fixes, and lessons learned about DNS processes and operational readiness.
Incident Overview
In the early hours of a Saturday, users reported that the public website and mobile application were unreachable. Initial checks on the support representative’s network showed no problem, but the issue quickly spread, especially among users of China Unicom.
Investigation Timeline
At 09:00 the support team convened. Recent deployments (new module, bug fixes, HTTPS configuration) were reviewed and ruled out. Network and operations staff confirmed that web servers, database servers, and monitoring systems were all operational.
Ping tests from the office showed that the domain name could not be resolved, while direct IP access worked, indicating a DNS resolution failure. External checks from public networks reproduced the same symptom.
Root Cause Identification
The domain registrar (Wanwang) had placed the domain in an unpaid status overnight, causing the registrar to suspend DNS services for the domain. Because DNS caches at ISPs update at different times, some users could still resolve the domain while others could not.
The registrar confirmed that full DNS service would be restored within 48 hours after payment, but an immediate fix was required.
Temporary Mitigation
Changed local DNS resolvers to Google Public DNS (8.8.8.8), which restored access for many users.
Released a quick mobile‑app build that used the server’s IP address instead of the domain name.
Provided iOS users with instructions to change their device DNS settings.
Used external monitoring tools (17ce and 360 QiYun) to track DNS propagation across ISPs, confirming that the outage persisted mainly for China Unicom in Beijing.
Executed ipconfig /flushdns on Windows workstations and cleared local DNS caches to eliminate stale records.
Final Resolution
After the domain fee was paid, the registrar re‑enabled DNS. Propagation completed over the next day, and by the following Monday the website and app were reachable for all users.
DNS Resolution Process Explained
DNS translates human‑readable domain names into IP addresses through a hierarchical lookup:
The browser checks its own cache for a stored IP address.
If not cached, the operating system checks the local hosts file.
The local DNS resolver cache is consulted next.
If still unresolved, the query is sent to the configured local DNS server.
The local DNS server returns a cached answer if it has one.
Otherwise it queries the root servers, then the top‑level domain (TLD) servers, and finally the authoritative name servers until the IP address is obtained.
If the local DNS server is set to forward queries, it passes the request to an upstream DNS server, which repeats the process.
This multi‑step process can cause delays when a domain’s DNS records are suspended or not yet propagated.
Lessons Learned
Process gaps: Inadequate handover when staff left resulted in a missed domain renewal.
Crisis response: Lack of a mature incident‑handling procedure affected the company’s reputation.
Monitoring deficiency: Absence of proactive DNS health checks allowed the outage to go unnoticed until users reported it.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
