How a Missed Domain Renewal Crashed Our Site for 2 Hours – Full DNS Outage Postmortem
At 3:07 AM on August 15 2025 a critical alert indicated the entire site was inaccessible, leading to a 2‑hour, 500 k‑user outage caused by an expired domain that entered serverHold status, and this postmortem details the detection, root‑cause analysis, emergency recovery steps, and long‑term remediation measures.
Overview
On August 15, 2025 at 03:07 the on‑call engineer was woken by an urgent call: the whole site was down and users could not access any service. Initial health checks showed all internal services healthy, but external probes were red, pointing to a DNS‑related problem.
Impact
Duration: 2 hours 17 minutes
Scope: All external services unavailable
Users affected: ~500 000
Direct loss: ~800 000 (estimated from GMV)
Indirect loss: Brand reputation, user complaints
Timeline
03:07 - Alert received, abnormality discovered
03:12 - Started investigation, services appear normal
03:18 - DNS resolution issue identified
03:25 - Attempted DNS provider switch
03:35 - Discovered domain had expired
03:40 - Contacted registrar for emergency renewal
04:15 - Domain renewal succeeded
04:50 - DNS records restored
05:24 - All services back to normal
05:30 - Incident closed, post‑mortem writtenDetection and Initial Investigation
Alert Trigger
[CRITICAL] HTTP probe failed - https://www.example.com - Connection timeout
[CRITICAL] HTTP probe failed - https://api.example.com - Connection timeout
[CRITICAL] HTTP probe failed - https://m.example.com - Connection timeout
[WARNING] External availability dropped to 0%All pods in the gateway and production namespaces were checked with kubectl get pods and showed no issues.
Network Layer Check
curl -v https://www.example.com
# * Could not resolve host: www.example.com
nslookup www.example.com 8.8.8.8
# server can't find www.example.com: SERVFAIL
nslookup www.example.com 1.1.1.1
# server can't find www.example.com: SERVFAILBoth public DNS resolvers failed, confirming the problem was with the domain itself.
Deep DNS Diagnosis
dig www.example.com +trace
# (simplified output showing root, .com, and example.com NS records)
# connection timed out; no servers could be reached
# Authoritative DNS servers were unreachable
dig example.com NS
dig @ns1.dnsprovider.com www.example.com
# connection timed out; no servers could be reachedThe authoritative name servers could not be reached.
Root‑Cause Identification
First Hypothesis – DNS Provider Outage
curl -s https://status.dnsprovider.com | grep -i "operational"
# All Systems OperationalThe provider status page was normal, so the issue lay elsewhere.
Second Hypothesis – DNS Records Tampered
Checking the DNS management console showed all records correct and no recent changes, but the domain status displayed serverHold .
Truth Revealed
whois example.com
# Domain Name: EXAMPLE.COM
# Registry Expiry Date: 2022-08-15
# Domain Status: serverHold https://icann.org/epp#serverHoldThe domain had expired on 2022‑08‑15, causing the registrar to place it in serverHold status.
Why Renewal Was Missed
Renewal reminder emails were sent to a former employee who had left the company.
Only a single contact was registered for the domain.
The renewal required a procurement process that was never initiated.
No monitoring existed for domain expiration dates.
Emergency Recovery
Renewal Procedure
1. Log into registrar portal – success
2. Attempt online renewal – failed ("domain is in redemption period")
3. Contact registrar support – out of office hours, no answer
4. Submit urgent ticket – wait for response
5. Find 24‑hour emergency contact number – overseas line
6. Call internationally – successAfter a lengthy identity verification, the registrar renewed the domain at 04:15.
DNS Restoration
# Continuously monitor DNS resolution
while true; do
echo "=== $(date) ==="
dig www.example.com +short
sleep 60
done
# Output showed empty responses until 04:50, then the correct IPs appeared.Global DNS caches needed time to propagate the new records.
Accelerating Recovery
# 1. Purge CDN cache (if any)
# 2. Ask cloud provider to flush DNS
# 3. Manually query multiple public DNS servers to trigger updates
for dns in 8.8.8.8 8.8.4.4 1.1.1.1 9.9.9.9 208.67.222.222; do
echo "Testing $dns:"
dig @$dns www.example.com +short
doneBy 05:24 all external probes turned green and the incident was declared resolved.
Post‑mortem Analysis
Direct Cause
The domain expired, causing the registrar to set the status to serverHold, which broke DNS resolution.
Root Causes
Missing process: domain renewal was not part of the IT asset management workflow.
Monitoring blind spot: no alerts for upcoming domain expirations.
Single point of contact: only one person responsible, who had left.
Knowledge gap: no documented procedures for domain management.
Impact Analysis
Business impact:
- Website unavailable – 2h17m
- Mobile app unavailable – 2h17m
- API services down – 2h17m
- ~200 user complaints
- Estimated loss ~800k
Technical impact:
- DNS resolution completely failed
- CDN cache invalidated
- Some SSL certificates could not be verified
- Internal services that used public DNS names failedImprovement Measures
Immediate (within 1 week)
Domain renewal: Extend all critical domains for at least 5 years.
Add multiple contacts: Register primary, secondary, and team contacts for each domain.
Enable auto‑renew: Bind a corporate credit card to the registrar for automatic payment.
Short‑term (within 1 month)
Domain monitoring system:
#!/usr/bin/env python3
import whois, smtplib
from datetime import datetime, timedelta
DOMAINS = ['example.com', 'example.cn', 'example-api.com']
ALERT_DAYS = [90,60,30,14,7,3,1]
def check_domain_expiry(domain):
w = whois.whois(domain)
expiry = w.expiration_date
if isinstance(expiry, list): expiry = expiry[0]
return expiry
def days_until(expiry):
if not expiry: return -1
return (expiry - datetime.now()).days
def send_alert(domain, days, expiry):
print(f"ALERT: {domain} expires in {days} days ({expiry})")
for d in DOMAINS:
exp = check_domain_expiry(d)
days = days_until(exp)
if days <= 0 or days in ALERT_DAYS:
send_alert(d, days, exp)
if __name__ == '__main__':
main()Asset management workflow: Record domain name, registrar, expiry date, contacts, and purpose in the IT asset system; automate daily checks and alert at 90/60/30/7 days before expiry.
Prometheus monitoring:
groups:
- name: domain_expiry
rules:
- alert: DomainExpiryWarning
expr: domain_expiry_days < 30
for: 1h
labels:
severity: warning
annotations:
summary: "Domain is about to expire"
description: "Domain {{ $labels.domain }} will expire in {{ $value }} days"
- alert: DomainExpiryCritical
expr: domain_expiry_days < 7
for: 1h
labels:
severity: critical
annotations:
summary: "Domain expiry critical"
description: "Domain {{ $labels.domain }} will expire in {{ $value }} days, renew immediately"Long‑term
Multiple DNS providers: Configure primary (Cloudflare) and secondary (AWS Route 53) NS records with health checks and automatic failover.
Disaster‑recovery drills: Quarterly simulations of DNS provider failures, domain expiration, and switch‑over procedures.
Avoid internal service dependence on external DNS: Use internal service discovery (Kubernetes ClusterIP) or private DNS zones for intra‑cluster communication.
DNS Outage Checklist
# DNS outage checklist
## 1. Verify DNS resolution
dig www.example.com +short
nslookup www.example.com 8.8.8.8
## 2. Check domain status
whois example.com | grep -E "Status|Expiry"
## 3. Trace DNS path
dig www.example.com +trace
## 4. Inspect authoritative DNS
dig example.com NS
dig @ns1.provider.com www.example.com
## 5. Review DNS records
dig www.example.com A
dig www.example.com CNAME
dig example.com MX
dig example.com TXT
## 6. Check provider status page
# visit provider status URL
## 7. Test network connectivity
ping ns1.provider.com
traceroute ns1.provider.com
## 8. Common DNS response codes
# NOERROR – OK
# NXDOMAIN – name does not exist
# SERVFAIL – server error
# REFUSED – query refused
## 9. Domain status meanings
# ok/active – normal
# serverHold – registrar hold (often unpaid)
# clientHold – registrar hold
# pendingDelete – pending deletion
# redemptionPeriod – in redemptionKey Lessons
Domain names are foundational infrastructure; their failure nullifies any architectural improvements.
Automate everything possible: renewals, monitoring, alerts.
Redundancy is essential – multiple contacts and DNS providers mitigate single‑point failures.
Regular drills expose hidden weaknesses before real incidents occur.
After‑effects
All domains now monitored; alerts fire 90 days before expiry.
Renewal process fully automated with corporate credit‑card billing.
Quarterly DNS disaster‑recovery drills instituted.
No further domain‑related outages have occurred as of 2025.
Ops Community
A leading IT operations community where professionals share and grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
