Operations 19 min read

How a Missed Domain Renewal Crashed Our Site for 2 Hours – Full DNS Outage Postmortem

At 3:07 AM on August 15 2025 a critical alert indicated the entire site was inaccessible, leading to a 2‑hour, 500 k‑user outage caused by an expired domain that entered serverHold status, and this postmortem details the detection, root‑cause analysis, emergency recovery steps, and long‑term remediation measures.

Ops Community
Ops Community
Ops Community
How a Missed Domain Renewal Crashed Our Site for 2 Hours – Full DNS Outage Postmortem

Overview

On August 15, 2025 at 03:07 the on‑call engineer was woken by an urgent call: the whole site was down and users could not access any service. Initial health checks showed all internal services healthy, but external probes were red, pointing to a DNS‑related problem.

Impact

Duration: 2 hours 17 minutes

Scope: All external services unavailable

Users affected: ~500 000

Direct loss: ~800 000 (estimated from GMV)

Indirect loss: Brand reputation, user complaints

Timeline

03:07 - Alert received, abnormality discovered
03:12 - Started investigation, services appear normal
03:18 - DNS resolution issue identified
03:25 - Attempted DNS provider switch
03:35 - Discovered domain had expired
03:40 - Contacted registrar for emergency renewal
04:15 - Domain renewal succeeded
04:50 - DNS records restored
05:24 - All services back to normal
05:30 - Incident closed, post‑mortem written

Detection and Initial Investigation

Alert Trigger

[CRITICAL] HTTP probe failed - https://www.example.com - Connection timeout
[CRITICAL] HTTP probe failed - https://api.example.com - Connection timeout
[CRITICAL] HTTP probe failed - https://m.example.com - Connection timeout
[WARNING] External availability dropped to 0%

All pods in the gateway and production namespaces were checked with kubectl get pods and showed no issues.

Network Layer Check

curl -v https://www.example.com
# * Could not resolve host: www.example.com

nslookup www.example.com 8.8.8.8
# server can't find www.example.com: SERVFAIL

nslookup www.example.com 1.1.1.1
# server can't find www.example.com: SERVFAIL

Both public DNS resolvers failed, confirming the problem was with the domain itself.

Deep DNS Diagnosis

dig www.example.com +trace
# (simplified output showing root, .com, and example.com NS records)
# connection timed out; no servers could be reached

# Authoritative DNS servers were unreachable

dig example.com NS

dig @ns1.dnsprovider.com www.example.com
# connection timed out; no servers could be reached

The authoritative name servers could not be reached.

Root‑Cause Identification

First Hypothesis – DNS Provider Outage

curl -s https://status.dnsprovider.com | grep -i "operational"
# All Systems Operational

The provider status page was normal, so the issue lay elsewhere.

Second Hypothesis – DNS Records Tampered

Checking the DNS management console showed all records correct and no recent changes, but the domain status displayed serverHold .

Truth Revealed

whois example.com
# Domain Name: EXAMPLE.COM
# Registry Expiry Date: 2022-08-15
# Domain Status: serverHold https://icann.org/epp#serverHold

The domain had expired on 2022‑08‑15, causing the registrar to place it in serverHold status.

Why Renewal Was Missed

Renewal reminder emails were sent to a former employee who had left the company.

Only a single contact was registered for the domain.

The renewal required a procurement process that was never initiated.

No monitoring existed for domain expiration dates.

Emergency Recovery

Renewal Procedure

1. Log into registrar portal – success
2. Attempt online renewal – failed ("domain is in redemption period")
3. Contact registrar support – out of office hours, no answer
4. Submit urgent ticket – wait for response
5. Find 24‑hour emergency contact number – overseas line
6. Call internationally – success

After a lengthy identity verification, the registrar renewed the domain at 04:15.

DNS Restoration

# Continuously monitor DNS resolution
while true; do
  echo "=== $(date) ==="
  dig www.example.com +short
  sleep 60
done
# Output showed empty responses until 04:50, then the correct IPs appeared.

Global DNS caches needed time to propagate the new records.

Accelerating Recovery

# 1. Purge CDN cache (if any)
# 2. Ask cloud provider to flush DNS
# 3. Manually query multiple public DNS servers to trigger updates
for dns in 8.8.8.8 8.8.4.4 1.1.1.1 9.9.9.9 208.67.222.222; do
  echo "Testing $dns:"
  dig @$dns www.example.com +short
 done

By 05:24 all external probes turned green and the incident was declared resolved.

Post‑mortem Analysis

Direct Cause

The domain expired, causing the registrar to set the status to serverHold, which broke DNS resolution.

Root Causes

Missing process: domain renewal was not part of the IT asset management workflow.

Monitoring blind spot: no alerts for upcoming domain expirations.

Single point of contact: only one person responsible, who had left.

Knowledge gap: no documented procedures for domain management.

Impact Analysis

Business impact:
- Website unavailable – 2h17m
- Mobile app unavailable – 2h17m
- API services down – 2h17m
- ~200 user complaints
- Estimated loss ~800k

Technical impact:
- DNS resolution completely failed
- CDN cache invalidated
- Some SSL certificates could not be verified
- Internal services that used public DNS names failed

Improvement Measures

Immediate (within 1 week)

Domain renewal: Extend all critical domains for at least 5 years.

Add multiple contacts: Register primary, secondary, and team contacts for each domain.

Enable auto‑renew: Bind a corporate credit card to the registrar for automatic payment.

Short‑term (within 1 month)

Domain monitoring system:

#!/usr/bin/env python3
import whois, smtplib
from datetime import datetime, timedelta
DOMAINS = ['example.com', 'example.cn', 'example-api.com']
ALERT_DAYS = [90,60,30,14,7,3,1]
def check_domain_expiry(domain):
    w = whois.whois(domain)
    expiry = w.expiration_date
    if isinstance(expiry, list): expiry = expiry[0]
    return expiry
def days_until(expiry):
    if not expiry: return -1
    return (expiry - datetime.now()).days
def send_alert(domain, days, expiry):
    print(f"ALERT: {domain} expires in {days} days ({expiry})")
for d in DOMAINS:
    exp = check_domain_expiry(d)
    days = days_until(exp)
    if days <= 0 or days in ALERT_DAYS:
        send_alert(d, days, exp)
if __name__ == '__main__':
    main()

Asset management workflow: Record domain name, registrar, expiry date, contacts, and purpose in the IT asset system; automate daily checks and alert at 90/60/30/7 days before expiry.

Prometheus monitoring:

groups:
- name: domain_expiry
  rules:
  - alert: DomainExpiryWarning
    expr: domain_expiry_days < 30
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Domain is about to expire"
      description: "Domain {{ $labels.domain }} will expire in {{ $value }} days"
  - alert: DomainExpiryCritical
    expr: domain_expiry_days < 7
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "Domain expiry critical"
      description: "Domain {{ $labels.domain }} will expire in {{ $value }} days, renew immediately"

Long‑term

Multiple DNS providers: Configure primary (Cloudflare) and secondary (AWS Route 53) NS records with health checks and automatic failover.

Disaster‑recovery drills: Quarterly simulations of DNS provider failures, domain expiration, and switch‑over procedures.

Avoid internal service dependence on external DNS: Use internal service discovery (Kubernetes ClusterIP) or private DNS zones for intra‑cluster communication.

DNS Outage Checklist

# DNS outage checklist
## 1. Verify DNS resolution
 dig www.example.com +short
 nslookup www.example.com 8.8.8.8

## 2. Check domain status
 whois example.com | grep -E "Status|Expiry"

## 3. Trace DNS path
 dig www.example.com +trace

## 4. Inspect authoritative DNS
 dig example.com NS
 dig @ns1.provider.com www.example.com

## 5. Review DNS records
 dig www.example.com A
 dig www.example.com CNAME
 dig example.com MX
 dig example.com TXT

## 6. Check provider status page
 # visit provider status URL

## 7. Test network connectivity
 ping ns1.provider.com
 traceroute ns1.provider.com

## 8. Common DNS response codes
 # NOERROR – OK
 # NXDOMAIN – name does not exist
 # SERVFAIL – server error
 # REFUSED – query refused

## 9. Domain status meanings
 # ok/active – normal
 # serverHold – registrar hold (often unpaid)
 # clientHold – registrar hold
 # pendingDelete – pending deletion
 # redemptionPeriod – in redemption

Key Lessons

Domain names are foundational infrastructure; their failure nullifies any architectural improvements.

Automate everything possible: renewals, monitoring, alerts.

Redundancy is essential – multiple contacts and DNS providers mitigate single‑point failures.

Regular drills expose hidden weaknesses before real incidents occur.

After‑effects

All domains now monitored; alerts fire 90 days before expiry.

Renewal process fully automated with corporate credit‑card billing.

Quarterly DNS disaster‑recovery drills instituted.

No further domain‑related outages have occurred as of 2025.

incident responseDNSOutagedomain renewal
Ops Community
Written by

Ops Community

A leading IT operations community where professionals share and grow together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.