P0 Eureka Service-Discovery Collapse Cost a Top E-commerce Platform $120M During Double-11

During the Double-11 shopping festival, a leading e-commerce platform suffered a P0 outage when its Eureka service-discovery cluster was overloaded, triggering a full-chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan. This article dissects the timeline, root causes, capacity mis-planning, monitoring gaps, and remediation strategies.


Background & Traffic

Leading e-commerce platform "A" expected GMV above 300 billion CNY for the Double-11 event. Three months before the event, the system was scaled to eight times daily peak traffic, targeting a peak order-processing capacity of 470 k orders / s. Eureka was the core service-discovery component, handling registration and lookup for more than 12 k microservice instances (order, cart, payment, etc.).

Between 00:00 and 00:05, order-creation requests surged from 20 k / s to 900 k / s (45× the daily peak).

Product‑detail page PV reached 12 M / s, driving a massive increase in Eureka queries.

Mobile traffic accounted for 92 % of requests, generating three times as many service calls per request as PC traffic.

Incident Timeline

00:05 – Monitoring alarm: Eureka node CPU 98 %, registration latency >3 s.

00:12 – Product‑detail pages start failing (Eureka lookup timeout).

00:18 – Order‑submission timeout spikes from 0.3 % to 72 %.

00:25 – Payment chain broken – users cannot navigate to the payment page.

00:40 – Manual scaling of Eureka nodes and cut‑off of non‑critical registration traffic.

01:15 – New nodes join the cluster but data sync fails (Eureka cache overload).

02:00 – Partial recovery of core transaction chain (North‑China region prioritized).

02:47 – Nationwide transaction systems restored; payment success rate back to 99.8 %.

Phase 1 – Eureka Overload (00:05‑00:12)

Traffic shock

During the first five minutes heartbeat requests rose from the normal 150 k / s to 1.3 M / s, while service‑lookup queries jumped from 80 k / s to 650 k / s. The coupon service was expanded from 300 instances to >5 000 across 20 zones, creating a "heartbeat storm".

# Pre‑deployment nodes: 1 200 (8 CPU 16 GB)
# After scaling at 00:00
├── North China: 1 560 nodes
├── East China: 1 920 nodes
└── South China: 1 520 nodes
Total instances: 5 000+ (≈40 % over design capacity)

With the default 30 s heartbeat interval, 5 000+ instances should generate only ≈167 heartbeats / s in aggregate; retry amplification pushed the actual rate to 1.3 M / s, far exceeding Eureka's recommended 500 k / s limit.

Metadata synchronization (peer‑to‑peer) required ~260 MB / s (2 KB per instance).

Registry memory grew from 4 GB to 16 GB, causing OOM pressure.
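
A back-of-envelope check makes the mismatch concrete. The sketch below is not from the original post; it simply plugs in the incident's own figures, and the "amplification factor" is whatever closes the gap to the observed 1.3 M / s:

/** Back-of-envelope registry load check using the incident's figures (illustrative only). */
public final class RegistryLoadEstimate {
    public static void main(String[] args) {
        int instances = 5_200;                 // observed instance count (designed for 3 500)
        int heartbeatIntervalSec = 30;         // Eureka client default
        double baseline = (double) instances / heartbeatIntervalSec;
        System.out.printf("Baseline heartbeats: %.0f / s%n", baseline);          // ≈173 / s

        double observed = 1_300_000;           // heartbeat rate at 00:05
        System.out.printf("Implied amplification: ~%.0fx%n", observed / baseline);

        double registryMB = instances * 2.0 / 1000;    // ~2 KB metadata per instance
        System.out.printf("Full registry payload: ~%.1f MB%n", registryMB);      // ≈10.4 MB
        // The reported ~260 MB / s of peer sync is then ~25 full-registry copies per second.
    }
}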

Resource exhaustion

The peer-node read thread pool (eureka.server.peer-node-read-thread-pool-size=50) was fully blocked (see the saturation sketch after this list).

Jackson serialization consumed ~60 % of CPU.

Swap activity added another ~30 % CPU load.

All three nodes reached 100 % CPU for >120 s.
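
The blocking behaviour is easy to reproduce in miniature. This is not Eureka's actual code, just a minimal demonstration of how a fixed 50-thread pool behind a bounded queue turns a burst into mass rejections:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Toy reproduction: a request burst against a 50-thread pool with a bounded queue. */
public final class PoolSaturationDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                50, 50, 0L, TimeUnit.MILLISECONDS,        // mirrors ...read-thread-pool-size=50
                new ArrayBlockingQueue<>(1_000),
                new ThreadPoolExecutor.AbortPolicy());    // excess work fails fast
        int rejected = 0;
        for (int i = 0; i < 10_000; i++) {                // a sudden heartbeat burst
            try {
                pool.execute(() -> {
                    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
                });
            } catch (RejectedExecutionException e) {
                rejected++;                               // what callers experienced as timeouts
            }
        }
        System.out.println("Rejected: " + rejected + " of 10000");
        pool.shutdownNow();
    }
}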

GC pressure

-Xms16g -Xmx16g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200   // target 200 ms, actual 8‑12 s
-XX:G1OldCSetRegionThresholdPercent=10
-XX:+HeapDumpOnOutOfMemoryError

// GC log excerpt
00:05:22 - CPU Load: 3.8 → 16.2 (full load)
00:06:15 - Old Gen: 1.2G/4G → 3.9G/4G
00:06:43 - Full GC pause: 8‑12 s

Old-Gen usage jumped from roughly 30 % to 98 % in three minutes (1.2 GB → 3.9 GB of the 4 GB old generation), triggering Full GC pauses of 8-12 s that froze all business threads.

Phase 2 – Registry Collapse (00:12‑00:25)

Partial service disconnection

Eureka entered a "semi-paralysed" state: new registrations were rejected, and client-side caches (refreshed every 30 s by default) went stale for 15 minutes.

Core services (order, inventory, payment) could not obtain fresh instance lists, leading to "service not found" errors.

Ribbon (client load‑balancer) with a default 2 s timeout marked ~90 % of healthy instances as DOWN, overloading the remaining 10 %.
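
A toy model (my illustration, not Ribbon's real code) shows why misreading slow instances as dead concentrates load until the survivors also fail:

/** Toy model of the timeout death spiral: each wave of false DOWN markings
 *  concentrates the same total traffic on fewer surviving instances. */
public final class TimeoutDeathSpiral {
    public static void main(String[] args) {
        int healthy = 100;                        // instances, each at 1x normal load
        while (healthy > 10) {
            healthy -= 10;                        // slow responders misread as failed
            double loadEach = 100.0 / healthy;    // survivors absorb the displaced traffic
            System.out.printf("%d up, ~%.1fx load each%n", healthy, loadEach);
        }
        // At 10 survivors each carries ~10x load, so they too start timing out.
    }
}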

Configuration sync chaos

Dynamic routing rules in Spring Cloud Gateway failed to propagate, causing flash‑sale traffic to fall back to the default cluster.

Test‑environment flags were mistakenly applied to production, routing ~5 % of orders to test databases.

Cross‑region orders were forced to the Shanghai cluster, adding ~300 ms latency.

Phase 3 – Full‑Chain Snowball (00:25‑02:47)

Order service – payment block

~720 k orders were stuck in "awaiting payment" because the order service could not resolve the payment service address via Eureka; cached addresses pointed to overloaded instances.

Inventory service – oversell

13 k items were oversold (mainly 3C products). Redis‑MySQL inventory sync halted, leading to a 37 k item discrepancy.
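
A standard guard against this class of oversell (a sketch of common practice; the post does not say platform A used exactly this) is to make the stock check and decrement a single atomic step in Redis, so concurrent orders can never drive inventory negative even while the Redis-MySQL sync lags:

import redis.clients.jedis.Jedis;
import java.util.Collections;

/** Atomic check-and-decrement via a Lua script; returns -1 when stock is short. */
public final class AtomicStockDecrement {
    private static final String SCRIPT =
        "local s = tonumber(redis.call('GET', KEYS[1]) or '0') " +
        "if s < tonumber(ARGV[1]) then return -1 end " +
        "return redis.call('DECRBY', KEYS[1], ARGV[1])";

    public static long deduct(Jedis jedis, String skuKey, int qty) {
        Object r = jedis.eval(SCRIPT, Collections.singletonList(skuKey),
                              Collections.singletonList(String.valueOf(qty)));
        return (Long) r;   // -1 means insufficient stock: reject the order
    }
}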

Marketing – coupon failure

High‑value coupons (e.g., ¥300 off ¥1 000) were treated as invalid, affecting ~80 k users and causing complaints.

Payment gateway – duplicate charges

Payment retries (default 3) combined with missing callback delivery caused ~2 000 duplicate deductions, totaling ¥580 k.
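
The textbook fix for retry-driven double charges is an idempotency key: every deduction is recorded under the order ID, so the gateway's three retries land on the same record instead of charging again. A minimal in-memory sketch (my illustration, not the platform's code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Dedupe charges by order ID so a retried request becomes a no-op. */
public final class IdempotentCharger {
    private final Map<String, Long> charges = new ConcurrentHashMap<>(); // orderId -> amount (fen)

    /** Returns true only the first time an order is charged. */
    public boolean charge(String orderId, long amountFen) {
        // putIfAbsent is atomic: a retry with the same orderId changes nothing.
        return charges.putIfAbsent(orderId, amountFen) == null;
    }
}

In production the dedupe record would live in a database with a unique constraint on the order ID, so it survives restarts and is shared across gateway nodes.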

Key Technical Fault Points

Architecture: Single Eureka cluster served all 12 business lines; no isolation.

Capacity planning: the simplistic "daily-peak ×3" formula ignored instance count, retry amplification, and metadata size, leaving the cluster 48 % over its design capacity (see the sketch after this list).

Coupling: Synchronous RPC calls tightly bound to Eureka lookups; no async or message‑queue fallback.

Monitoring gaps: Only total heartbeat count monitored; missing heartbeat‑rate, registration‑rate, and latency alerts.

Emergency procedures: Restart wiped all registration data; Nacos switch took 37 min and suffered format incompatibility.
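
A capacity formula that keeps the dropped factors might look like the sketch below. The function name and the retry/headroom factors are my assumptions, chosen only to show how far the result diverges from a flat ×3:

/** Illustrative capacity model; the factors are assumptions, not the platform's figures. */
public final class RegistryCapacityPlan {
    static double requiredHeartbeatQps(int instances, int heartbeatIntervalSec,
                                       double retryAmplification, double burstHeadroom) {
        double baseline = (double) instances / heartbeatIntervalSec;
        return baseline * retryAmplification * burstHeadroom;
    }
    public static void main(String[] args) {
        // With the incident's instance count, even a modest 20x retry factor and 3x burst
        // headroom demands ~10 400 req/s of heartbeat capacity, not baseline x3 (~520).
        System.out.printf("%.0f req/s%n", requiredHeartbeatQps(5_200, 30, 20, 3));
    }
}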

Data Comparison

Eureka QPS limit: configured 500 k / s, observed 1.3 M / s (160 % over limit) → thread‑pool blocked.

Service instance capacity: configured 3 500, observed 5 200 (48 % over capacity) → memory 16 GB, frequent Full GC.

Registry update alert threshold: intended 60 s but mis-configured as 6 000 s; observed 895 s → no alert fired.

API gateway timeout: configured 3 s, average 15 s → thread‑pool exhaustion, request queueing.

Root Cause Analysis

Single‑point dependency on Eureka for all services.

Optimistic capacity planning that ignored instance count, retry amplification, and metadata growth.

Strong synchronous coupling between order, payment, and inventory services and the service‑discovery layer.

Absence of traffic‑level protection (heartbeat throttling, rate limiting).

Insufficient monitoring and lack of business‑level alert correlation.

Post‑mortem Solutions

Emergency handling improvements

Adopt "single‑node restart + data sync" to preserve ≥80 % of registration data.

Automate Nacos‑Eureka switch with format translation; cut‑over time <100 ms.
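
The format translation is the fiddly part of that switch. A simplified sketch of the mapping (the record types below are stand-ins I made up for com.netflix.appinfo.InstanceInfo and Nacos's Instance, not the SDKs' real classes):

import java.util.Map;

/** Sketch of the Eureka-to-Nacos record translation the automated switch needs. */
public final class RegistryFormatTranslator {
    public record EurekaRecord(String appName, String ip, int port, Map<String, String> metadata) {}
    public record NacosRecord(String serviceName, String ip, int port, boolean ephemeral,
                              Map<String, String> metadata) {}

    public static NacosRecord toNacos(EurekaRecord e) {
        // Eureka app names are upper-case; Nacos convention is lower-case service names.
        return new NacosRecord(e.appName().toLowerCase(), e.ip(), e.port(),
                               true /* ephemeral: heartbeat-kept, like a Eureka lease */,
                               e.metadata());
    }
}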

Dual‑track registration redesign

Replace Eureka with a two‑track system:

Apache ShenYu for core transaction services (CP mode, ZooKeeper strong consistency).

Nacos for non‑core services (AP mode, high availability).

Introduce RegistrySwitch component for sub‑second failover.
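
A minimal sketch of the RegistrySwitch idea (the interface and method names are my assumptions): route every lookup through an atomically swappable primary, so failover is a single reference swap rather than a client restart:

import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Failover by swapping one reference; no re-registration storm on the surviving registry. */
public final class RegistrySwitch {
    public interface Registry { List<String> lookup(String service); boolean healthy(); }

    private final AtomicReference<Registry> primary;
    private final Registry fallback;

    public RegistrySwitch(Registry primary, Registry fallback) {
        this.primary = new AtomicReference<>(primary);
        this.fallback = fallback;
    }

    public List<String> lookup(String service) {
        Registry p = primary.get();
        if (!p.healthy()) {
            primary.compareAndSet(p, fallback);   // sub-second: just a CAS
            p = primary.get();
        }
        return p.lookup(service);
    }
}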

Service mesh integration

Deploy Istio sidecars to cache endpoint discovery (EDS) locally, decoupling east‑west traffic from the central registry. North‑south traffic still uses ShenYu/Nacos with fine‑grained routing rules.

Dynamic heartbeat throttling – SmartHeartbeat

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SmartHeartbeatManager {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    private volatile int currentInterval = 15; // seconds; read by the beat loop

    public void start() {
        scheduler.schedule(this::beat, 0, TimeUnit.SECONDS);                          // heartbeat loop
        scheduler.scheduleAtFixedRate(this::adjustInterval, 5, 30, TimeUnit.SECONDS); // load probe
    }

    // Self-rescheduling beat: a fixed-rate task would freeze the initial interval,
    // so each beat re-schedules itself with whatever interval the probe last chose.
    private void beat() {
        sendHeartbeat();
        scheduler.schedule(this::beat, currentInterval, TimeUnit.SECONDS);
    }

    private void adjustInterval() {
        double cpu = fetchRegistryCpuUsage();
        if (cpu > 80) currentInterval = 300;      // severe load: back off hard
        else if (cpu > 60) currentInterval = 30;  // moderate load
        else currentInterval = 15;                // normal
    }

    private void sendHeartbeat() { /* renew this instance's lease against the registry */ }
    private double fetchRegistryCpuUsage() { /* poll the registry's metrics endpoint */ return 0; }
}

Heartbeat interval adapts to registry load: 15 s (normal), 30 s (moderate), 300 s (severe). The beat task re-schedules itself instead of running at a fixed rate, because scheduleAtFixedRate captures the interval once at scheduling time and would never observe the adjustment.

Registry circuit breaker – RegistryCircuitBreaker

Trigger when CPU > 85 %, memory > 90 %, or latency > 1 s (a minimal sketch follows this list).

Reject new registrations (503) but keep query path alive.

Prioritize heartbeats from services tagged priority=core.

Auto‑scale nodes via cloud‑provider API.
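
A minimal sketch of that policy (thresholds from the list above; the class shape and method names are my assumptions). Registrations shed first because clients can survive on cached query results, but not on an empty registry:

import java.util.Map;

/** Breaker for the registry: shed writes under pressure, never gate reads. */
public final class RegistryCircuitBreaker {
    private volatile boolean open;

    public void evaluate(double cpuPct, double memPct, double p99LatencySec) {
        open = cpuPct > 85 || memPct > 90 || p99LatencySec > 1.0;
    }

    /** Gate for the registration path: 503 while open, unless the caller is core. */
    public int admitRegistration(Map<String, String> instanceMetadata) {
        if (open && !"core".equals(instanceMetadata.get("priority"))) {
            return 503;   // reject new registrations to shed load
        }
        return 200;
    }

    /** Query path is never gated: serving stale reads beats serving nothing. */
    public boolean admitQuery() { return true; }
}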

Chaos engineering practice

Monthly fault‑injection drills simulate registry crashes, network partitions, and node loss.

Scenarios: shut down one registry node; create cross‑region network split; shut down two nodes + 50 % core service instances.

Metrics tracked: detection time < 30 s, auto‑recovery < 5 min, core‑service success > 99.9 %.

These measures aim to eliminate the single point of failure in service discovery, provide adaptive load protection, and ensure rapid, automated recovery for future high‑traffic events.

Tags: Java, Monitoring, microservices, service discovery, capacity planning, fault tolerance, Eureka, incident analysis
Written by Tech Freedom Circle

Crazy Maker Circle (Tech Freedom Architecture Circle): a community of tech enthusiasts, experts, and high-performance fans. Many top-level masters, architects, and hobbyists have achieved tech freedom; another wave of go-getters are hustling hard toward tech freedom.