
How JD.com Scales Network Monitoring for Massive Traffic Peaks

This article explains how JD.com’s network team continuously optimizes its large‑scale infrastructure, designs effective monitoring strategies, implements practical monitoring solutions, and outlines future directions to improve network availability, fault detection, and operational efficiency across data centers and the internet backbone.

Efficient Ops

1. JD.com Network Status

JD.com’s traffic grew rapidly from 2014 to 2017, and DCI (data center interconnect) dedicated-line traffic doubled during the 2017 618 promotion, driven by big-data and log-analysis workloads. Independent, business-specific data centers have also emerged, each with its own hardware, performance, and reliability requirements.

JD.com traffic growth chart

Key architectural upgrades include a nationwide 100 Gbps backbone spanning Beijing, Shanghai, and Guangzhou, a rebuilt internet access layer with dual‑core BGP, and a transition from a four‑core to a dual‑core DCN design to improve scalability and manageability.

DCN architecture

2. Monitoring Design Considerations

2.1 Define Monitoring Goals

Determine what “good” network performance means.

Accurately detect anomalies on core metrics.

Rapidly classify issues and trigger appropriate responses.

2.2 Define “Good” Network Standards

Network health must be judged from the user’s perspective, focusing on service availability rather than merely device status.
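One way to make this concrete is to score health as the share of user-style requests that succeed, rather than the share of devices reporting "up". The following is a minimal sketch under stated assumptions: the `ProbeResult` record, the service names, and the 99.9% target are all hypothetical and are not JD.com figures.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    service: str  # hypothetical service name, e.g. "checkout"
    ok: bool      # True if the user-style request succeeded


def availability(results):
    """Return per-service availability as successes / attempts."""
    totals = {}
    for r in results:
        ok, n = totals.get(r.service, (0, 0))
        totals[r.service] = (ok + (1 if r.ok else 0), n + 1)
    return {svc: ok / n for svc, (ok, n) in totals.items()}


def below_target(avail, target=0.999):
    """List services whose availability is under the (assumed) target."""
    return sorted(svc for svc, a in avail.items() if a < target)
```

A device can be fully "up" while a service still appears in `below_target`, which is the gap between device status and user-perceived availability that this section describes.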

2.3 Effective Perception Methods

Adopt black‑box monitoring that simulates user experience while still leveraging white‑box data, prioritizing the most severe and frequent faults.
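A black-box probe can be as simple as timing a TCP connection the way a client would. The sketch below is illustrative only and is not JD.com's tooling: the function names `tcp_probe` and `summarize` are assumptions, and a real deployment would add scheduling, multiple vantage points, and storage.

```python
import socket
import statistics
import time


def tcp_probe(host, port, timeout=1.0):
    """Return TCP connect latency in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


def summarize(samples):
    """Summarize probe latencies; None marks a lost probe."""
    ok = [s for s in samples if s is not None]
    loss = 1.0 - len(ok) / len(samples) if samples else 0.0
    median_ms = statistics.median(ok) if ok else None
    return {"loss": loss, "median_ms": median_ms}
```

For example, `summarize([tcp_probe("service.example", 443) for _ in range(10)])` would give a loss ratio and median latency for a hypothetical endpoint; white-box data such as interface counters can then help explain why a probe failed.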

2.4 Incident Handling and Decision Mechanism

Distinguish between self‑healing issues and those requiring manual intervention, and establish clear escalation procedures.
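The decision mechanism can be sketched as a small rule table. Everything here is a hypothetical policy chosen for illustration: the incident kinds, the retry limit, and the three actions are assumptions, not JD.com's actual escalation rules.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT_FOR_SELF_HEAL = "wait_for_self_heal"
    TICKET = "ticket"
    PAGE_ONCALL = "page_oncall"


@dataclass
class Incident:
    kind: str                 # hypothetical label, e.g. "link_flap"
    user_impact: bool         # black-box probes show user-visible failure
    redundancy_intact: bool   # a redundant path is still carrying traffic
    recurrences: int = 0      # times this incident has already recurred


# Assumed set of incident kinds that usually recover on their own.
SELF_HEALING_KINDS = {"link_flap", "bgp_session_reset"}
MAX_RECURRENCES = 2


def decide(inc):
    """Map an incident to an action using the assumed policy above."""
    if inc.user_impact:
        return Action.PAGE_ONCALL
    if (inc.kind in SELF_HEALING_KINDS
            and inc.redundancy_intact
            and inc.recurrences < MAX_RECURRENCES):
        return Action.WAIT_FOR_SELF_HEAL
    return Action.TICKET
```

Checking user impact first keeps the policy aligned with the user-perspective standard in 2.2: anything users can feel is escalated to a person regardless of its cause.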

3. JD.com Monitoring Practices

3.1 Preparation

Deploy AAA for device management, NTP for time synchronization, SNMP for data collection, Syslog for post‑event analysis, and maintain a CMDB with manual inventory of critical interfaces.
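As an example of the post-event analysis that Syslog enables, the sketch below counts messages by severity. It relies on the standard syslog PRI field, where the value in angle brackets equals facility × 8 + severity; the function names are assumptions made for this example.

```python
import re
from collections import Counter

PRI_RE = re.compile(r"^<(\d{1,3})>")

# Standard syslog severity names, indexed by severity code 0..7.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def parse_pri(line):
    """Return (facility, severity) from a syslog line, or None if absent."""
    m = PRI_RE.match(line)
    if not m:
        return None
    pri = int(m.group(1))
    if pri > 191:  # highest valid PRI: facility 23, severity 7
        return None
    return divmod(pri, 8)


def severity_counts(lines):
    """Count messages per severity name, skipping lines without a PRI."""
    counts = Counter()
    for line in lines:
        parsed = parse_pri(line)
        if parsed is not None:
            counts[SEVERITIES[parsed[1]]] += 1
    return counts
```

Counts like these are only comparable across devices when timestamps line up, which is one reason NTP synchronization appears in the same preparation list.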

3.2 Core Monitoring

Track real-time traffic on internet exits, POD uplinks, and DCI links; 24-hour traffic peaks and traffic ratios; total counts of Syslog messages, packet drops, and CRC errors; application performance alerts; and overall device health.
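Link utilization and peaks are typically derived from successive readings of an interface octet counter. The sketch below assumes 64-bit counters such as the IF-MIB `ifHCInOctets` object collected over SNMP; the polling itself is out of scope, and the function names and sample values are hypothetical.

```python
COUNTER64_MODULUS = 2 ** 64


def octet_delta(prev, curr, modulus=COUNTER64_MODULUS):
    """Difference between two counter readings, tolerating one wrap."""
    return (curr - prev) % modulus


def utilization(prev, curr, interval_s, speed_bps):
    """Fraction of link capacity used between two counter readings."""
    bits = octet_delta(prev, curr) * 8
    return bits / (interval_s * speed_bps)


def peak_utilization(readings, interval_s, speed_bps):
    """Highest utilization across consecutive readings (e.g. a 24-hour window)."""
    pairs = zip(readings, readings[1:])
    return max(utilization(p, c, interval_s, speed_bps) for p, c in pairs)
```

For instance, 37,500,000,000 octets in 60 seconds on a 10 Gbps link works out to 5 Gbps, or a utilization of 0.5.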

Monitoring dashboard

3.3 Internet Quality Cases

Examples show ISP‑specific outages, high utilization on specific internet exits, and spikes in Syslog alerts, illustrating how visual dashboards help pinpoint problems quickly.

ISP outage map

3.4 DCN Quality Cases

Pingmesh-style black-box monitoring reveals packet loss and latency inside the data center, uncovering problems in parts of the network previously assumed to be stable.
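The core idea of full-mesh probing can be sketched in a few lines: every host probes every other host, and a host that shows loss on all of its pairs stands out. This is a simplified illustration rather than the Pingmesh system itself or JD.com's implementation; the host names, the loss values, and the 0.3 threshold are assumptions.

```python
from collections import defaultdict
from itertools import permutations


def mesh_pairs(hosts):
    """All ordered (source, destination) pairs for a full-mesh probe."""
    return list(permutations(hosts, 2))


def host_loss(results):
    """Mean loss per host over every pair the host takes part in.

    results maps (src, dst) to a loss ratio between 0.0 and 1.0.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (src, dst), loss in results.items():
        for host in (src, dst):
            sums[host] += loss
            counts[host] += 1
    return {host: sums[host] / counts[host] for host in sums}


def suspects(results, threshold=0.3):
    """Hosts whose mean loss exceeds the (assumed) threshold."""
    return sorted(h for h, loss in host_loss(results).items() if loss > threshold)
```

Aggregating by rack or POD instead of by host follows the same pattern and helps separate a faulty server from a faulty switch.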

Pingmesh results

4. Future Outlook

Monitoring will evolve from simple fault detection to an automation‑enabling platform that frees engineers from repetitive analysis, improves network availability, and supports large‑scale operations. Emphasis will shift toward internet quality improvements and deeper insight into data‑center network health.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
