
Evolution of JD Digital Technology’s Host Monitoring System “DiTing”: From V1 to V3

This article chronicles the design and evolution of JD Digital Technology’s self‑built host monitoring platform “DiTing” and the lessons learned along the way: its initial requirements, the V1 architecture, the V2 and V3 redesigns, the challenges encountered, and the future direction toward intelligent operations.

JD Tech

When JD Digital Technology was founded, its finance‑related services lacked a unified monitoring solution; teams independently deployed open‑source tools such as Nagios, Cacti, and Zabbix, resulting in fragmented alerts and cumbersome fault‑analysis workflows.

Version 1 (V1) introduced a custom stack: the miicoo agent on each server, paaraa as a per‑datacenter proxy, a message queue, dt‑MQ for alarm processing, MongoDB for raw metrics, MySQL for relational data, and the DT‑monitor web UI. This architecture satisfied the five original requirements (coverage, data durability, scalability, business‑aware configuration, and reporting) and was launched in under six months, in time to support the major 618 and 11.11 sales events.

After deployment, several pain points emerged: the Python‑based components inflated deployment packages (over 200 MB), uniform alert policies made per‑service tuning difficult, pausing alerts during stress tests was labor‑intensive, version management of miicoo became manual and error‑prone, and the web UI rendered charts slowly once the fleet reached tens of thousands of nodes.

Version 2 (V2) addressed these issues by adding a dt‑mgt service for centralized node management and automatic agent upgrades, introducing a Spark streaming component for high‑throughput alarm calculation, creating a dedicated Alarm service, and splitting storage: Cassandra for chart‑specific data and MongoDB for long‑term raw metrics. The agent was rewritten in Go to produce a single binary, eliminating the heavy Python runtime, and a dynamic load‑balancing algorithm was added to the paaraa layer to keep proxy workloads balanced automatically.
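The article does not describe the load‑balancing algorithm used in the paaraa layer, so the following is only a minimal sketch of one plausible approach: repeatedly move an agent from the busiest proxy to the idlest one until their agent counts differ by at most a tolerance. The function name `rebalance`, the agent and proxy identifiers, and the use of agent count as the load measure are all assumptions made for illustration.

```python
def rebalance(assignments, proxies, tolerance=1):
    """Return a new agent -> proxy mapping with agent counts evened out.

    Hypothetical sketch: load is approximated by the number of agents
    attached to each proxy, which is an assumption, not DiTing's metric.
    """
    tolerance = max(1, tolerance)  # a spread of 0 is not always reachable
    result = dict(assignments)
    by_proxy = {proxy: [] for proxy in proxies}

    # Agents pointing at a proxy that no longer exists are treated as orphans.
    orphans = []
    for agent, proxy in sorted(result.items()):
        if proxy in by_proxy:
            by_proxy[proxy].append(agent)
        else:
            orphans.append(agent)
    for agent in orphans:
        target = min(proxies, key=lambda p: len(by_proxy[p]))
        by_proxy[target].append(agent)
        result[agent] = target

    # Shift one agent at a time from the busiest proxy to the idlest.
    while True:
        busiest = max(proxies, key=lambda p: len(by_proxy[p]))
        idlest = min(proxies, key=lambda p: len(by_proxy[p]))
        if len(by_proxy[busiest]) - len(by_proxy[idlest]) <= tolerance:
            return result
        agent = by_proxy[busiest].pop()
        by_proxy[idlest].append(agent)
        result[agent] = idlest


if __name__ == "__main__":
    current = {f"agent-{i}": "proxy-a" for i in range(6)}
    print(rebalance(current, ["proxy-a", "proxy-b", "proxy-c"]))
```

A production version would more likely weigh proxies by throughput or connection cost than by a raw agent count, and would limit how many agents move per cycle to avoid reconnect storms.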

Despite these improvements, V2 still suffered from complex miicoo versioning across heterogeneous hardware, rigid alarm configurations shared by developers and operations, and increasingly intricate third‑party monitoring scripts.

Version 3 (V3) refactored the system into a core‑plugin architecture: the core miicoo framework remains stable while individual plugins handle specific collection tasks, managed by a new DT‑softCenter component. The Alarm service was also split to allow per‑user policy definitions, eliminating cross‑tenant configuration conflicts and improving overall reliability.
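The core‑plugin split can be illustrated with a small sketch. It is not the miicoo implementation (which the article says is written in Go); the class `AgentCore`, its methods, and the plugin names are hypothetical. The point it shows is that the core only registers and invokes plugins, so a plugin can be replaced or can fail without the core changing or stopping.

```python
from typing import Callable, Dict, Tuple


class AgentCore:
    """Hypothetical stable core: it knows nothing about specific metrics."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[[], dict]] = {}

    def register(self, name: str, collect: Callable[[], dict]) -> None:
        # Registering an existing name replaces the plugin, which models
        # upgrading one collector without redeploying the core.
        self._plugins[name] = collect

    def unregister(self, name: str) -> None:
        self._plugins.pop(name, None)

    def collect_all(self) -> Tuple[dict, dict]:
        """Run every plugin; isolate failures so one bad plugin
        does not block the others."""
        metrics, errors = {}, {}
        for name, collect in self._plugins.items():
            try:
                metrics[name] = collect()
            except Exception as exc:  # plugin faults stay contained
                errors[name] = repr(exc)
        return metrics, errors


def broken_plugin() -> dict:
    raise RuntimeError("sensor not available on this hardware")


if __name__ == "__main__":
    core = AgentCore()
    core.register("cpu", lambda: {"load1": 0.42})
    core.register("raid", broken_plugin)
    print(core.collect_all())
```

In this sketch, a component playing the role of DT‑softCenter would decide which plugins each host registers, so heterogeneous hardware gets different plugin sets while running the same core.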

Looking ahead, the roadmap envisions intelligent operations where thresholds are auto‑derived from historical test data, enabling predictive alerts and reducing manual tuning, thereby evolving DiTing into a more autonomous, AI‑assisted monitoring platform.
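The article does not say how thresholds would be derived from historical data. One common baseline, shown here purely as an assumption, is to set the alert threshold at the historical mean plus a multiple of the standard deviation. The function names and the multiplier `k` are illustrative and not part of DiTing.

```python
import statistics


def derive_threshold(history, k=3.0):
    """Return mean + k * sample standard deviation of past samples.

    Assumed heuristic for illustration; it needs at least two samples.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return statistics.fmean(history) + k * statistics.stdev(history)


def is_anomalous(value, history, k=3.0):
    """Flag a new sample that exceeds the derived threshold."""
    return value > derive_threshold(history, k)


if __name__ == "__main__":
    cpu_history = [40, 42, 41, 39, 43, 40, 41, 42]  # made-up CPU % samples
    print(derive_threshold(cpu_history))
    print(is_anomalous(90, cpu_history))
```

A fixed mean‑plus‑deviation rule ignores daily and seasonal patterns such as sales‑event peaks, so a real system would probably compare against history from comparable time windows or use a forecasting model.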
