
Evolution of JD Digital Technology’s Host Monitoring System “Diting”: Architecture from V1 to V3

The article chronicles the design, implementation, and iterative evolution of JD Digital Technology’s in‑house host monitoring platform Diting, detailing its V1, V2, and V3 architectures, the challenges encountered at each stage, and future directions toward intelligent, automated operations.

JD Tech Talk

Abstract

Diting is JD Digital Technology’s self‑developed host monitoring system that collects performance data from all business services, generates over 10,000 alerts per day, and processes more than 800 GB of data daily across ten regions and four countries.

Background & Requirements

Early on, disparate open‑source tools (Nagios, Cacti, Zabbix) caused inconsistent monitoring, fragmented alerts, and manual reporting. Key pain points included coverage gaps, data loss, scaling to tens of thousands of nodes, integration with business and organizational metadata, and on‑demand performance reporting.

V1 – Initial Architecture

miicoo – Python agent deployed on every server, collects data and pushes to paaraa.

paaraa – Room‑level proxy that buffers data and forwards it to a message queue.

Unified message queue – decouples producers and consumers.

dt‑MQ – extracts messages, performs alarm evaluation, and stores raw data in MongoDB.

MongoDB – stores time‑series monitoring data.

MySQL – stores relational metadata (business, org structure, alarm thresholds).

DT‑monitor – web UI for visualization and reporting.

V1 satisfied the original five requirements within six months and powered performance reports for major sales events (618, Double 11).

Problems Identified in V1

All components were Python‑based; packaging exceeded 200 MB, and binary compatibility was problematic.

Alert policies were monolithic; customizing them per service required manually editing miicoo configs and re‑pushing to dt‑MQ.

Test‑induced alerts polluted production alert streams, making real faults hard to spot.

Version management of miicoo was entirely manual, infeasible at scale.

DT‑monitor UI became sluggish when rendering charts for tens of thousands of nodes.

V2 – Refined Architecture

Added dt‑mgt for centralized node management and automatic miicoo upgrades.

Re‑implemented miicoo in Go, producing a single self‑contained binary with no heavy runtime dependency.

Introduced a long‑connection protocol and dynamic load‑balancing for paaraa nodes.

Integrated a Spark Streaming component for high‑throughput alarm computation.

Added a dedicated Alarm service for complex alert strategies.

Adopted dual storage: Cassandra for chart‑specific data (fast rendering) and MongoDB for raw long‑term data.
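The per‑metric threshold check at the heart of the alarm pipeline can be sketched as below. The `Rule` schema and operators are assumptions; the article does not describe the Alarm service’s actual data model:

```go
package main

import "fmt"

// Rule is a hypothetical per-metric alarm rule; the real Alarm service's
// schema is not described in the article.
type Rule struct {
	Metric    string
	Op        string // ">" or "<"
	Threshold float64
}

// Breach reports whether a sample violates the rule.
func (r Rule) Breach(value float64) bool {
	switch r.Op {
	case ">":
		return value > r.Threshold
	case "<":
		return value < r.Threshold
	}
	return false
}

func main() {
	cpu := Rule{Metric: "cpu.util", Op: ">", Threshold: 90}
	fmt.Println(cpu.Breach(95)) // true: would emit an alert downstream
	fmt.Println(cpu.Breach(42)) // false: sample is within bounds
}
```

In V2 this kind of evaluation runs at high throughput inside the Spark Streaming stage rather than in a single consumer process.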

V2 successfully handled the 618 and Double 11 peaks, but new issues emerged.

New Issues after V2

miicoo version proliferation made maintenance arduous across diverse operating systems and hardware.

Alert configurations remained too rigid; a single policy per node caused conflicts between developers and operations.

Third‑party monitoring scripts grew complex, requiring indirect invocation via message subscriptions and SSH.

V3 – Current Architecture

Split miicoo into a core framework and plug‑in modules; DT‑softCenter manages plug‑in lifecycle, enabling independent updates.

Refactored the Alarm component into a modular service, allowing per‑user and per‑application alert thresholds without cross‑interference.

Retained dual storage (Cassandra + MongoDB) and enhanced configuration granularity (collection, alarm evaluation, alarm delivery).
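The core/plug‑in split can be illustrated with a small collector interface and registry. The interface name and registry below are illustrative assumptions, not DT‑softCenter’s actual API:

```go
package main

import (
	"fmt"
	"runtime"
)

// Collector is a hypothetical plug-in contract: each module reports one
// metric independently, so plug-ins can be updated without touching the core.
type Collector interface {
	Name() string
	Collect() (float64, error)
}

// cpuCount is a toy plug-in; a real module would sample utilization.
type cpuCount struct{}

func (cpuCount) Name() string              { return "cpu.count" }
func (cpuCount) Collect() (float64, error) { return float64(runtime.NumCPU()), nil }

// registry stands in for the lifecycle management a DT-softCenter-like
// component would provide (install, upgrade, remove per plug-in).
var registry = map[string]Collector{}

func register(c Collector) { registry[c.Name()] = c }

func main() {
	register(cpuCount{})
	for name, c := range registry {
		if v, err := c.Collect(); err == nil {
			fmt.Printf("%s = %v\n", name, v)
		}
	}
}
```

The payoff of this shape is that shipping a new or fixed collector means replacing one plug‑in, not redeploying the whole agent.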

Future Outlook

The platform aims to evolve toward intelligent operations, where thresholds are auto‑derived from historical data and anomalies are predicted proactively, reducing manual intervention.

Written by JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.
