Overwatch: A Distributed Real‑Time RPC Monitoring Platform for System Observability
The article describes Overwatch, a distributed monitoring system developed by Dada‑JD Daojia that collects, aggregates, and visualizes RPC traffic in real time using consumer‑side agents, Kafka, Storm, and a Node.js CQRS architecture, enabling engineers to quickly locate and resolve service failures.
Background: Dada‑JD Daojia's backend consists of numerous microservices generating massive RPC traffic, making fault isolation difficult.
To address this, the Overwatch monitoring platform was developed to collect, aggregate, and visualize RPC data in real time.
Agents embedded in the consumer services collect RPC metrics and publish them to Kafka; Storm aggregates the streams, and the Node.js Overwatch service stores and serves the results.
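As a rough illustration of the aggregation step, the sketch below groups RPC outcome events into per-minute counts for each consumer/provider pair. It is plain Python standing in for the Storm topology, not the platform's actual code; the `RpcEvent` schema, the field names, and the one-minute bucket size are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RpcEvent:
    """One RPC outcome as reported by a consumer-side agent (hypothetical schema)."""
    consumer: str      # calling service
    provider: str      # called service
    timestamp: float   # seconds since the epoch
    success: bool


def aggregate_per_minute(events):
    """Count total and successful calls per (consumer, provider, minute) bucket."""
    buckets = defaultdict(lambda: {"total": 0, "success": 0})
    for event in events:
        minute = int(event.timestamp // 60)  # assumed one-minute bucket size
        bucket = buckets[(event.consumer, event.provider, minute)]
        bucket["total"] += 1
        if event.success:
            bucket["success"] += 1
    return dict(buckets)


# Illustrative events; in the described pipeline these would arrive via Kafka.
events = [
    RpcEvent("order", "inventory", 120.0, True),
    RpcEvent("order", "inventory", 125.5, False),
    RpcEvent("order", "inventory", 185.0, True),
]
result = aggregate_per_minute(events)
```

Keying the buckets by consumer, provider, and minute keeps the aggregated output small enough to store and query per service pair, whatever the raw call volume.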
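Two monitoring approaches were considered: provider‑side log monitoring and consumer‑side instrumentation. The latter was chosen because the caller observes the outcome of every call directly, so errors are detected objectively rather than inferred from provider logs.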
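Consumer-side instrumentation can be sketched as a wrapper around each outgoing call that reports the outcome from the caller's point of view. This is a minimal sketch under stated assumptions: the `instrument` decorator, the `reserve_stock` function, the `latency_ms` field, and the list used as a sink are all hypothetical, and a real agent would publish to Kafka instead of appending to a list.

```python
import time
from functools import wraps


def instrument(consumer, provider, sink):
    """Wrap an RPC call so every outcome is reported from the caller's side."""
    def decorator(call):
        @wraps(call)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            success = False
            try:
                result = call(*args, **kwargs)
                success = True
                return result
            finally:
                # Runs whether the call returned or raised, so failures the
                # provider never logged are still recorded.
                sink.append({
                    "consumer": consumer,
                    "provider": provider,
                    "success": success,
                    "latency_ms": (time.monotonic() - start) * 1000.0,  # assumed metric
                })
        return wrapper
    return decorator


metrics = []  # stand-in for a Kafka producer (assumption)


@instrument("order", "inventory", metrics)
def reserve_stock(sku, quantity):
    """Hypothetical RPC stub used only to exercise the wrapper."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return {"sku": sku, "reserved": quantity}
```

Because the record is written in the `finally` clause, an exception raised by the call still propagates to the caller after the failure has been reported.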
Visualization uses directed graphs where nodes represent services and concentric circles encode recent success rates (1 min, 5 min, 15 min) with color gradients, while edge colors indicate inter‑service call health.
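The ring encoding can be illustrated by computing a success rate for each window and mapping it to a color. The sketch below assumes per-minute buckets for a single service, a linear red-to-green gradient, and grey for windows with no traffic; the source only states that color gradients are used, so the exact mapping is an assumption.

```python
def success_rate(buckets, now_minute, window):
    """Success rate over the last `window` minutes, or None if there were no calls."""
    total = success = 0
    for minute in range(now_minute - window + 1, now_minute + 1):
        bucket = buckets.get(minute)
        if bucket:
            total += bucket["total"]
            success += bucket["success"]
    return success / total if total else None


def rate_to_color(rate):
    """Map a success rate in [0, 1] onto a red-to-green gradient as a hex string."""
    if rate is None:
        return "#808080"  # grey: no traffic in the window (assumed convention)
    rate = max(0.0, min(1.0, rate))
    red = round(255 * (1.0 - rate))
    green = round(255 * rate)
    return f"#{red:02x}{green:02x}00"


def node_rings(buckets, now_minute):
    """Colors for the three concentric rings, keyed by window length in minutes."""
    return {
        window: rate_to_color(success_rate(buckets, now_minute, window))
        for window in (1, 5, 15)
    }
```

A service that failed two minutes ago but has recovered would show a healthy 1-minute ring alongside degraded 5-minute and 15-minute rings, which is what makes the concentric layout useful for spotting recent incidents.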
To support low‑latency queries, Overwatch adopts a CQRS architecture separating command (data ingestion) and query (read) models.
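The CQRS split can be sketched as two small components: a command handler that ingests aggregated counts, and a read model that holds a precomputed view for queries. The class and method names here are hypothetical and the storage is in-memory; the sketch shows only the separation of the write and read paths, not how Overwatch implements it.

```python
class ReadModel:
    """Query side: a precomputed per-service view, read without touching raw data."""

    def __init__(self):
        self._views = {}

    def project(self, service, total, success):
        """Replace the stored view for a service with freshly computed values."""
        self._views[service] = {
            "total": total,
            "success": success,
            "success_rate": success / total if total else None,
        }

    def get_service(self, service):
        """Return the stored view, or None if the service has no data yet."""
        return self._views.get(service)


class CommandHandler:
    """Command side: ingests aggregated counts and updates the projection."""

    def __init__(self, read_model):
        self._totals = {}
        self._read_model = read_model

    def record_aggregate(self, service, total, success):
        """Accumulate counts, then push the new totals into the read model."""
        current = self._totals.setdefault(service, {"total": 0, "success": 0})
        current["total"] += total
        current["success"] += success
        self._read_model.project(service, current["total"], current["success"])
```

Queries call only `get_service`, a single dictionary lookup, so read latency stays independent of ingestion volume.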
The platform has been deployed successfully, handling peak loads of 4 million orders per day, and continues to evolve with support for additional data sources, RPC protocols, and fine‑grained metrics.
JD Tech