
Overwatch: A Distributed System Monitoring Platform for Real‑Time RPC Visibility

Overwatch is an open‑source distributed monitoring platform built by Dada‑Jingdong Home that collects, aggregates, and visualizes RPC traffic across thousands of micro‑services in real time, enabling engineers to quickly pinpoint the root cause of system failures using directed‑graph visualizations and CQRS‑based data queries.

Dada Group Technology

The author, Zhang Xuan, a software engineer from Nanjing University, developed Overwatch, an open-source distributed monitoring platform that Dada-Jingdong Home uses internally to monitor the heavy RPC traffic across its micro-service architecture.

Background: The backend consists of a very large number of micro-services that generate heavy RPC traffic (REST, JDBC, and others). Traditional log-based monitoring only surfaces error status codes. With cascading failures, network issues, timeouts, and long call chains, it cannot quickly show which service actually failed.

To address these challenges, Overwatch was designed to provide real‑time monitoring of all RPC calls and to rapidly identify the root cause when multiple services raise alerts.

Two monitoring approaches were evaluated:

Monitoring from the service provider side by collecting access logs (e.g., Tomcat access.log) via the existing log‑collection system. This approach is easy to implement but cannot detect network errors, time‑outs, or logical failures hidden behind HTTP 200 responses.

Monitoring from the service consumer side, which can objectively capture request success, network errors, time‑outs, and incorrect return values.

The consumer-side approach was chosen. It requires an in-process agent in each calling service to collect RPC information. The collected data is transported through Kafka, aggregated with Storm, and then stored and displayed by the Overwatch service.
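As a rough illustration of what a consumer-side agent can observe, the sketch below wraps an outgoing call and classifies its outcome into the four cases listed above. It is not Overwatch's agent code: the names `Outcome`, `RpcEvent`, and `observe_call`, and the `is_valid` callback used to catch logically wrong responses, are all hypothetical. A real agent would then publish each event to Kafka, a step omitted here.

```python
import socket
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class Outcome(Enum):
    SUCCESS = "success"
    NETWORK_ERROR = "network_error"
    TIMEOUT = "timeout"
    BAD_RESPONSE = "bad_response"  # e.g. HTTP 200 carrying a logical failure


@dataclass
class RpcEvent:
    """One observed call, as seen from the calling (consumer) side."""
    caller: str
    callee: str
    outcome: Outcome


def observe_call(
    caller: str,
    callee: str,
    call: Callable[[], Any],
    is_valid: Optional[Callable[[Any], bool]] = None,
) -> RpcEvent:
    """Run `call` and classify what the consumer actually experienced."""
    try:
        result = call()
    except (TimeoutError, socket.timeout):
        # Checked before OSError because timeouts are OSError subclasses.
        return RpcEvent(caller, callee, Outcome.TIMEOUT)
    except OSError:
        # Connection refused, reset, unreachable host, and similar.
        return RpcEvent(caller, callee, Outcome.NETWORK_ERROR)
    if is_valid is not None and not is_valid(result):
        return RpcEvent(caller, callee, Outcome.BAD_RESPONSE)
    return RpcEvent(caller, callee, Outcome.SUCCESS)
```

None of the three failure outcomes would necessarily appear in the provider's access log, which is the reason given above for measuring on the consumer side.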

Data presentation: Instead of traditional bar/line charts, Overwatch uses a directed‑graph layout where each node represents a service. Three concentric circles encode success rates over the last 1 minute (inner), 5 minutes (middle), and 15 minutes (outer) using a blue‑yellow‑red color gradient. Node size reflects recent traffic volume, and edge color indicates the success rate of RPC calls between services.
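The color encoding can be sketched as a mapping from a success rate to a point on a blue-yellow-red gradient. The article does not give the exact scale, so the sketch makes two assumptions: blue means fully healthy and red means fully failing, and the interpolation is linear with yellow at 50%. The helper names `success_color` and `node_rings` are hypothetical.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]
BLUE: RGB = (0, 0, 255)      # assumed: 100% success
YELLOW: RGB = (255, 255, 0)  # assumed: 50% success
RED: RGB = (255, 0, 0)       # assumed: 0% success


def success_color(rate: float) -> str:
    """Map a success rate in [0, 1] to a hex color on the gradient."""
    rate = min(1.0, max(0.0, rate))
    if rate >= 0.5:
        lo, hi, t = YELLOW, BLUE, (rate - 0.5) / 0.5
    else:
        lo, hi, t = RED, YELLOW, rate / 0.5
    r, g, b = (round(lo[i] + (hi[i] - lo[i]) * t) for i in range(3))
    return f"#{r:02x}{g:02x}{b:02x}"


def node_rings(rate_1m: float, rate_5m: float, rate_15m: float) -> Dict[str, str]:
    """Colors for the three concentric circles of one service node."""
    return {
        "inner": success_color(rate_1m),    # last 1 minute
        "middle": success_color(rate_5m),   # last 5 minutes
        "outer": success_color(rate_15m),   # last 15 minutes
    }
```

A node whose inner ring is red while its outer ring is still blue has only just started failing. A production scale would probably not be linear, since the success rates that matter tend to cluster close to 100%.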

Because the graph must also convey time‑based metrics, color is used as a third visual dimension. This design allows engineers to see the entire system state at a glance and trace dependency chains to locate the origin of anomalies.
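Tracing a dependency chain to the origin of an anomaly amounts to walking the directed graph from an alerting service toward the services it calls. The sketch below is one simple heuristic, not the procedure Overwatch itself implements (the article describes the tracing as something engineers do visually). It assumes edges point from caller to callee and that a set of unhealthy services has already been determined. The function name and the service names in the usage note are hypothetical.

```python
from typing import Dict, Set


def root_cause_candidates(
    calls: Dict[str, Set[str]],  # caller -> set of callees
    unhealthy: Set[str],
    start: str,
) -> Set[str]:
    """Follow unhealthy dependencies from `start`; return the unhealthy
    services that have no unhealthy dependency of their own."""
    roots: Set[str] = set()
    seen: Set[str] = set()
    stack = [start]
    while stack:
        service = stack.pop()
        if service in seen:
            continue  # already visited; also guards against call cycles
        seen.add(service)
        bad_deps = [d for d in calls.get(service, set()) if d in unhealthy]
        if service in unhealthy and not bad_deps:
            roots.add(service)
        stack.extend(bad_deps)
    return roots
```

With `calls = {"gateway": {"order", "user"}, "order": {"inventory"}}` and `unhealthy = {"gateway", "order", "inventory"}`, starting from `"gateway"` returns `{"inventory"}`. If the unhealthy services form a cycle, the heuristic returns nothing, so it only narrows the search.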

To support fast queries of recent metrics, Overwatch adopts the CQRS (Command Query Responsibility Segregation) pattern: write operations (commands) are handled separately from read operations (queries), enabling efficient aggregation of the last 15 minutes of data without heavy per‑request computation.
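A minimal in-memory sketch of this split is shown below. The write side folds pre-aggregated counts into per-minute buckets, and the read side answers the 1, 5, and 15 minute questions by summing at most 15 small buckets. The actual Overwatch backend is written in Node.js and its storage layout is not described in the article, so the class names, the bucket structure, and the eviction rule are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (caller, callee)
WINDOWS = (1, 5, 15)    # minutes, matching the three rings


class ReadModel:
    """Pre-aggregated counters: edge -> minute -> [ok, total]."""

    def __init__(self) -> None:
        self.buckets: Dict[Edge, Dict[int, List[int]]] = defaultdict(dict)


class CommandHandler:
    """Write side: applies incoming aggregates to the read model."""

    def __init__(self, model: ReadModel) -> None:
        self.model = model

    def record(self, edge: Edge, ok_count: int, total_count: int, ts_seconds: float) -> None:
        minute = int(ts_seconds // 60)
        per_edge = self.model.buckets[edge]
        bucket = per_edge.setdefault(minute, [0, 0])
        bucket[0] += ok_count
        bucket[1] += total_count
        # Drop buckets that can no longer fall inside the largest window.
        for old in [m for m in per_edge if m <= minute - max(WINDOWS)]:
            del per_edge[old]


class QueryHandler:
    """Read side: never scans raw events, only the minute buckets."""

    def __init__(self, model: ReadModel) -> None:
        self.model = model

    def success_rates(self, edge: Edge, now_seconds: float) -> Dict[int, Optional[float]]:
        current = int(now_seconds // 60)
        per_edge = self.model.buckets.get(edge, {})
        rates: Dict[int, Optional[float]] = {}
        for window in WINDOWS:
            ok = total = 0
            for minute in range(current - window + 1, current + 1):
                bucket_ok, bucket_total = per_edge.get(minute, (0, 0))
                ok += bucket_ok
                total += bucket_total
            rates[window] = ok / total if total else None  # None: no traffic
        return rates
```

Because the cost of a query depends on the window size rather than on the number of RPC calls, the read path stays cheap even when write traffic is heavy.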

The Overwatch backend is implemented in Node.js, whose event-driven model fits the CQRS pattern well. The architectural diagrams in the original article show the overall data flow: agents → Kafka → Storm → Overwatch service.

Visualization details include:

Concentric circles per node for multi‑interval success rates.

Node size for recent request volume.

Edge color for inter‑service RPC success.

The original article includes screenshots of the directed-graph view, real-time error logs, system overview, per-service statistics, and historical queries.

Conclusion: Overwatch provides real‑time RPC monitoring and a graph‑based UI that helps engineers quickly understand overall system health and isolate failures. Future extensions include monitoring of underlying components (MySQL, Redis, message queues), support for additional RPC protocols (Thrift, gRPC), and finer‑grained metrics down to individual APIs.

References:

D3: Data‑Driven Documents – https://github.com/d3/d3

Martin Fowler on CQRS – https://martinfowler.com/bliki/CQRS.html
