
Building Dynamic Grafana Dashboards for Push System Monitoring

By instrumenting each node of ZuanZuan’s push system with a Prometheus counter labeled by node name and traceId, and visualizing these metrics in a Grafana Flowcharting dashboard that dynamically highlights the trace path, developers can instantly pinpoint failures, cutting troubleshooting time from minutes to near‑zero.

Sohu Tech Products

1 Background

The ZuanZuan push system is an in‑house product. It exposes an access layer to other business systems and passes each message through several rounds of MQ forwarding. Internally it applies content filtering and do‑not‑disturb policies, distributes messages to vendor channels, and finally issues HTTP requests to those channels to deliver the push to devices. Because a single push traverses many logical nodes, developers who receive a report that a push was not received have to trace every cluster in the chain by hand, which is time‑consuming.

2 Origin of the Idea

Each request carries a traceId that is generated by Radar (a custom tracing component) and collected by Zipkin. With the traceId as the query key, the nodes a push passed through and any error nodes it hit can be visualized, which identifies the node where a failed push terminated. Prometheus instrumentation adds only nanosecond‑level latency and has a low memory footprint, so the traceId can be stored as a label. Because every distinct traceId becomes its own time series, this is done as selective sampling in test or sandbox environments.
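
The idea of locating the terminating node can be sketched in a few lines. The example below is an illustration under stated assumptions, not the production logic: it assumes a linear chain, and the node names are hypothetical because the real node list is not given here. It takes the ordered chain and the set of nodes that reported a given traceId, and returns the last node the push reached.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class TerminatingNodeSketch {

    /**
     * Returns the last node of the chain that reported the trace, or empty if none did.
     * Assumes a linear chain; a branching chain would need a graph walk instead.
     */
    static Optional<String> lastReportedNode(List<String> chain, Set<String> reported) {
        String last = null;
        for (String node : chain) {
            if (reported.contains(node)) {
                last = node;
            }
        }
        return Optional.ofNullable(last);
    }

    public static void main(String[] args) {
        // Hypothetical node names, for illustration only.
        List<String> chain = List.of(
                "access", "contentFilter", "doNotDisturb", "channelDispatch", "vendorHttp");
        Set<String> reported = Set.of("access", "contentFilter", "doNotDisturb");
        System.out.println(lastReportedNode(chain, reported).orElse("no node reported"));
    }
}
```

In the dashboard this lookup is done visually: the highlighted path simply stops at the terminating node.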

3 What is a Dynamic View

Dynamic View (Flowcharting) is a Grafana plugin that uses draw.io to draw and display complex diagrams such as architecture diagrams, UML diagrams, and workflows. The plugin can bind data to diagram elements dynamically, support interaction with the diagram, change colors, add links, and apply regex‑based transformations.

4 Building the Dashboard

4.1 Drawing the View – A flowchart of every logical node in the push chain is created (green nodes for normal flow, yellow for error nodes). Each node has a unique status code linked to online documentation.

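4.2 Reporting Data – A shared Counter is defined in a common JAR with two labels, nodeName and traceId:

    // Requires io.prometheus.client.Counter.
    // disableAutoCreateGraph(...) is not part of the open-source Prometheus Java client;
    // it appears to come from an in-house extension.
    private static final Counter NODE_COUNTER = Counter.build()
            .name("push_link_graph_node_monitor")
            .help("push链路节点监控") // "push link node monitoring"
            .labelNames("nodeName", "traceId")
            .disableAutoCreateGraph(true)
            .register();

A utility method reports data:

    public static void reportNodeInfoStrWithTraceId(String nodeName, String traceId) {
        try {
            // Fall back to the current Radar traceId when the caller does not pass one.
            if (StringUtils.isBlank(traceId)) {
                traceId = com.bj58.zhuanzhuan.radar.util.RadarUtils.getTraceId();
            }
            NODE_COUNTER.labels(nodeName, traceId).inc();
        } catch (Exception e) {
            // DO NOTHING (reporting failures are swallowed)
        }
    }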
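
To show what this reporting produces without the Prometheus client on the classpath, the following stdlib‑only sketch imitates a counter with nodeName and traceId labels. The class LabeledCounterSketch, its methods, and the node names are hypothetical stand‑ins rather than the production code, and the rendered line only approximates the Prometheus text format. The point it illustrates is that every distinct (nodeName, traceId) pair becomes its own series.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class LabeledCounterSketch {
    private final String name;
    // One adder per label combination, mirroring one time series per label set.
    private final Map<List<String>, LongAdder> series = new ConcurrentHashMap<>();

    LabeledCounterSketch(String name) {
        this.name = name;
    }

    /** Increments the series identified by the two label values, creating it on first use. */
    void inc(String nodeName, String traceId) {
        series.computeIfAbsent(List.of(nodeName, traceId), k -> new LongAdder()).increment();
    }

    /** Current value of one series, or 0 if that label combination was never reported. */
    long value(String nodeName, String traceId) {
        LongAdder adder = series.get(List.of(nodeName, traceId));
        return adder == null ? 0L : adder.sum();
    }

    /** Number of distinct label combinations seen so far. */
    int seriesCount() {
        return series.size();
    }

    /** Renders one series roughly the way it would appear in a scrape. */
    String render(String nodeName, String traceId) {
        return String.format("%s{nodeName=\"%s\",traceId=\"%s\"} %d",
                name, nodeName, traceId, value(nodeName, traceId));
    }

    public static void main(String[] args) {
        LabeledCounterSketch counter = new LabeledCounterSketch("push_link_graph_node_monitor");
        counter.inc("contentFilter", "trace-1"); // hypothetical node name and trace IDs
        counter.inc("contentFilter", "trace-1");
        counter.inc("contentFilter", "trace-2");
        System.out.println(counter.render("contentFilter", "trace-1"));
        System.out.println("series: " + counter.seriesCount());
    }
}
```

Two traces through the same node already yield two series, which is why traceId labeling is better suited to sampled or non‑production traffic than to every production push.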

4.3 Creating the Grafana Dashboard – A new dashboard is created, basic information (name, tags, time range) is filled, and a variable for traceId is added (type: manual input).

4.4 Importing the Diagram – The flowchart XML is copied from draw.io and pasted into the Flowcharting panel in Grafana.

4.5 Adding PromQL Queries – The query increase(push_link_graph_node_monitor{traceId="${traceId}"}[$__rate_interval]) selects only the series that match the entered traceId and uses nodeName as the legend dimension, so each node in the trace appears as its own series. The query can be validated via the Table view or Query Inspector.
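
The same query can also be checked outside Grafana through the Prometheus HTTP API endpoint /api/v1/query. The sketch below only builds the request URL; it is an illustration, not part of the original setup. Since ${traceId} and $__rate_interval are Grafana variables, they have to be replaced by hand, so the fixed window "5m", the base URL, and the class and method names are all assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class TraceQuerySketch {

    /** Builds the PromQL expression with the Grafana variables substituted by concrete values. */
    static String promQl(String traceId, String window) {
        // Escape backslashes and double quotes so the traceId stays inside the label matcher.
        String safe = traceId.replace("\\", "\\\\").replace("\"", "\\\"");
        return "increase(push_link_graph_node_monitor{traceId=\"" + safe + "\"}[" + window + "])";
    }

    /** Builds an instant-query URL for the Prometheus HTTP API. */
    static String queryUrl(String baseUrl, String traceId, String window) {
        String encoded = URLEncoder.encode(promQl(traceId, window), StandardCharsets.UTF_8);
        return baseUrl + "/api/v1/query?query=" + encoded;
    }

    public static void main(String[] args) {
        // Hypothetical Prometheus address and traceId.
        System.out.println(queryUrl("http://localhost:9090", "abc123", "5m"));
    }
}
```

Fetching the printed URL with any HTTP client returns JSON with one result per nodeName, which should match what the Table view shows.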

4.6 Defining Mappings – Mappings link Prometheus data to diagram elements: color/tooltip, label/text, link, and event/animation mappings. For example, when a node’s metric exceeds a threshold, the corresponding diagram element flashes.
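
Conceptually, a color mapping is a threshold rule applied to each node's latest value. The sketch below expresses such a rule in plain Java; it is a hypothetical stand‑in, not the plugin's configuration format. It assumes that a node the trace never reached stays gray, a normal node that was reached turns green, and an error node that was reached turns yellow. The gray state, the threshold of zero, and the node names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class NodeColorSketch {
    enum Color { GRAY, GREEN, YELLOW }

    /** Threshold rule for a single node: no increase means the trace did not pass through it. */
    static Color colorFor(double value, boolean isErrorNode) {
        if (value <= 0) {
            return Color.GRAY;
        }
        return isErrorNode ? Color.YELLOW : Color.GREEN;
    }

    /** Applies the rule to every node, keeping the input order. */
    static Map<String, Color> colorAll(Map<String, Double> values, Set<String> errorNodes) {
        Map<String, Color> out = new LinkedHashMap<>();
        values.forEach((node, v) -> out.put(node, colorFor(v, errorNodes.contains(node))));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical per-node values for one traceId.
        Map<String, Double> values = new LinkedHashMap<>();
        values.put("contentFilter", 1.0);
        values.put("filterRejected", 1.0);
        values.put("channelDispatch", 0.0);
        System.out.println(colorAll(values, Set.of("filterRejected")));
    }
}
```

In the Flowcharting panel the equivalent rule is configured per mapping rather than written as code.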

5 Results and Benefits

After integrating the dynamic view, entering a traceId instantly highlights the relevant nodes in Grafana, allowing developers to locate failures (e.g., an APNs 400 BadDeviceToken error) without contacting the operations team. The average troubleshooting time dropped from more than 0.25 hours (15 minutes) per incident to near‑zero, which removes most of the manual effort.

6 Wider Adoption

The approach can be applied to any service that can be represented as a flowchart, enabling rapid visual diagnosis of both normal and error paths.

7 Acknowledgements

Thanks to colleagues Wang Jianxin and Zhao Hao for their assistance with data collection and dynamic view construction.

