Building an Observability System Traffic Distribution Diagram
This article explains how to design and implement a traffic distribution diagram for an observability system, covering current cloud‑native tooling, data standardization, transformation, traffic‑flow modeling, aggregation, storage with ClickHouse, and visualization techniques such as Sankey diagrams.
With the rapid growth of cloud‑native technologies, observability has become a core element of modern application development, deployment, and maintenance. This article outlines the motivation, methodology, and results of constructing a system‑service traffic distribution map.
Current Situation
The Cloud Native Computing Foundation (CNCF) supports many observability tools for metrics, logs, and tracing. In our company we use VictoriaMetrics, SkyWalking, and ELK for metrics, tracing, and log collection respectively, but data integration remains a challenge.
To help developers grasp the overall system state, we propose a visual traffic distribution map.
Basic Data Modeling
2.1 Data Standardization
Standardizing and processing data is essential in observability. The following naming conventions are used to keep consistency across systems:
| Name | Meaning | Description | Level |
|------|---------|-------------|-------|
| biz | Business line | Business system of the service | 1 |
| plt | Product line | Product system of the service | 2 |
| sid | System | Independent external service system | 3 |
| mdl | Module | Collection of services with the same function | 4 |
| srv | Service | Aggregation of identical service instances | 5 |
| sc | Cluster | Aggregation of identical service instances in a single availability zone | 6 |
| si | Instance | Single service instance | 7 |
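The seven-level hierarchy above can be sketched as a small data structure. This is a minimal illustration, not the team's actual implementation; the class, field names, and dot-joined label format are assumptions.

```python
# Minimal sketch of the seven-level naming hierarchy from the table above.
# The dot-joined label format is an assumption for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceLabel:
    biz: str  # business line (level 1)
    plt: str  # product line (level 2)
    sid: str  # system (level 3)
    mdl: str  # module (level 4)
    srv: str  # service (level 5)
    sc: str   # cluster in one availability zone (level 6)
    si: str   # single instance (level 7)

    def qualified_name(self, depth: int = 7) -> str:
        """Dot-joined label truncated to the first `depth` levels."""
        parts = (self.biz, self.plt, self.sid, self.mdl, self.srv, self.sc, self.si)
        return ".".join(parts[:depth])

label = ServiceLabel("retail", "pay", "order", "checkout", "order-api", "az1", "pod-0")
```

Truncating to a given depth lets the same label serve both coarse views (service level) and fine-grained ones (instance level).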
2.2 Data Transformation
CI/CD System Refactoring
Refactor the CI/CD pipeline to be non‑intrusive to user code while ensuring metric‑related code follows the standard.
Log Printing Refactoring
Log tags must comply with the above conventions, and logs should include traceId and tracing data for correlation.
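A log line that follows these conventions might look like the sketch below. The function name, field names, and JSON layout are assumptions; the point is that every record carries the standardized tags plus a traceId for correlation with tracing data.

```python
# Hedged sketch: a structured log record carrying standardized tags plus
# traceId so logs can be joined with tracing data. Field names are assumed.
import json
import time

def make_log_record(level: str, msg: str, trace_id: str, tags: dict) -> str:
    record = {
        "ts": int(time.time() * 1000),  # epoch milliseconds
        "level": level,
        "msg": msg,
        "traceId": trace_id,            # correlation key into tracing data
        **tags,                         # biz/plt/sid/mdl/srv/sc/si tags
    }
    return json.dumps(record, sort_keys=True)

line = make_log_record("INFO", "order created", "abc123",
                       {"biz": "retail", "srv": "order-api"})
```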
Tracing Data Collection
Since we use SkyWalking, we modify application start‑up tags so that SkyWalking attributes align with the standardized data.
2.3 System Traffic Modeling
The traffic model describes how requests enter through ingress points, flow between internal services, and finally reach middleware or downstream systems.
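This model is essentially a directed, weighted graph: nodes are ingress points, services, and middleware; edges carry request counts. A minimal sketch, with illustrative node names that are not from the source:

```python
# Sketch of the traffic model as a directed, weighted graph:
# ingress -> internal services -> middleware/downstream systems.
from collections import defaultdict

class TrafficGraph:
    def __init__(self):
        self.edges = defaultdict(float)  # (source, target) -> request count

    def add_flow(self, source: str, target: str, count: float) -> None:
        self.edges[(source, target)] += count

    def outbound(self, source: str) -> float:
        """Total traffic leaving a node."""
        return sum(c for (s, _), c in self.edges.items() if s == source)

g = TrafficGraph()
g.add_flow("ingress-gw", "order-api", 120)
g.add_flow("ingress-gw", "user-api", 80)
g.add_flow("order-api", "mysql", 110)
```

The same edge set later feeds both the aggregation step and the Sankey rendering.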
Data Processing and Storage
3.1 Data Processing
To reflect the overall system status, we perform the following aggregation steps:
Time‑based aggregation (1 min, 5 min, 10 min)
Handling data from different availability zones
Abstracting ingress‑point data
Identifying micro‑service and data information
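The time-based aggregation step can be sketched as bucketing raw per-request samples into fixed windows and summing. This is an illustration under assumed input shapes, not the production pipeline:

```python
# Sketch of time-based aggregation: raw samples are bucketed into fixed
# windows (e.g. 60 s for the 1-minute view) and summed per edge.
from collections import defaultdict

def aggregate(samples, window_sec):
    """samples: iterable of (epoch_sec, source, target, count) tuples."""
    buckets = defaultdict(float)
    for ts, source, target, count in samples:
        bucket_start = ts - ts % window_sec  # align to window boundary
        buckets[(bucket_start, source, target)] += count
    return dict(buckets)

samples = [(60, "gw", "api", 3), (90, "gw", "api", 2), (130, "gw", "api", 5)]
one_min = aggregate(samples, 60)
```

Running the same function with 300 s or 600 s windows yields the 5‑minute and 10‑minute views.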
3.2 Data Storage
ClickHouse is chosen as the storage engine for its column‑store efficiency, fast analytical capabilities, and low‑latency query performance.
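One possible ClickHouse schema for the aggregated traffic edges is sketched below. The table name, column set, and partitioning scheme are assumptions, not the team's actual DDL; a MergeTree engine ordered by time and edge supports the range scans the diagram needs.

```python
# One possible ClickHouse schema for aggregated traffic edges.
# Table name, columns, and partitioning are illustrative assumptions.
TRAFFIC_EDGE_DDL = """
CREATE TABLE IF NOT EXISTS traffic_edge_1m
(
    bucket_start DateTime,
    az           LowCardinality(String),  -- availability zone
    source       LowCardinality(String),
    target       LowCardinality(String),
    requests     UInt64,
    errors       UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(bucket_start)
ORDER BY (bucket_start, source, target)
"""
```

`LowCardinality(String)` suits the small, repetitive set of service names, and ordering by `(bucket_start, source, target)` keeps queries for a single time window fast.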
Data Visualization
4.1 Chart Selection
A Sankey diagram is used to display traffic distribution.
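Sankey renderers such as ECharts or Plotly expect a node list plus indexed links. A small sketch of converting aggregated edges into that shape, with illustrative edge data:

```python
# Sketch: convert aggregated (source, target) -> value edges into the
# node/link arrays a Sankey renderer expects. Edge data is illustrative.
def to_sankey(edges):
    """edges: dict of (source, target) -> traffic value."""
    names = sorted({n for pair in edges for n in pair})
    index = {name: i for i, name in enumerate(names)}
    links = [{"source": index[s], "target": index[t], "value": v}
             for (s, t), v in sorted(edges.items())]
    return {"nodes": [{"name": n} for n in names], "links": links}

sankey = to_sankey({("gw", "api"): 200, ("api", "db"): 150})
```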
4.2 Color Design
Bright colors indicate abnormal states
Neutral colors represent normal states
Additional legends are added for clarity
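The color rule above can be sketched as a simple threshold function. The error-rate thresholds and hex values here are assumptions for illustration:

```python
# Sketch of the color rule: bright colors for abnormal edges, neutral
# colors for normal ones. Thresholds and hex values are assumptions.
def edge_color(error_rate: float) -> str:
    if error_rate >= 0.05:
        return "#e74c3c"  # bright red: abnormal
    if error_rate >= 0.01:
        return "#f39c12"  # bright orange: degraded
    return "#b0b7c3"      # neutral grey: normal

color = edge_color(0.002)  # a healthy edge stays neutral
```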
4.3 Data Normalization
Traffic values are normalized into five levels to keep the diagram concise and clear.
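One way to map raw values onto five levels is a log scale relative to the largest flow, so small flows stay visible next to dominant ones. The bucketing rule below is an assumption, not the article's exact formula:

```python
# Sketch of normalizing traffic values onto five discrete levels so link
# widths stay readable. The log-scale bucketing is an assumption.
import math

def traffic_level(value: float, max_value: float) -> int:
    """Map a traffic value to level 1..5 relative to the diagram maximum."""
    if value <= 0 or max_value <= 0:
        return 1
    ratio = min(value / max_value, 1.0)
    # each factor of 10 below the maximum drops one level, floored at 1
    return max(1, 5 + math.floor(math.log10(ratio)))

level = traffic_level(10, 100)  # one order of magnitude below the max
```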
4.4 Focus Filtering
Exclude configuration‑center and registry‑center data
Simplify complex service structures
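Focus filtering can be sketched as dropping any edge that touches an excluded infrastructure node. The excluded name set is an assumption:

```python
# Sketch of focus filtering: drop edges touching config/registry centers
# so the diagram shows only business traffic. Names are assumptions.
EXCLUDED = {"config-center", "registry-center"}

def filter_edges(edges):
    """edges: dict of (source, target) -> value; keep business edges only."""
    return {(s, t): v for (s, t), v in edges.items()
            if s not in EXCLUDED and t not in EXCLUDED}

kept = filter_edges({("api", "db"): 10, ("api", "registry-center"): 99})
```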
4.5 Demo
The demo displays traffic distribution by the availability‑zone dimension.
Conclusion
The system‑service traffic distribution diagram provides a macro view of the overall system health, allowing developers and operators to quickly spot potential issues. The accumulated raw data can be further analyzed to continuously deliver business value.
Yum! Tech Team
How we support the digital platform of China's largest restaurant group—technology behind hundreds of millions of consumers and over 12,000 stores.