How We Built a Scalable Monitoring & Alert System for Large‑Scale Transportation Services
This article explains how the team designed and implemented a unified monitoring and alert platform for a multi‑service transportation business, covering architecture, data collection, storage, rule engine, alert delivery, troubleshooting aids, encountered pitfalls, and future enhancements.
Architecture Design and Implementation
The monitoring system aims to provide three core capabilities: automatic alerts for common components, custom business alerts, and rapid problem localization. The overall architecture consists of a web UI for rule management, a core alert engine, and a data layer built on Kafka and Elasticsearch.
1. Data Collection
Metrics are reported from applications through the internal MES data-analysis tool, over channels such as local logs and UDP. Automatic instrumentation covers HTTP requests, SQL queries, and Dubbo RPC calls, while developers can add custom business metrics through the provided APIs; a sketch of such a report follows.
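MES is an internal tool, so its client API is not public; the sketch below only illustrates what a UDP-based metric report could look like. The class name, agent address, port, and payload fields are all assumptions for illustration.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical reporter: MES's real client API is internal, so the class,
// endpoint, and payload fields below are illustrative assumptions.
public class UdpMetricReporter {
    private static final String MES_HOST = "mes-agent.local"; // assumed agent address
    private static final int MES_PORT = 5140;                 // assumed UDP port

    public static void report(String metric, long value, String tags) throws Exception {
        String payload = String.format(
            "{\"metric\":\"%s\",\"value\":%d,\"tags\":\"%s\",\"ts\":%d}",
            metric, value, tags, System.currentTimeMillis());
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(
                data, data.length, InetAddress.getByName(MES_HOST), MES_PORT));
        }
    }

    public static void main(String[] args) throws Exception {
        // e.g. count one failed order-creation call for the alert rules to aggregate
        report("order.create.failed", 1, "service=order-api");
    }
}
```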
2. Data Storage
Collected dynamic metrics are stored in Elasticsearch, chosen for its schema‑less field handling and horizontal scalability, which accommodates the massive volume of per‑request logs and supports aggregation functions such as count, sum, and avg.
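As a rough illustration of the query side, here is a sketch of the filter-plus-aggregation pattern using the Elasticsearch 7.x high-level REST client; the index name, field names, and the avg metric are placeholder assumptions, not the system's actual schema.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.Avg;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class MetricAggregation {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            SearchSourceBuilder source = new SearchSourceBuilder()
                .size(0) // aggregation only, no raw hits
                .query(QueryBuilders.boolQuery()
                    .filter(QueryBuilders.termQuery("app", "order-api"))
                    .filter(QueryBuilders.rangeQuery("@timestamp").gte("now-1m")))
                .aggregation(AggregationBuilders.avg("avg_latency").field("latency_ms"));
            SearchResponse resp = client.search(
                new SearchRequest("order-api-2019.07").source(source),
                RequestOptions.DEFAULT);
            Avg avg = resp.getAggregations().get("avg_latency");
            System.out.println("avg latency over last minute: " + avg.getValue());
        }
    }
}
```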
3. Alert Rules
Alert rules follow a three‑stage "filter‑aggregate‑compare" process. Filters narrow the dataset, aggregation computes a numeric value (e.g., count of errors), and comparison checks the result against a threshold. An example rule uses the fast‑el expression engine:
failedCount / totalCount > 0.8 && totalCount > 100
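This expression fires only when more than 80% of calls failed and the window saw more than 100 calls, so a single stray error cannot trigger an alert. Evaluating it might look like the sketch below, assuming the open-source fel ("fast-el") engine's usual entry points; the input values are illustrative.

```java
import com.greenpineyu.fel.FelEngine;
import com.greenpineyu.fel.FelEngineImpl;
import com.greenpineyu.fel.context.FelContext;

public class RuleEvaluator {
    public static void main(String[] args) {
        FelEngine fel = new FelEngineImpl();
        FelContext ctx = fel.getContext();
        // values produced by the aggregation stage (illustrative numbers)
        ctx.set("failedCount", 90.0);
        ctx.set("totalCount", 100.0);
        // the comparison stage: evaluate the configured expression
        Object fired = fel.eval("failedCount / totalCount > 0.8 && totalCount > 100");
        System.out.println("alert fired: " + fired); // false: totalCount is not > 100
    }
}
```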
4. Automatic Default Rules
For common components like Dubbo and HTTP, default rules are generated automatically and stored in MySQL with Redis caching. Developers can fine-tune these rules via the management UI.
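A typical way to serve such rules is a cache-aside lookup: check Redis first, fall back to MySQL, and backfill the cache. The sketch below shows the pattern with Jedis and JDBC; the key format, table, and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import redis.clients.jedis.Jedis;

// Cache-aside lookup: try Redis first, fall back to MySQL, then backfill the
// cache. Key format, table, and column names are illustrative assumptions.
public class RuleStore {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String loadRule(String app, String component) throws Exception {
        String cacheKey = "alert:rule:" + app + ":" + component;
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached; // cache hit
        }
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/monitor", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT expression FROM alert_rule WHERE app = ? AND component = ?")) {
            ps.setString(1, app);
            ps.setString(2, component);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null; // no rule configured
                }
                String expression = rs.getString("expression");
                jedis.setex(cacheKey, 300, expression); // backfill, 5-minute TTL
                return expression;
            }
        }
    }
}
```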
5. Alert Actions
When a rule triggers, alerts are sent via email and WeChat, with plans to introduce tiered alert channels based on severity.
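Since the tiered channels are still a plan, the following is just one possible shape for severity-based routing; every type and name in it is hypothetical.

```java
import java.util.List;
import java.util.Map;

// One possible shape for the planned severity-based tiering; every name here
// is a hypothetical illustration, not the system's actual API.
interface AlertChannel {
    void send(String title, String body);
}

enum Severity { INFO, WARNING, CRITICAL }

class AlertDispatcher {
    // e.g. INFO -> email only; CRITICAL -> email + WeChat
    private final Map<Severity, List<AlertChannel>> routes;

    AlertDispatcher(Map<Severity, List<AlertChannel>> routes) {
        this.routes = routes;
    }

    void dispatch(Severity severity, String title, String body) {
        for (AlertChannel channel : routes.getOrDefault(severity, List.of())) {
            channel.send(title, body);
        }
    }
}
```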
6. Assistance for Issue Localization
The system extracts a tracer_id from hit samples and provides a direct link to Kibana for log inspection, enabling developers to quickly pinpoint the problematic service.
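Constructing such a deep link can be as simple as interpolating the tracer_id into a Kibana Discover URL. The base URL and query format below are illustrative; real Kibana link formats vary by version.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class KibanaLink {
    // Base URL and query shape are illustrative assumptions; actual Kibana
    // deep-link formats differ between versions.
    private static final String KIBANA_BASE =
        "http://kibana.internal/app/kibana#/discover?_a=(query:(query_string:(query:'%s')))";

    public static String forTrace(String tracerId) {
        String query = URLEncoder.encode("tracer_id:" + tracerId, StandardCharsets.UTF_8);
        return String.format(KIBANA_BASE, query);
    }

    public static void main(String[] args) {
        System.out.println(forTrace("c0ffee42"));
    }
}
```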
Pitfalls and Evolution
Memory spikes: A sudden influx of MES logs overwhelmed the consumer; mitigated by rate-limiting Kafka pulls with Guava's RateLimiter (see the sketch after this list).
Elasticsearch slowdown: The large log volume degraded query and indexing performance; resolved by partitioning indices by application and by month.
Frequent Full GC: A custom Logback appender cached logs before Spring finished initializing; fixed by ensuring the proper initialization order and switching the appender's cache to soft references.
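For the first pitfall, the mitigation pattern is straightforward: take a permit from a Guava RateLimiter before processing each record pulled from Kafka. The sketch below shows the idea; the topic name, consumer group, and the 5,000 records/second cap are illustrative.

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ThrottledMesConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "alert-engine");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        // Cap consumption at 5000 records/second so a log burst cannot
        // flood the in-memory processing pipeline (rate is illustrative).
        RateLimiter limiter = RateLimiter.create(5000);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("mes-logs")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    limiter.acquire();      // blocks when over the permitted rate
                    process(record.value());
                }
            }
        }
    }

    private static void process(String log) {
        // parse the MES log line and feed the alert engine (omitted)
    }
}
```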
Future Plans
Improve usability by adding more built-in guidance to the rule-management UI.
Support additional alert dimensions (e.g., MQ, Redis, scheduled tasks).
Introduce graphical dashboards for metric visualization.
Conclusion
The monitoring and alert system provides flexible rule configuration, automatic component coverage, and easy integration for any MES‑enabled service, forming the first step in a fast, reliable online problem‑resolution workflow.