How We Built a Scalable Monitoring & Alert System for Large‑Scale Transportation Services
This article explains how the team designed and implemented a unified monitoring and alert platform for a multi‑service transportation business, covering architecture, data collection, storage, rule engine, alert delivery, troubleshooting aids, encountered pitfalls, and future enhancements.
Architecture Design and Implementation
The monitoring system aims to provide three core capabilities: automatic alerts for common components, custom business alerts, and rapid problem localization. The overall architecture consists of a web UI for rule management, a core alert engine, and a data layer built on Kafka and Elasticsearch.
1. Data Collection
Metrics are reported from applications through the internal MES data-analysis tool, over channels such as local logs and UDP. Automatic instrumentation covers HTTP requests, SQL queries, and Dubbo RPC calls, while developers can add custom business metrics through the provided APIs; a sketch of such a report follows.
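MES is an internal tool, so its client API is not public; the sketch below only illustrates what a UDP-based metric report could look like. The class name, agent address, port, and payload fields are all assumptions for illustration.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical reporter: MES's real client API is internal, so the class,
// endpoint, and payload fields below are illustrative assumptions.
public class UdpMetricReporter {
    private static final String MES_HOST = "mes-agent.local"; // assumed agent address
    private static final int MES_PORT = 5140;                 // assumed UDP port

    public static void report(String metric, long value, String tags) throws Exception {
        String payload = String.format(
            "{\"metric\":\"%s\",\"value\":%d,\"tags\":\"%s\",\"ts\":%d}",
            metric, value, tags, System.currentTimeMillis());
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(
                data, data.length, InetAddress.getByName(MES_HOST), MES_PORT));
        }
    }

    public static void main(String[] args) throws Exception {
        // e.g. count one failed order-creation call for the alert rules to aggregate
        report("order.create.failed", 1, "service=order-api");
    }
}
```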
2. Data Storage
Collected dynamic metrics are stored in Elasticsearch, chosen for its schema‑less field handling and horizontal scalability, which accommodates the massive volume of per‑request logs and supports aggregation functions such as count, sum, and avg.
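As a rough illustration of the query side, here is a sketch of the filter-plus-aggregation pattern using the Elasticsearch 7.x high-level REST client; the index name, field names, and the avg metric are placeholder assumptions, not the system's actual schema.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.Avg;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class MetricAggregation {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            SearchSourceBuilder source = new SearchSourceBuilder()
                .size(0) // aggregation only, no raw hits
                .query(QueryBuilders.boolQuery()
                    .filter(QueryBuilders.termQuery("app", "order-api"))
                    .filter(QueryBuilders.rangeQuery("@timestamp").gte("now-1m")))
                .aggregation(AggregationBuilders.avg("avg_latency").field("latency_ms"));
            SearchResponse resp = client.search(
                new SearchRequest("order-api-2019.07").source(source),
                RequestOptions.DEFAULT);
            Avg avg = resp.getAggregations().get("avg_latency");
            System.out.println("avg latency over last minute: " + avg.getValue());
        }
    }
}
```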
3. Alert Rules
Alert rules follow a three‑stage "filter‑aggregate‑compare" process. Filters narrow the dataset, aggregation computes a numeric value (e.g., count of errors), and comparison checks the result against a threshold. An example rule uses the fast‑el expression engine:
failedCount / totalCount > 0.8 && totalCount > 100
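This expression fires only when more than 80% of calls failed and the window saw more than 100 calls, so a single stray error cannot trigger an alert. Evaluating it might look like the sketch below, assuming the open-source fel ("fast-el") engine's usual entry points; the input values are illustrative.

```java
import com.greenpineyu.fel.FelEngine;
import com.greenpineyu.fel.FelEngineImpl;
import com.greenpineyu.fel.context.FelContext;

public class RuleEvaluator {
    public static void main(String[] args) {
        FelEngine fel = new FelEngineImpl();
        FelContext ctx = fel.getContext();
        // values produced by the aggregation stage (illustrative numbers)
        ctx.set("failedCount", 90.0);
        ctx.set("totalCount", 100.0);
        // the comparison stage: evaluate the configured expression
        Object fired = fel.eval("failedCount / totalCount > 0.8 && totalCount > 100");
        System.out.println("alert fired: " + fired); // false: totalCount is not > 100
    }
}
```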
4. Automatic Default Rules
For common components like Dubbo and HTTP, default rules are generated automatically and stored in MySQL with Redis caching. Developers can fine-tune these rules via the management UI.
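A typical way to serve such rules is a cache-aside lookup: check Redis first, fall back to MySQL, and backfill the cache. The sketch below shows the pattern with Jedis and JDBC; the key format, table, and column names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import redis.clients.jedis.Jedis;

// Cache-aside lookup: try Redis first, fall back to MySQL, then backfill the
// cache. Key format, table, and column names are illustrative assumptions.
public class RuleStore {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String loadRule(String app, String component) throws Exception {
        String cacheKey = "alert:rule:" + app + ":" + component;
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached; // cache hit
        }
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/monitor", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT expression FROM alert_rule WHERE app = ? AND component = ?")) {
            ps.setString(1, app);
            ps.setString(2, component);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null; // no rule configured
                }
                String expression = rs.getString("expression");
                jedis.setex(cacheKey, 300, expression); // backfill, 5-minute TTL
                return expression;
            }
        }
    }
}
```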
5. Alert Actions
When a rule triggers, alerts are sent via email and WeChat, with plans to introduce tiered alert channels based on severity.
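Since the tiered channels are still a plan, the following is just one possible shape for severity-based routing; every type and name in it is hypothetical.

```java
import java.util.List;
import java.util.Map;

// One possible shape for the planned severity-based tiering; every name here
// is a hypothetical illustration, not the system's actual API.
interface AlertChannel {
    void send(String title, String body);
}

enum Severity { INFO, WARNING, CRITICAL }

class AlertDispatcher {
    // e.g. INFO -> email only; CRITICAL -> email + WeChat
    private final Map<Severity, List<AlertChannel>> routes;

    AlertDispatcher(Map<Severity, List<AlertChannel>> routes) {
        this.routes = routes;
    }

    void dispatch(Severity severity, String title, String body) {
        for (AlertChannel channel : routes.getOrDefault(severity, List.of())) {
            channel.send(title, body);
        }
    }
}
```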
6. Assistance for Issue Localization
The system extracts a tracer_id from hit samples and provides a direct link to Kibana for log inspection, enabling developers to quickly pinpoint the problematic service.
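Constructing such a deep link can be as simple as interpolating the tracer_id into a Kibana Discover URL. The base URL and query format below are illustrative; real Kibana link formats vary by version.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class KibanaLink {
    // Base URL and query shape are illustrative assumptions; actual Kibana
    // deep-link formats differ between versions.
    private static final String KIBANA_BASE =
        "http://kibana.internal/app/kibana#/discover?_a=(query:(query_string:(query:'%s')))";

    public static String forTrace(String tracerId) {
        String query = URLEncoder.encode("tracer_id:" + tracerId, StandardCharsets.UTF_8);
        return String.format(KIBANA_BASE, query);
    }

    public static void main(String[] args) {
        System.out.println(forTrace("c0ffee42"));
    }
}
```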
Pitfalls and Evolution
Memory spikes: A sudden influx of MES logs overwhelmed the consumer; mitigated by rate-limiting Kafka pulls with Guava's RateLimiter (see the sketch after this list).
Elasticsearch slowdown: The large log volume degraded query and indexing performance; resolved by partitioning indices by application and by month.
Frequent Full GC: A custom Logback appender cached logs before Spring finished initializing; fixed by ensuring the proper initialization order and switching the appender's cache to soft references.
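For the first pitfall, the mitigation pattern is straightforward: take a permit from a Guava RateLimiter before processing each record pulled from Kafka. The sketch below shows the idea; the topic name, consumer group, and the 5,000 records/second cap are illustrative.

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ThrottledMesConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "alert-engine");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        // Cap consumption at 5000 records/second so a log burst cannot
        // flood the in-memory processing pipeline (rate is illustrative).
        RateLimiter limiter = RateLimiter.create(5000);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("mes-logs")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    limiter.acquire();      // blocks when over the permitted rate
                    process(record.value());
                }
            }
        }
    }

    private static void process(String log) {
        // parse the MES log line and feed the alert engine (omitted)
    }
}
```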
Future Plans
Improve usability by adding more built-in guidance to the rule-management UI.
Support additional alert dimensions (e.g., MQ, Redis, scheduled tasks).
Introduce graphical dashboards for metric visualization.
Conclusion
The monitoring and alert system provides flexible rule configuration, automatic component coverage, and easy integration for any MES‑enabled service, forming the first step in a fast, reliable online problem‑resolution workflow.