Design and Implementation of a Unified Monitoring and Alert System for MaFengWo Large Transportation Business
This article describes the motivation, architecture, key components, rule engine, alert actions, and practical lessons learned while building a unified monitoring and alarm system for MaFengWo's large‑scale transportation platform, highlighting data collection, Elasticsearch storage, scheduling, and future enhancements.
The growing number of business lines in the transportation platform leads to frequent operational issues such as order volume drops, traffic declines, and system errors, especially when many third‑party services are involved.
To detect and resolve these problems early, a unified monitoring and alert system was built to provide real‑time fault detection and proactive identification of potential performance degradations.
Core capabilities include automatic alerts for common components (e.g., RPC, HTTP), custom business‑specific alerts defined by developers, and fast problem localization after an alarm is triggered.
The system architecture consists of three layers: a web management console for rule maintenance and query, a core alert engine, and a data layer. Business services integrate via the mes-client-starter JAR.
Data collection uses the internal MES big‑data analysis tool to record metrics from HTTP, SQL, Dubbo, etc., either automatically for common components or manually via developer‑defined trace points.
Data storage relies on Elasticsearch for its dynamic field support and horizontal scalability, enabling efficient aggregation (count, sum, avg) of massive metric streams.
Alert rules follow a three‑stage process—filter, aggregate, compare. Rules are executed every minute using Elastic Job. An example rule filters by app_name=A and is_error=true, aggregates with count, and compares against a threshold.
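The three stages can be illustrated with a minimal in-memory sketch. This is not the system's implementation, which runs the filter and aggregation against Elasticsearch; the class name `AlertRuleSketch`, the event maps, and the threshold value of 1 are hypothetical and only mirror the example rule (app_name=A, is_error=true, count).

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical in-memory model of one alert rule: filter, then aggregate, then compare. */
final class AlertRuleSketch {

    /** Stages 1 and 2: keep events that match every filter term, then count them. */
    static long countMatching(List<Map<String, Object>> events, Map<String, Object> filter) {
        Predicate<Map<String, Object>> matches = event ->
                filter.entrySet().stream()
                        .allMatch(term -> term.getValue().equals(event.get(term.getKey())));
        return events.stream().filter(matches).count();
    }

    /** Stage 3: the rule fires when the aggregate exceeds the threshold. */
    static boolean shouldAlert(long aggregate, long threshold) {
        return aggregate > threshold;
    }

    public static void main(String[] args) {
        // Assumed sample of events collected during the last one-minute window.
        List<Map<String, Object>> lastMinute = List.of(
                Map.of("app_name", "A", "is_error", true),
                Map.of("app_name", "A", "is_error", false),
                Map.of("app_name", "B", "is_error", true),
                Map.of("app_name", "A", "is_error", true));
        Map<String, Object> filter = Map.of("app_name", "A", "is_error", true);

        long errors = countMatching(lastMinute, filter);
        // Prints: errors=2, alert=true
        System.out.println("errors=" + errors + ", alert=" + shouldAlert(errors, 1));
    }
}
```

In the real system the scheduler would trigger this evaluation once per minute for each rule; here the window is just a fixed list.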
Complex conditions, such as “failure rate > 80% and total requests > 100”, are expressed with the fast‑el expression engine: failedCount/totalCount>0.8&&totalCount>100.
Default rules for common components are auto‑generated and stored in MySQL with Redis caching; developers can fine‑tune them via the UI.
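To see why the failure-rate condition carries the extra "total requests > 100" guard, it helps to evaluate it by hand for a few inputs. The sketch below is a plain-Java stand-in, not the expression engine itself; the class name `FailureRateCondition` and the sample counts are hypothetical. Java truncates integer division, so the sketch casts to double explicitly. How the expression engine treats division is not stated in the article and should not be inferred from this code.

```java
/** Plain-Java stand-in for the condition failedCount/totalCount>0.8&&totalCount>100. */
final class FailureRateCondition {

    static boolean fires(long failedCount, long totalCount) {
        // Check the volume guard first so that a zero total never reaches the division.
        if (totalCount <= 100) {
            return false;
        }
        double failureRate = (double) failedCount / totalCount;
        return failureRate > 0.8;
    }

    public static void main(String[] args) {
        // 90% failures, but only 10 requests: too little traffic to alert on.
        System.out.println(fires(9, 10));     // false
        // Enough traffic, but only 45% failures.
        System.out.println(fires(90, 200));   // false
        // Enough traffic and 85% failures.
        System.out.println(fires(170, 200));  // true
    }
}
```

The volume guard keeps a handful of failed requests during a quiet period from paging anyone, which is the usual reason for pairing a ratio with an absolute count.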
Alert actions currently include email and WeChat notifications, with plans to add severity‑based routing.
Assistance for troubleshooting includes hit‑sampling that extracts tracer_id and provides direct Kibana links, as well as customizable fields for quick issue identification.
Practical challenges and solutions:
Memory spikes caused by bursty MES logs were mitigated by rate‑limiting Kafka consumption using Guava’s RateLimiter (e.g., RateLimiter.create(20000)).
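The idea behind rate-limited consumption can be sketched with a simplified token bucket. This is a standard-library stand-in written for illustration, not Guava's RateLimiter and not the system's code; the class name `TokenBucketSketch` is hypothetical, and the clock is passed in explicitly so the behaviour is deterministic. Unlike Guava's acquire(), which blocks until a permit is available, this sketch only reports whether a permit could be taken.

```java
/** Simplified token bucket; an illustrative stand-in, not Guava's RateLimiter. */
final class TokenBucketSketch {
    private final double permitsPerSecond;
    private final double capacity;
    private double available;
    private long lastRefillNanos;

    TokenBucketSketch(double permitsPerSecond, long nowNanos) {
        this.permitsPerSecond = permitsPerSecond;
        this.capacity = permitsPerSecond;   // assumption: allow at most one second of burst
        this.available = capacity;          // assumption: start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    /** Refills according to elapsed time, then consumes one permit if one is available. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        available = Math.min(capacity, available + elapsedSeconds * permitsPerSecond);
        lastRefillNanos = nowNanos;
        if (available >= 1.0) {
            available -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long t0 = 0L;
        TokenBucketSketch limiter = new TokenBucketSketch(5, t0);
        int admitted = 0;
        // Ten messages arrive at the same instant; only the bucket's capacity is admitted.
        for (int i = 0; i < 10; i++) {
            if (limiter.tryAcquire(t0)) {
                admitted++;
            }
        }
        System.out.println("admitted in burst: " + admitted); // admitted in burst: 5
    }
}
```

A consumer loop would wait and retry, or stop polling, whenever no permit is available, so a burst of logs is absorbed by the Kafka backlog rather than by the heap of the consuming process.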
Elasticsearch slowdown was addressed by partitioning indices by application and month.
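Partitioning by application and month amounts to deriving the index name from those two values at write and query time. The article does not give the actual naming scheme, so the `mes-` prefix, the `yyyy.MM` suffix, and the class name `MetricIndexNames` below are assumptions chosen for illustration. The name is lowercased because Elasticsearch requires lowercase index names.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper that maps an application and a month to one index name. */
final class MetricIndexNames {
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyy.MM");

    static String indexFor(String appName, YearMonth month) {
        // Assumed layout: mes-<application>-<year>.<month>
        return "mes-" + appName.toLowerCase(Locale.ROOT) + "-" + month.format(MONTH);
    }

    public static void main(String[] args) {
        // Prints: mes-orderservice-2020.03
        System.out.println(indexFor("OrderService", YearMonth.of(2020, 3)));
    }
}
```

With a scheme like this, a rule that filters on one application only touches that application's indices, and old months can be dropped index by index.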
Frequent Full GC due to Logback’s DelegatingLogbackAppender cache was solved by ensuring proper initialization of ApplicationContextHolder and using the SOFT mode for the cache.
Future work aims to improve usability, add more alert dimensions (e.g., MQ, Redis), and provide graphical dashboards for metric visualization.
In summary, the system offers flexible rule configuration, automatic component alerts, simple integration, and serves as the first step in the production incident‑resolution workflow, ultimately improving service quality.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
