How We Built Real‑Time SLA Monitoring for Message Push and Doubled Throughput
This article details the end‑to‑end design, node‑level splitting, metric definition, and Spring‑based implementation of SLA monitoring for a high‑volume message‑push system, showing how precise latency and vendor‑stability metrics uncovered bottlenecks, enabled rapid issue detection, and ultimately doubled overall throughput.
Introduction
Push notifications are a low‑cost lever for improving app activity, user stickiness, and retention. They raise next‑day new‑user retention, reactivate dormant users, and can revive churned users when permissions remain open.
Low‑cost activation yields a noticeable rise in short‑term retention.
Push accounts for >10% of first‑launch DAU in many content apps.
Open push permissions enable re‑awakening of churned users.
Background and Pain Points
The existing message center lacked explicit latency standards, causing a gap between business expectations and technical reality. Specific issues:
No defined latency benchmark, so business teams had no clear expectation of how long delivery should take.
Per‑node latency was opaque, preventing targeted optimizations.
Third‑party push channels were black boxes, making anomaly detection difficult.
Occasional faulty or abnormally behaving code went undetected and raised no timely alert.
Monitoring Practice
SLA Monitoring Overview
SLA (Service‑Level Agreement) defines the provider’s commitment to customers. For push services the most relevant SLA dimensions are timeliness and stability, in addition to classic availability, accuracy, capacity, and latency.
System Architecture
Timeliness Monitoring
Node Splitting
The push workflow is decomposed into independent nodes: authentication, user lookup, anti‑fatigue filtering, duplicate filtering, risk control, vendor invocation, etc. Measuring each node separately eliminates monitoring blind spots.
Calculating Node Latency
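Each node records a start timestamp and an end timestamp; latency is computed by subtraction. Example: anti‑fatigue latency = T7 (antiFatigueConsumeTime) – T6 (checkRepeatConsumeTime). In this example the timestamp of the preceding duplicate check (T6) serves as the start of the anti‑fatigue node, so consecutive nodes share a boundary and no time goes unaccounted for.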
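The subtraction can be sketched in plain Java as follows. This is a minimal illustration, not the article's actual code: the class `NodeLatencyTracker` and its method names are hypothetical, and it assumes each node reports one completion timestamp in milliseconds, with the previous node's completion acting as the next node's start.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: derives per-node latency from consecutive completion timestamps. */
final class NodeLatencyTracker {
    // Insertion order is kept so latencies read in pipeline order.
    private final Map<String, Long> latenciesMs = new LinkedHashMap<>();
    private long previousEndMs;

    NodeLatencyTracker(long pipelineStartMs) {
        this.previousEndMs = pipelineStartMs;
    }

    /** Called when a node finishes: latency = this node's end time minus the previous node's end time. */
    void nodeFinished(String nodeName, long endMs) {
        latenciesMs.put(nodeName, endMs - previousEndMs);
        previousEndMs = endMs;
    }

    Map<String, Long> latenciesMs() {
        return latenciesMs;
    }
}
```

For instance, if the duplicate check finishes at T6 = 1040 ms and anti‑fatigue filtering finishes at T7 = 1065 ms, calling `nodeFinished("checkRepeat", 1040)` and then `nodeFinished("antiFatigue", 1065)` records an anti‑fatigue latency of 25 ms.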
Defining Node Metrics
Two metric families were selected:
Push volume and latency – latency at peak was long, and business units had no expectation to compare it against.
Node blockage volume – detects backlog during traffic spikes and informs temporary scaling decisions.
Different standards were set for high‑priority vs. batch pushes. The concrete latency thresholds are illustrated below.
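One way to encode separate standards per push type is a small threshold lookup. The sketch below is an assumption about shape only: the names `PushPriority` and `LatencySla` are hypothetical, and the threshold values are supplied by the caller because the article's concrete numbers are not reproduced here.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical push categories matching the high-priority vs. batch split. */
enum PushPriority { HIGH, BATCH }

/** Hypothetical holder for per-priority latency thresholds, in milliseconds. */
final class LatencySla {
    private final Map<PushPriority, Long> thresholdMs = new EnumMap<>(PushPriority.class);

    LatencySla(long highPriorityMs, long batchMs) {
        thresholdMs.put(PushPriority.HIGH, highPriorityMs);
        thresholdMs.put(PushPriority.BATCH, batchMs);
    }

    /** True when the observed latency exceeds the threshold for that priority. */
    boolean isBreached(PushPriority priority, long observedMs) {
        return observedMs > thresholdMs.get(priority);
    }
}
```

With placeholder thresholds such as `new LatencySla(1_000, 60_000)`, an observed latency of 1,500 ms would breach the high‑priority standard but not the batch one. These numbers are purely illustrative.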
Technical Implementation
Metrics are standardized and isolated from the main push flow. Latency and blockage data are collected asynchronously using Spring AOP combined with Spring Event, ensuring the monitoring code does not pollute or slow the production path.
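The pattern of timing a node around its execution and handing the measurement to a separate thread can be sketched without any framework. The code below is a plain‑Java stand‑in, not the article's Spring code: `timed` plays the role an `@Around` advice would play in Spring AOP, and the executor plays the role of an asynchronous event listener fed through `ApplicationEventPublisher`. All class and method names here are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical event carrying one node's measured latency. */
final class NodeLatencyEvent {
    final String node;
    final long latencyMs;

    NodeLatencyEvent(String node, long latencyMs) {
        this.node = node;
        this.latencyMs = latencyMs;
    }
}

/** Times a node and hands the measurement to a separate thread. */
final class LatencyMonitor {
    private final ExecutorService monitorPool = Executors.newSingleThreadExecutor();
    private final List<NodeLatencyEvent> received = new CopyOnWriteArrayList<>();

    /** Wraps the node logic, like an around-advice: the push thread only pays for the timing calls. */
    <T> T timed(String node, Supplier<T> nodeLogic) {
        long startNs = System.nanoTime();
        try {
            return nodeLogic.get();
        } finally {
            long latencyMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
            NodeLatencyEvent event = new NodeLatencyEvent(node, latencyMs);
            try {
                // The "listener" runs on the monitoring thread, off the push path.
                monitorPool.execute(() -> received.add(event));
            } catch (RuntimeException ignored) {
                // A monitoring failure must never break the push flow.
            }
        }
    }

    List<NodeLatencyEvent> received() {
        return received;
    }

    /** Stops accepting events and waits briefly for queued ones to be handled. */
    void shutdownAndDrain() {
        monitorPool.shutdown();
        try {
            monitorPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A call such as `monitor.timed("antiFatigue", () -> filter.apply(message))` leaves the node's return value and exceptions untouched, which is the property that keeps monitoring isolated from the main flow. Here `filter` and `message` are placeholders.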
Results of Timeliness Monitoring
Real‑time visibility of each node’s latency enabled rapid anomaly detection and guided targeted optimizations. The dashboard below shows node‑level latency distribution.
Corresponding alerts are displayed as follows.
Vendor Push Monitoring
Monitoring Metrics
Multiple third‑party channels are used. Critical metrics include:
Zero‑push alerts – immediate notification when a channel drops to zero.
Vendor success rate, receipt success rate, and click‑through rate – monitored for stability.
Daily user request volume, receipt count, and click count – provide contextual baselines.
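The metrics above can be derived from four counters per vendor and time window. The sketch below rests on assumptions the article does not spell out: success rate is taken as accepted / requested, receipt rate as receipts / accepted, click‑through rate as clicks / receipts, and a "zero push" as a window with requests but no accepted pushes. The class `VendorStats` is hypothetical.

```java
/** Hypothetical per-vendor counters for one monitoring window. */
final class VendorStats {
    final String vendor;
    final long requested; // pushes handed to the vendor channel
    final long accepted;  // pushes the vendor API reported as successful
    final long receipts;  // delivery receipts returned by the vendor
    final long clicks;    // notifications the user clicked

    VendorStats(String vendor, long requested, long accepted, long receipts, long clicks) {
        this.vendor = vendor;
        this.requested = requested;
        this.accepted = accepted;
        this.receipts = receipts;
        this.clicks = clicks;
    }

    double successRate() { return ratio(accepted, requested); }

    double receiptRate() { return ratio(receipts, accepted); }

    double clickRate() { return ratio(clicks, receipts); }

    /** Assumed definition: traffic was sent but the channel accepted nothing. */
    boolean isZeroPush() { return requested > 0 && accepted == 0; }

    // Returns 0.0 for an empty denominator to avoid NaN on idle windows.
    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }
}
```

Comparing these rates against the daily request, receipt, and click counts gives the baseline context needed to tell a real vendor problem from a quiet period.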
Technical Solution
Vendor monitoring runs in a bounded‑memory queue separate from the main push pipeline, preventing monitoring overhead from affecting core throughput.
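A bounded in‑memory queue of this kind might look like the sketch below. It assumes a drop‑when‑full policy, which the article does not state; the class `BoundedMonitorBuffer` is hypothetical. The point is that `offer` on a full `ArrayBlockingQueue` returns `false` immediately instead of blocking, so the push thread never waits on monitoring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded buffer between the push pipeline and the vendor monitor. */
final class BoundedMonitorBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedMonitorBuffer(int capacity) {
        // Fixed capacity caps the memory that monitoring can consume.
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking: returns at once and drops the record when the buffer is full (assumed policy). */
    boolean record(String vendorEvent) {
        boolean accepted = queue.offer(vendorEvent);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called by the monitoring side to take everything currently buffered. */
    List<String> drain() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Tracking the dropped count is a deliberate choice in this sketch: if monitoring falls behind during a spike, the loss is visible rather than silent.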
Results of Vendor Monitoring
After launch, vendor anomalies (e.g., thread creation failures, channel downtime, rule changes) were detected instantly, allowing timely mitigation.
Benefits
Early Issue Detection
Monitoring uncovered several critical incidents:
Vendor push thread creation failures caused thread‑count growth; early detection prevented service outage.
Vendor channel zero‑push events triggered immediate alerts and mitigation.
Unexpected vendor rule changes were identified, enabling rapid adaptation.
Service Performance Improvement
Timeliness monitoring revealed high latency at specific nodes and vendor SDK connection pools during peak traffic. Optimizing these components doubled overall push throughput.
Future Outlook
Current monitoring covers timeliness and vendor stability. Planned extensions include upstream push‑data metrics, funnel conversion rates, broader performance indicators, and monitoring of negative outcomes (e.g., uninstall and notification‑blocking rates) for finer‑grained control.
Conclusion
The SLA‑driven monitoring framework established clear latency standards, exposed performance bottlenecks, enabled a two‑fold throughput increase, and ensured vendor stability, delivering substantial operational value.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
