
How We Built Real‑Time SLA Monitoring for Message Push and Doubled Throughput

This article details the end‑to‑end design, node‑level splitting, metric definition, and Spring‑based implementation of SLA monitoring for a high‑volume message‑push system, showing how precise latency and vendor‑stability metrics uncovered bottlenecks, enabled rapid issue detection, and ultimately doubled overall throughput.

Architect

Introduction

Push notifications are a low‑cost lever for improving app activity, user stickiness, and retention. They raise next‑day retention of new users, reactivate dormant users, and can win back churned users as long as push permissions remain enabled.

Low‑cost activation yields a noticeable rise in short‑term retention.

Push accounts for >10% of first‑launch DAU in many content apps.

Users who keep push permissions enabled can still be re‑engaged after they churn.

Background and Pain Points

The existing message center lacked explicit latency standards, causing a gap between business expectations and technical reality. Specific issues:

No latency benchmark was defined, so business teams had no clear expectation of delivery time.

Per‑node latency was opaque, preventing targeted optimizations.

Third‑party push channels were black boxes, making anomaly detection difficult.

Occasional code defects or abnormal behavior could not be detected or alerted on promptly.

Monitoring Practice

SLA Monitoring Overview

SLA (Service‑Level Agreement) defines the provider’s commitment to customers. For push services the most relevant SLA dimensions are timeliness and stability, in addition to classic availability, accuracy, capacity, and latency.

System Architecture

Timeliness Monitoring

Node Splitting

The push workflow is decomposed into independent nodes: authentication, user lookup, anti‑fatigue filtering, duplicate filtering, risk control, vendor invocation, and so on. Monitoring each node separately eliminates blind spots.

Calculating Node Latency

Each node records a start timestamp and an end timestamp; latency is the end minus the start. In the example here, the previous node's timestamp serves as the start of the next node: anti‑fatigue latency = T7 (antiFatigueConsumeTime) – T6 (checkRepeatConsumeTime).
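The subtraction can be sketched in a few lines of Java. This is a minimal illustration, not the article's implementation: the class `NodeLatencyCalculator`, its `latencies` method, and the timestamp values are assumptions made for the example; only the names `checkRepeatConsumeTime` and `antiFatigueConsumeTime` come from the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: derive per-node latency from consecutive checkpoint timestamps. */
class NodeLatencyCalculator {

    /**
     * @param checkpoints node checkpoint name -> completion time in epoch millis,
     *                    inserted in the order the nodes ran
     * @param startMillis time at which the message entered the first node
     * @return checkpoint name -> latency in millis (own timestamp minus the previous one)
     */
    static Map<String, Long> latencies(LinkedHashMap<String, Long> checkpoints, long startMillis) {
        Map<String, Long> result = new LinkedHashMap<>();
        long previous = startMillis;
        for (Map.Entry<String, Long> entry : checkpoints.entrySet()) {
            result.put(entry.getKey(), entry.getValue() - previous);
            previous = entry.getValue();
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative timestamps only.
        LinkedHashMap<String, Long> checkpoints = new LinkedHashMap<>();
        checkpoints.put("checkRepeatConsumeTime", 1_000L); // T6
        checkpoints.put("antiFatigueConsumeTime", 1_012L); // T7
        // prints {checkRepeatConsumeTime=10, antiFatigueConsumeTime=12}
        System.out.println(latencies(checkpoints, 990L));
    }
}
```

Keeping the checkpoints in an insertion‑ordered map means each node only has to stamp its own completion time; the per‑node latency then falls out of a single pass.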

Defining Node Metrics

Two metric families were selected:

Push volume and latency – chosen because latency at peak was long and business units had no agreed expectation for delivery time.

Node blockage volume – detects backlog during traffic spikes and informs temporary scaling decisions.

Different standards were set for high‑priority and batch pushes.

[Figure: latency thresholds for high‑priority and batch pushes]
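As a rough sketch of how the two metric families might be represented, the Java class below checks an observed latency against a per‑priority threshold and keeps a per‑node backlog counter. The concrete thresholds from the article are not available here, so the 5‑second and 10‑minute values are placeholders, and the class and method names (`NodeSlaMetrics`, `breachesSla`, `entered`, `left`, `backlogOf`) are hypothetical.

```java
import java.time.Duration;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: latency thresholds per push priority plus a per-node backlog gauge. */
class NodeSlaMetrics {

    enum Priority { HIGH, BATCH }

    // Placeholder thresholds for illustration only; NOT the article's values.
    private static final Map<Priority, Duration> MAX_LATENCY = new EnumMap<>(Priority.class);
    static {
        MAX_LATENCY.put(Priority.HIGH, Duration.ofSeconds(5));
        MAX_LATENCY.put(Priority.BATCH, Duration.ofMinutes(10));
    }

    // Messages that have entered a node but not yet left it.
    private final Map<String, AtomicLong> backlog = new ConcurrentHashMap<>();

    /** True when the observed latency exceeds the threshold for this priority. */
    static boolean breachesSla(Priority priority, Duration observed) {
        return observed.compareTo(MAX_LATENCY.get(priority)) > 0;
    }

    void entered(String node) {
        backlog.computeIfAbsent(node, n -> new AtomicLong()).incrementAndGet();
    }

    void left(String node) {
        backlog.computeIfAbsent(node, n -> new AtomicLong()).decrementAndGet();
    }

    /** Current blockage volume for a node; 0 for a node never seen. */
    long backlogOf(String node) {
        AtomicLong count = backlog.get(node);
        return count == null ? 0 : count.get();
    }

    public static void main(String[] args) {
        NodeSlaMetrics metrics = new NodeSlaMetrics();
        metrics.entered("riskControl");
        metrics.entered("riskControl");
        metrics.left("riskControl");
        System.out.println("riskControl backlog: " + metrics.backlogOf("riskControl"));
        System.out.println("HIGH breach at 6s: " + breachesSla(Priority.HIGH, Duration.ofSeconds(6)));
    }
}
```

A backlog that keeps growing during a traffic spike is the signal the text describes for temporary scaling decisions.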

Technical Implementation

Metrics are standardized and isolated from the main push flow. Latency and blockage data are collected asynchronously using Spring AOP combined with Spring Event, ensuring the monitoring code does not pollute or slow the production path.
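The article does not show its aspect or event classes, so the sketch below uses a dependency‑free stand‑in rather than Spring itself: a JDK dynamic proxy plays the role of the AOP interception, and an `Executor` plays the role of asynchronous event delivery. In a Spring application the equivalent pieces would typically be an `@Around` aspect that calls `ApplicationEventPublisher.publishEvent` and an `@Async` `@EventListener`. All names here (`LatencyProxy`, `PushNode`, `NodeLatencyEvent`, `monitored`) are hypothetical.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Sketch: time a node through a proxy and hand the measurement off asynchronously. */
class LatencyProxy {

    /** Event carrying one measured node latency. */
    static final class NodeLatencyEvent {
        final String node;
        final long nanos;
        NodeLatencyEvent(String node, long nanos) { this.node = node; this.nanos = nanos; }
    }

    /** Hypothetical interface implemented by every push node. */
    interface PushNode {
        boolean handle(String messageId);
    }

    /** Wraps a node so each call is timed and reported off the calling thread. */
    static PushNode monitored(String nodeName, PushNode target,
                              Executor publisher, Consumer<NodeLatencyEvent> listener) {
        return (PushNode) Proxy.newProxyInstance(
                PushNode.class.getClassLoader(),
                new Class<?>[] { PushNode.class },
                (proxy, method, methodArgs) -> {
                    long start = System.nanoTime();
                    try {
                        return method.invoke(target, methodArgs);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow the node's own exception unchanged
                    } finally {
                        long elapsed = System.nanoTime() - start;
                        try {
                            publisher.execute(() ->
                                    listener.accept(new NodeLatencyEvent(nodeName, elapsed)));
                        } catch (RuntimeException ignored) {
                            // A monitoring failure must never break the push path.
                        }
                    }
                });
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        PushNode antiFatigue = monitored("antiFatigue", id -> true, pool,
                e -> System.out.println(e.node + " took " + e.nanos + " ns"));
        antiFatigue.handle("msg-1");
        pool.shutdown(); // already-queued events are still delivered
    }
}
```

The node implementation contains no monitoring code at all, which is the property the paragraph above describes: measurement is attached from the outside and delivered on a different thread.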

Results of Timeliness Monitoring

Real‑time visibility of each node's latency enabled rapid anomaly detection and guided targeted optimizations.

[Figure: dashboard of node‑level latency distribution]

[Figure: corresponding alerts]

Vendor Push Monitoring

Monitoring Metrics

Multiple third‑party channels are used. Critical metrics include:

Zero‑push alerts – immediate notification when a channel's push volume drops to zero.

Vendor success rate, receipt success rate, and click‑through rate – monitored for stability.

Daily user request volume, receipt count, and click count – provide contextual baselines.
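The metrics above can be expressed as simple ratios over per‑channel counters. The sketch below is one possible reading, not the article's definitions: the choice of denominators, the zero‑push rule (requests were sent but none accepted), and every name (`VendorChannelStats`, `isZeroPush`, the counter fields) are assumptions.

```java
/** Sketch: per-channel counters for one time window and the rates derived from them. */
class VendorChannelStats {
    final String channel;
    final long requests; // pushes submitted to the vendor in the window
    final long accepted; // pushes the vendor reported as accepted
    final long receipts; // delivery receipts returned by the vendor
    final long clicks;   // user clicks attributed to those pushes

    VendorChannelStats(String channel, long requests, long accepted, long receipts, long clicks) {
        this.channel = channel;
        this.requests = requests;
        this.accepted = accepted;
        this.receipts = receipts;
        this.clicks = clicks;
    }

    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    // Assumed denominators: each rate is relative to the previous funnel stage.
    double vendorSuccessRate()  { return ratio(accepted, requests); }
    double receiptSuccessRate() { return ratio(receipts, accepted); }
    double clickThroughRate()   { return ratio(clicks, receipts); }

    /** Assumed zero-push rule: traffic was sent, but the vendor accepted none of it. */
    boolean isZeroPush() {
        return requests > 0 && accepted == 0;
    }

    public static void main(String[] args) {
        VendorChannelStats stats = new VendorChannelStats("vendorA", 100, 90, 45, 9);
        System.out.println(stats.channel
                + " success=" + stats.vendorSuccessRate()
                + " receipt=" + stats.receiptSuccessRate()
                + " ctr=" + stats.clickThroughRate()
                + " zeroPush=" + stats.isZeroPush());
    }
}
```

Requiring `requests > 0` keeps a genuinely idle channel from raising a zero‑push alert; whether that matches the production rule is not stated in the text.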

Technical Solution

Vendor monitoring data is buffered in a bounded in‑memory queue that is separate from the main push pipeline, so monitoring overhead cannot affect core throughput.
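A bounded queue of this kind can be sketched with `java.util.concurrent.ArrayBlockingQueue`. The article does not say what happens when the queue is full; dropping the sample and counting the drop is an assumption made here so that the push thread never blocks. The class `MonitoringBuffer` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: bounded buffer between the push path and the monitoring worker. */
class MonitoringBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    MonitoringBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called from the push path. Never blocks; drops the sample when full (assumed policy). */
    boolean record(String sample) {
        boolean accepted = queue.offer(sample);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called by the monitoring worker: removes and returns up to max samples. */
    List<String> drain(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        MonitoringBuffer buffer = new MonitoringBuffer(2);
        buffer.record("vendorA:accepted");
        buffer.record("vendorA:receipt");
        buffer.record("vendorA:click"); // queue is full, so this sample is dropped
        System.out.println("drained=" + buffer.drain(10) + " dropped=" + buffer.droppedCount());
    }
}
```

Because `offer` returns immediately, memory use is capped by the capacity and the worst case for the push path is a lost monitoring sample, not added latency.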

Results of Vendor Monitoring

After launch, vendor anomalies (e.g., thread creation failures, channel downtime, rule changes) were detected instantly, allowing timely mitigation.

Benefits

Early Issue Detection

Monitoring uncovered several critical incidents:

Vendor push thread creation failures caused the thread count to grow; early detection prevented a service outage.

Vendor channel zero‑push events triggered immediate alerts and mitigation.

Unexpected vendor rule changes were identified, enabling rapid adaptation.

Service Performance Improvement

Timeliness monitoring revealed high latency at specific nodes and in vendor SDK connection pools during peak traffic. Optimizing these components doubled overall push throughput.

Future Outlook

Current monitoring covers timeliness and vendor stability. Planned extensions include upstream push‑data metrics, funnel conversion rates, broader performance indicators, and negative‑feedback monitoring (e.g., uninstall and notification‑blocking metrics) for finer‑grained control.

Conclusion

The SLA‑driven monitoring framework established clear latency standards, exposed performance bottlenecks, enabled a two‑fold throughput increase, and ensured vendor stability, delivering substantial operational value.
