Turning Manual Performance Monitoring into Automated Multi‑Level Alerts

The author explains how they distinguished test automation from automated testing, identified monitoring pain points, built a custom scraper‑driven alert system with three escalation levels, tackled common pitfalls, and achieved faster, more reliable performance testing alerts.

FunTester

Mar 17, 2022

Background

The company's infrastructure team provides a monitoring dashboard that categorizes nodes by business line, service, and host, displaying metrics such as CPU, memory, JVM, Tomcat, QPS, response time, and network details.

During performance testing, engineers usually know the target interfaces and the service call chain in advance. They watch the metrics of the relevant nodes and stop or adjust the load when a threshold is hit, allowing for monitoring latency and request queuing.

Pain Points

Resource metrics cannot be observed across every node in the call chain in real time; each node has its own monitoring page, so the same checks are repeated page by page.

Nodes that were not anticipated fall outside the predefined monitoring scope, usually because they are reached through inter-service calls.

The standard alert rules are not designed for performance testing and cannot be customized.

Test Automation Approach

To address these issues, the author devised a solution that uses a web scraper to collect monitoring data and a chat robot (messenger bot) to push timely alerts. The rationale includes:

Metrics are numeric, so custom alert rules are easy to define.

The data source is reliable and its structure is stable.

Monitoring granularity is fine enough for the needs.

Robot notifications are flexible, reliable, and highly configurable.

Monitoring URLs follow fixed patterns, allowing one‑click navigation from alerts.
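The last point can be illustrated with a minimal Python sketch. The URL pattern, the `Node` fields, and the message layout below are assumptions made for illustration; the article does not give the real dashboard's pattern or the robot's message format.

```python
from dataclasses import dataclass
from urllib.parse import quote

# Hypothetical URL pattern; the real dashboard's pattern is not given in the article.
MONITOR_URL_PATTERN = "https://monitor.example.com/{business}/{service}/{host}"


@dataclass(frozen=True)
class Node:
    """A monitored node, keyed the way the dashboard groups nodes."""
    business: str
    service: str
    host: str


def monitor_url(node: Node) -> str:
    """Build the dashboard link for a node from the fixed pattern."""
    return MONITOR_URL_PATTERN.format(
        business=quote(node.business, safe=""),
        service=quote(node.service, safe=""),
        host=quote(node.host, safe=""),
    )


def alert_text(node: Node, metric: str, value: float, threshold: float) -> str:
    """Format a plain-text alert whose last line is the one-click link."""
    return (
        f"[{node.business}/{node.service}] {node.host}: "
        f"{metric}={value:g} exceeds {threshold:g}\n"
        f"{monitor_url(node)}"
    )
```

Because the link is derived from the same node identifiers that the scraper already uses, the alert needs no extra lookup to point the reader at the right monitoring page.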

Multi‑Level Alert Push

Three alert levels were implemented, each linked to a distinct robot:

Normal Monitoring: Low-level alerts that serve as early warnings for potential issues.

Severe Warning: Immediate alerts requiring test termination or rapid notification of stakeholders due to high resource consumption.

Exception Warning: Covers abnormal conditions such as CPU spikes, blocked threads, Tomcat thread-pool saturation, or frequent GC events.
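A rough sketch of how one sample could be mapped to a level and routed to that level's robot is shown below. The thresholds, the sample field names, the webhook URLs, and the rule that abnormal states take precedence over plain resource thresholds are all assumptions, not the author's actual rules. The `send` callable is injected so that the sketch stays independent of any particular messenger API.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Level(Enum):
    NORMAL = "normal monitoring"
    SEVERE = "severe warning"
    EXCEPTION = "exception warning"


# Hypothetical webhook endpoints, one robot per level.
ROBOT_WEBHOOKS: Dict[Level, str] = {
    Level.NORMAL: "https://chat.example.com/robot/normal",
    Level.SEVERE: "https://chat.example.com/robot/severe",
    Level.EXCEPTION: "https://chat.example.com/robot/exception",
}

# Hypothetical thresholds; real values would come from testing standards.
CPU_WARN = 70.0      # percent
CPU_SEVERE = 90.0    # percent
GC_FREQUENT = 10     # GC events per minute


def classify(sample: dict) -> Optional[Level]:
    """Return the alert level for one scraped sample, or None if healthy."""
    # Assumption: abnormal states are checked before resource thresholds.
    if (
        sample.get("blocked_threads", 0) > 0
        or sample.get("tomcat_busy_ratio", 0.0) >= 1.0
        or sample.get("gc_per_minute", 0) >= GC_FREQUENT
    ):
        return Level.EXCEPTION
    cpu = sample.get("cpu", 0.0)
    if cpu >= CPU_SEVERE:
        return Level.SEVERE
    if cpu >= CPU_WARN:
        return Level.NORMAL
    return None


def dispatch(sample: dict, send: Callable[[str, str], None]) -> Optional[Level]:
    """Classify a sample and push it to the robot that owns its level."""
    level = classify(sample)
    if level is not None:
        text = f"[{level.value}] {sample['host']}: cpu={sample.get('cpu', 0.0):g}%"
        send(ROBOT_WEBHOOKS[level], text)
    return level
```

Keeping one robot per level lets each chat group subscribe only to the severity it cares about, for example stakeholders following the severe robot only.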

Pitfalls and Solutions

Metric spikes often stem from GC activity; the solution checks GC state alongside the metric when deciding alert severity.

Service restarts cause abrupt metric changes; QPS trends are used to detect restarts.

Scheduled tasks raise resource usage; alerts take QPS variation and average values into account.

Duplicate alerts occur when the same data point is scanned more than once; adding sleep compensation keeps the script's polling interval aligned with the monitoring interval.

Thresholds are based on the standards used in day-to-day testing, refined by small-scale analysis of the monitoring script's logs.

Historical data is stored in LevelDB, avoiding reliance on external services.
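The duplicate-alert pitfall can be sketched as follows. This is one possible reading of "sleep compensation", not the author's implementation: each polling round sleeps for the monitoring interval minus the time the round itself took, and a data point whose timestamp has already been handled is skipped. The `Point` layout and all function names are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical data point layout: (host, metric, timestamp, value).
Point = Tuple[str, str, int, float]


def compensated_sleep(interval: float, elapsed: float) -> float:
    """Seconds left to sleep so that each round starts one interval apart."""
    return max(0.0, interval - elapsed)


class Deduper:
    """Remember the newest timestamp handled per (host, metric)."""

    def __init__(self) -> None:
        self._last_seen: Dict[Tuple[str, str], int] = {}

    def is_new(self, point: Point) -> bool:
        host, metric, ts, _ = point
        key = (host, metric)
        if self._last_seen.get(key, -1) >= ts:
            return False  # same or older data point: already scanned
        self._last_seen[key] = ts
        return True


def poll(
    fetch: Callable[[], Iterable[Point]],
    handle: Callable[[Point], None],
    interval: float,
    rounds: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Scrape `rounds` times, handling each data point at most once."""
    dedupe = Deduper()
    for _ in range(rounds):
        started = clock()
        for point in fetch():
            if dedupe.is_new(point):
                handle(point)
        # Subtract the time spent scraping so the loop does not drift.
        sleep(compensated_sleep(interval, clock() - started))
```

Injecting `clock` and `sleep` keeps the loop testable without waiting for real time to pass.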

Results

Significant reduction in manual monitoring effort and missed alerts.

Discovery of several production service bugs.

Early warnings (1‑2 minutes ahead) for certain service anomalies.

Alert messages now include team identifiers, greatly improving visibility.

The experience demonstrates that combining test automation with office automation concepts clarifies the meaning of "test automation" and enhances overall testing efficiency.
