
Rebuilding Our APM: Scalable Metrics & Alerts with VictoriaMetrics & VMAlert

This article details the complete redesign of our internal APM system, covering the motivations, architecture choices, metric collection pipeline, integration of VictoriaMetrics and VMAlert, metric and alert design principles, implementation steps, visualizations, performance gains, and future plans for scaling and SaaS‑ification.

Weimob Technology Center

Background

As the company grew, the original APM system showed many pain points: unsatisfactory performance statistics, single-dimensional analysis, frequent trace loss, inflexible alarm configuration, and limited alarm rules. With dozens of open issues and requirements, the old system could no longer be extended, so a new APM-v3 project was launched at the beginning of the year.

The author was responsible for the metrics and alarm implementation, which, alongside trace data, forms the core of the APM system. The original alarm logic ran inside a consumer: it could not aggregate metrics, its rule configuration was rigid, it notified through a single channel, and it had no claim or suppression capabilities. The new design adopts a Prometheus-like solution, and we ultimately selected VictoriaMetrics + VMAlert.

Architecture Design

After component selection, the overall architecture for the metric service was defined:

Front-end applications report trace-compatible data to various message queues (web, mini-program, Node.js, app) in the OpenTelemetry trace format, which supports both call-chain recording and metric extraction.

An exporter service consumes queue data, computes metrics according to APM configuration, and registers with VictoriaMetrics service discovery.

VictoriaMetrics pulls the metrics at regular intervals and persists them.

The alarm workflow is shown at the bottom of the diagram.

Architecture diagram
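
The article does not say which service-discovery mechanism the exporter registers with. One common option in Prometheus-compatible scrapers is HTTP-based discovery (`http_sd_configs`), where the scraper polls an endpoint that returns a JSON list of target groups. The sketch below assumes that mechanism; `ExporterInstance`, the `app` label, and the addresses are hypothetical.

```typescript
// Hypothetical description of one running exporter instance.
interface ExporterInstance {
  host: string;
  port: number;
  appId: string; // APM application whose metrics the instance exposes
}

// Shape of one entry in an HTTP service-discovery response.
interface TargetGroup {
  targets: string[];
  labels: Record<string, string>;
}

// Group instances by application so every scraped series carries an `app`
// label. The label name is an assumption, not the article's actual schema.
function toTargetGroups(instances: ExporterInstance[]): TargetGroup[] {
  const byApp = new Map<string, string[]>();
  for (const inst of instances) {
    const targets = byApp.get(inst.appId) ?? [];
    targets.push(`${inst.host}:${inst.port}`);
    byApp.set(inst.appId, targets);
  }
  return Array.from(byApp, ([appId, targets]) => ({
    targets,
    labels: { app: appId },
  }));
}

const discovered = toTargetGroups([
  { host: "10.0.0.1", port: 9100, appId: "shop" },
  { host: "10.0.0.2", port: 9100, appId: "shop" },
  { host: "10.0.0.3", port: 9100, appId: "pay" },
]);
console.log(JSON.stringify(discovered, null, 2));
```

Serving this JSON from a small HTTP endpoint would let the scraper pick up new exporter instances without a config reload.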

Development Design

3.1 Understanding Metrics

Metrics are exposed as plain‑text lines containing a name, labels, and a value, for example:

<code># HELP metric1 This is metric 1
# TYPE metric1 counter
metric1{label1="value1", label2="value2"} 123
metric1{label1="value1", label2="value3"} 456</code>
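
To make the format concrete, here is a minimal sketch that renders samples into those text lines. The helper names (`escapeLabelValue`, `formatSample`, `formatCounter`) are illustrative, not part of our exporter or of any library.

```typescript
type Labels = Record<string, string>;

// Escape a label value for the text format: backslash, double quote,
// and newline must be escaped.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one sample line: name{label="value",...} value
function formatSample(name: string, labels: Labels, value: number): string {
  const pairs = Object.entries(labels)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)) // deterministic label order
    .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`);
  const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
  return `${name}${labelPart} ${value}`;
}

// Render a counter family with its HELP and TYPE header lines.
function formatCounter(
  name: string,
  help: string,
  samples: { labels: Labels; value: number }[],
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    lines.push(formatSample(name, sample.labels, sample.value));
  }
  return lines.join("\n");
}

console.log(
  formatCounter("metric1", "This is metric 1", [
    { labels: { label1: "value1", label2: "value2" }, value: 123 },
    { labels: { label1: "value1", label2: "value3" }, value: 456 },
  ]),
);
```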

Actual examples are shown in the following image.

Metric example

Because metrics are pulled over HTTP, each response must stay small enough to avoid timeouts. Our design principles: limit label cardinality (for example, convert IPs to cities), control the number of label combinations, and expose no more than 200k metric rows per 30-second pull.
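
A rough way to reason about these limits is to multiply the number of distinct values each label can take, then cap labels that could grow without bound. The sketch below is illustrative only: the function names, the `LabelValueLimiter` class, the "other" overflow bucket, and the cardinality figures are assumptions, not our production code.

```typescript
// Scrape budget stated in the text: at most 200k rows per pull.
const MAX_ROWS_PER_SCRAPE = 200_000;

// Worst-case series count for one metric: the product of the number of
// distinct values each label can take.
function estimateSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// True when the worst-case total across all metrics stays within budget.
function fitsScrapeBudget(metrics: Record<string, number>[]): boolean {
  const total = metrics.reduce((sum, m) => sum + estimateSeries(m), 0);
  return total <= MAX_ROWS_PER_SCRAPE;
}

// Caps the distinct values of one label; once the cap is reached, unseen
// values collapse into a single overflow bucket.
class LabelValueLimiter {
  private readonly seen = new Set<string>();
  private readonly maxValues: number;
  private readonly overflow: string;

  constructor(maxValues: number, overflow = "other") {
    this.maxValues = maxValues;
    this.overflow = overflow;
  }

  normalize(value: string): string {
    if (this.seen.has(value)) return value;
    if (this.seen.size < this.maxValues) {
      this.seen.add(value);
      return value;
    }
    return this.overflow;
  }
}

// Made-up cardinalities: raw client IPs blow the budget, cities do not.
console.log(fitsScrapeBudget([{ ip: 50_000, page: 20, status: 5 }])); // false
console.log(fitsScrapeBudget([{ city: 300, page: 20, status: 5 }])); // true
```

The limiter trades detail for safety: values that arrive after the cap is reached are no longer distinguishable from each other, but the endpoint size stays bounded.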

3.2 Application Design

The processing flow is:

Subscribe to queue data → Message dispatcher → Application‑level condition matching → Metric‑level condition matching → Metric generation.

Processing flow

Key components:

ApmConfigServer & MetricsConfigService: Listen to configuration changes and update in-memory metric settings.

Message Dispatcher: Provides a unified subscription interface; supports flexible matching conditions (equals, not equals, contains, regex) and generates static matcher functions for high-performance filtering.

ApmApp: Isolates metric data per application or per application-business pair, each exposing a metric endpoint discovered by VictoriaMetrics.

Metric: Supports COUNTER, SUM, HISTOGRAM, SUMMARY; includes safeguards against label explosion by limiting label values and wrapping the prom-client library.
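
The dispatcher's "static matcher functions" can be pictured as closures built once per configuration change and then reused for every message. The following is a minimal sketch of that idea; the `Condition` shape, the field names, and the AND-combination of conditions are assumptions made for illustration.

```typescript
type Operator = "equals" | "notEquals" | "contains" | "regex";

interface Condition {
  field: string;
  op: Operator;
  value: string;
}

type Message = Record<string, string | undefined>;
type Matcher = (msg: Message) => boolean;

// Turn one condition into a closure. The regex is compiled once here,
// not on every message.
function compileCondition(cond: Condition): Matcher {
  const { field, op, value } = cond;
  switch (op) {
    case "equals":
      return (msg) => msg[field] === value;
    case "notEquals":
      return (msg) => msg[field] !== value;
    case "contains":
      return (msg) => (msg[field] ?? "").includes(value);
    case "regex": {
      const re = new RegExp(value);
      return (msg) => re.test(msg[field] ?? "");
    }
  }
}

// Combine conditions with logical AND (an assumption about the semantics).
function compileMatcher(conditions: Condition[]): Matcher {
  const compiled = conditions.map(compileCondition);
  return (msg) => compiled.every((match) => match(msg));
}

const isFailedCheckoutCall = compileMatcher([
  { field: "type", op: "equals", value: "api" },
  { field: "url", op: "contains", value: "/checkout" },
  { field: "status", op: "regex", value: "^5\\d\\d$" },
]);

console.log(isFailedCheckoutCall({ type: "api", url: "/checkout/pay", status: "502" })); // true
console.log(isFailedCheckoutCall({ type: "api", url: "/checkout/pay", status: "200" })); // false
```

Compiling up front moves the per-operator branching and regex construction out of the hot path, which matters when every queue message passes through both application-level and metric-level matching.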

3.2.4 Chart Presentation Design

Metrics are queried via VictoriaMetrics' query (aggregate over a time range) and query_range (time-series for charts) APIs, then visualized as line, bar, pie, or distribution charts.

Chart types
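
As a sketch of how a chart request maps onto the two endpoints, the helpers below build URLs for the Prometheus-compatible `/api/v1/query` and `/api/v1/query_range` paths. The base URL, the metric name `page_view_total`, and the `browser` label are hypothetical; a cluster deployment would use a different path prefix.

```typescript
// Hypothetical address of a single-node VictoriaMetrics instance.
const DEFAULT_BASE = "http://victoria-metrics:8428";

// Instant query: evaluates the expression once, at `timeSec` (Unix seconds).
// Suits pie or bar charts that show one aggregated value per label.
function instantQueryUrl(expr: string, timeSec: number, base = DEFAULT_BASE): string {
  const params = new URLSearchParams({ query: expr, time: String(timeSec) });
  return `${base}/api/v1/query?${params.toString()}`;
}

// Range query: evaluates the expression every `stepSec` seconds between
// `startSec` and `endSec`. Suits line charts.
function rangeQueryUrl(
  expr: string,
  startSec: number,
  endSec: number,
  stepSec: number,
  base = DEFAULT_BASE,
): string {
  const params = new URLSearchParams({
    query: expr,
    start: String(startSec),
    end: String(endSec),
    step: String(stepSec),
  });
  return `${base}/api/v1/query_range?${params.toString()}`;
}

const now = 1_700_000_000; // fixed timestamp so the output is reproducible

// Pie chart: page views per browser over the last hour.
console.log(instantQueryUrl("sum by (browser) (increase(page_view_total[1h]))", now));

// Line chart: page-view rate over the last hour, one point per minute.
console.log(rangeQueryUrl("sum(rate(page_view_total[5m]))", now - 3600, now, 60));
```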

3.2.5 Alarm Design

Custom alarms are essential. Rules can be configured via a semantic UI for simple cases or raw PromQL/MetricQL expressions for complex scenarios. Supported conditions include count, rate, period‑over‑period, and thresholds. Multiple channels (SMS, WeChat Work, phone, robot) are available, with claim and suppression capabilities.
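
To illustrate how a semantic rule could translate into an expression, here is a minimal sketch covering count, rate, and period-over-period conditions. The `SemanticRule` shape, its field names, and the metric `js_error_total` are hypothetical; the real UI model is not described in this article.

```typescript
type SemanticRule =
  | { kind: "count"; metric: string; windowMin: number; threshold: number }
  | { kind: "rate"; metric: string; windowMin: number; threshold: number }
  | { kind: "periodOverPeriod"; metric: string; windowMin: number; offset: string; ratio: number };

// Translate a semantic rule into a PromQL/MetricsQL alert expression.
function toExpr(rule: SemanticRule): string {
  const window = `${rule.windowMin}m`;
  switch (rule.kind) {
    case "count":
      // Total increase of the counter within the window exceeds a threshold.
      return `sum(increase(${rule.metric}[${window}])) > ${rule.threshold}`;
    case "rate":
      // Per-second rate averaged over the window exceeds a threshold.
      return `sum(rate(${rule.metric}[${window}])) > ${rule.threshold}`;
    case "periodOverPeriod":
      // Current window compared with the same window `offset` ago.
      return (
        `sum(increase(${rule.metric}[${window}])) / ` +
        `sum(increase(${rule.metric}[${window}] offset ${rule.offset})) > ${rule.ratio}`
      );
  }
}

console.log(toExpr({ kind: "rate", metric: "js_error_total", windowMin: 5, threshold: 10 }));
console.log(
  toExpr({ kind: "periodOverPeriod", metric: "js_error_total", windowMin: 5, offset: "1d", ratio: 1.5 }),
);
```

Rules that do not fit such a template are where the raw-expression option comes in.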

VMAlert reads rules from Kubernetes CRDs (VMRule), so rules can be synchronized in real time without direct file access.

VMAlert rule sync
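
For reference, a rule resource of this kind can be assembled as a plain object before it is submitted to the Kubernetes API. This sketch assumes the VictoriaMetrics operator's `VMRule` resource; the naming scheme (`apm-<appId>`), namespace, labels, and the example rule are hypothetical, and the actual submission step is left out.

```typescript
interface AlertRule {
  alert: string; // alert name
  expr: string; // PromQL/MetricsQL expression
  for: string; // how long the condition must hold before firing
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

// Build a VMRule manifest holding all alert rules of one application.
function buildVMRule(appId: string, namespace: string, rules: AlertRule[]) {
  return {
    apiVersion: "operator.victoriametrics.com/v1beta1",
    kind: "VMRule",
    metadata: { name: `apm-${appId}`, namespace },
    spec: {
      groups: [{ name: `apm-${appId}`, rules }],
    },
  };
}

const manifest = buildVMRule("shop", "monitoring", [
  {
    alert: "HighJsErrorRate",
    expr: "sum(rate(js_error_total[5m])) > 10",
    for: "5m",
    labels: { severity: "warning", app: "shop" },
    annotations: { summary: "JS error rate is above 10/s" },
  },
]);

console.log(JSON.stringify(manifest, null, 2));
```

Keeping one resource per application means a rule change only rewrites that application's object.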

Practice Results

4.1 Data Display

Charts provide clear visibility of trends, anomalies, and performance bottlenecks, enabling continuous optimization of stability and user experience.

Examples include overview dashboards, page‑performance charts, request‑distribution charts, and Grafana visualizations.

APM overview

4.2 Alarm Effectiveness

With dozens of custom rules in place, daily alarm volume fell from hundreds to dozens, which suggests that front-end quality has improved.

Alarm statistics

Results & Advantages

Aggregation Performance: The new APM offers flexible, drag-and-drop queries and uses a dedicated metric store for faster chart rendering.

Aggregation Capability: Supports large-scale aggregation suitable for dashboard monitoring.

Multi-Application Types: Handles Node.js, H5, mini-program, and future app types with tailored views.

Chart Capability: Generates various chart types and integrates with third-party tools like Grafana.

Alarm Dimensions: Allows user-defined rules, multiple channels, claim, and suppression features.

The new APM dramatically improves performance, aggregation, type adaptation, charting, and alarm dimensions.

Project Integration

Integrating an application requires three simple steps:

Create the application (choose type, owner, default alarm contacts) and obtain a reporting token.

Insert initialization code at the entry point, e.g.:

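<code>import { init } from '@xxx/node-agent'; // internal only

// Initialize the agent once at the application entry point.
global.tracer = init({
  service: 'your-app-name',     // application name created in step 1
  token: 'application-key',     // reporting token obtained in step 1
  ignorePath: ['/healthcheck']  // paths to ignore, e.g. health checks
});</code>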

View default charts in the APM console, then create custom charts and alarm rules as needed.

Future Plans

Goals include scaling message processing to more than 20k msg/s, extending metric retention for long-term trend analysis, and SaaS-ifying the platform for third-party developers.
