Rebuilding Our APM: Scalable Metrics & Alerts with VictoriaMetrics & VMAlert

This article details the complete redesign of our internal APM system, covering the motivations, architecture choices, metric collection pipeline, integration of VictoriaMetrics and VMAlert, metric and alert design principles, implementation steps, visualizations, performance gains, and future plans for scaling and SaaS‑ification.

Weimob Technology Center

Dec 26, 2023

Background

As the company grew, the original APM system showed many pain points: unsatisfactory performance statistics, single‑dimensional analysis, frequent trace loss, inflexible alarm configuration, and limited alarm rules. With dozens of open issues and requirements that the old system could no longer absorb, a new APM‑v3 project was launched at the beginning of the year.

The author was responsible for metrics and alarm implementations, which are core to the APM system alongside trace data. The original alarm logic ran in a consumer, could not aggregate metrics, had rigid rule configuration and single‑channel notifications, and lacked claim and suppression capabilities. The new design adopts a Prometheus‑like solution, ultimately selecting VictoriaMetrics + VMAlert.

Architecture Design

After component selection, the overall architecture for the metric service was defined:

Front‑end applications report data to various message queues (web, mini‑program, Node.js, app) using the OpenTelemetry trace format, so the same data serves both call‑chain recording and metric extraction.

An exporter service consumes queue data, computes metrics according to APM configuration, and registers with VictoriaMetrics service discovery.

VictoriaMetrics pulls the metrics at regular intervals and persists them.

The alarm workflow is shown in the lower part of the architecture diagram.

Development Design

3.1 Understanding Metrics

Metrics are exposed as plain‑text lines containing a name, labels, and a value, for example:

# HELP metric1 This is metric 1
# TYPE metric1 counter
metric1{label1="value1", label2="value2"} 123
metric1{label1="value1", label2="value3"} 456
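As a rough illustration of the format above, the following sketch renders one metric family as exposition text. The function names and the input shape are hypothetical and are not taken from the exporter described in this article; the escaping rules (backslash, double quote, newline in label values) follow the Prometheus text format.

```javascript
// Escape a label value for the text format: backslash, double quote, newline.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one metric family (hypothetical input shape) as exposition text.
// Assumes the help text contains no backslashes or newlines.
function renderMetric({ name, help, type, samples }) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} ${type}`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

console.log(
  renderMetric({
    name: 'metric1',
    help: 'This is metric 1',
    type: 'counter',
    samples: [
      { labels: { label1: 'value1', label2: 'value2' }, value: 123 },
      { labels: { label1: 'value1', label2: 'value3' }, value: 456 },
    ],
  })
);
```

Each distinct label combination becomes its own row, which is why the number of combinations drives the payload size.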

Actual examples are shown in the following image.

Because metrics are pulled over HTTP, the payload of each pull must be bounded to avoid timeouts. The design principles are: limit label cardinality (for example, map IP addresses to cities), control the number of label combinations, and expose no more than 200k metric rows per 30‑second pull.
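A minimal sketch of how these principles could be checked in code. Only the 200k‑row budget comes from the text; the helper names, the label names, the cardinalities, and the "other" overflow bucket are all assumptions made for illustration.

```javascript
// Row budget per pull, as stated in the design principles.
const SERIES_BUDGET = 200000;

// Upper bound on rows for one metric: the product of the number of
// distinct values each label can take.
function estimateSeries(cardinalityByLabel) {
  return Object.values(cardinalityByLabel).reduce((product, n) => product * n, 1);
}

// Caps how many distinct values a label may take. Once the cap is reached,
// unseen values collapse into a single overflow bucket (hypothetical policy).
class LabelValueLimiter {
  constructor(maxValues, overflowValue = 'other') {
    this.maxValues = maxValues;
    this.overflowValue = overflowValue;
    this.seen = new Set();
  }

  normalize(value) {
    if (this.seen.has(value)) return value;
    if (this.seen.size < this.maxValues) {
      this.seen.add(value);
      return value;
    }
    return this.overflowValue;
  }
}

// Hypothetical label set: city instead of raw IP keeps the product small.
const worstCase = estimateSeries({ city: 300, page: 50, status: 5 });
console.log(worstCase <= SERIES_BUDGET ? 'within budget' : 'over budget');
```

Replacing a raw IP label with a city label matters because the estimate is a product: one unbounded label is enough to exceed the budget regardless of the others.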

3.2 Application Design

The processing flow is:

Subscribe to queue data → Message dispatcher → Application‑level condition matching → Metric‑level condition matching → Metric generation.

Key components:

ApmConfigServer & MetricsConfigService : Listen to configuration changes and update in‑memory metric settings.

Message Dispatcher : Provides a unified subscription interface; supports flexible matching conditions (equals, not equals, contains, regex) and generates static matcher functions for high‑performance filtering.

ApmApp : Isolates metric data per application or per application‑business pair, each exposing a metric endpoint discovered by VictoriaMetrics.

Metric : Supports COUNTER, SUM, HISTOGRAM, SUMMARY; includes safeguards against label explosion by limiting label values and wrapping the prom‑client library.
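The matching step of the dispatcher can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the condition shape, the operator names, and the message fields are hypothetical, and the sketch precompiles closures once per rule rather than generating function source code.

```javascript
// Build a predicate for one condition. The four operators mirror the ones
// named in the component list: equals, not equals, contains, regex.
function compileCondition({ field, op, value }) {
  switch (op) {
    case 'equals':
      return (msg) => msg[field] === value;
    case 'notEquals':
      return (msg) => msg[field] !== value;
    case 'contains':
      return (msg) => typeof msg[field] === 'string' && msg[field].includes(value);
    case 'regex': {
      // Compile the pattern once, when the rule is loaded, not per message.
      const pattern = new RegExp(value);
      return (msg) => typeof msg[field] === 'string' && pattern.test(msg[field]);
    }
    default:
      throw new Error(`Unsupported operator: ${op}`);
  }
}

// Combine conditions with AND semantics into a single reusable matcher.
function compileMatcher(conditions) {
  const predicates = conditions.map(compileCondition);
  return (msg) => predicates.every((predicate) => predicate(msg));
}

// Hypothetical rule: server errors on checkout pages of web applications.
const isCheckoutServerError = compileMatcher([
  { field: 'appType', op: 'equals', value: 'web' },
  { field: 'url', op: 'contains', value: '/checkout' },
  { field: 'status', op: 'regex', value: '^5\\d\\d$' },
]);

console.log(isCheckoutServerError({ appType: 'web', url: '/checkout/pay', status: '502' }));
```

Compiling once when configuration changes keeps the per‑message cost to a few function calls, which is the point of doing the work up front.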

3.2.4 Chart Presentation Design

Metrics are queried through two VictoriaMetrics APIs: query, an instant query used here to compute a single aggregate over a time range, and query_range, which returns the time series behind the charts. The results are visualized as line, bar, pie, or distribution charts.
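To make the difference between the two calls concrete, here is a sketch that only builds the request URLs for the Prometheus‑compatible endpoints /api/v1/query and /api/v1/query_range. The base URL, the metric name page_view_total, and the helper names are assumptions; the sketch also assumes a base URL without a path prefix, so a cluster deployment with a prefixed select path would need adjusting.

```javascript
// Instant query: evaluates the expression once, at `timeSec` (Unix seconds).
function buildInstantQueryUrl(baseUrl, promql, timeSec) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', promql);
  url.searchParams.set('time', String(timeSec));
  return url.toString();
}

// Range query: evaluates the expression at every `stepSec` between start and end.
function buildRangeQueryUrl(baseUrl, promql, startSec, endSec, stepSec) {
  const url = new URL('/api/v1/query_range', baseUrl);
  url.searchParams.set('query', promql);
  url.searchParams.set('start', String(startSec));
  url.searchParams.set('end', String(endSec));
  url.searchParams.set('step', String(stepSec));
  return url.toString();
}

const base = 'http://victoria-metrics.example:8428'; // hypothetical address
const now = Math.floor(Date.now() / 1000);

// One number for a pie or summary tile: total page views over the last hour.
console.log(buildInstantQueryUrl(base, 'sum(increase(page_view_total[1h]))', now));

// One point per minute for a line chart over the same hour.
console.log(buildRangeQueryUrl(base, 'sum(rate(page_view_total[5m]))', now - 3600, now, 60));
```

The time range of an instant query lives inside the expression (the [1h] window), whereas a range query takes it from the start, end, and step parameters.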

3.2.5 Alarm Design

Custom alarms are essential. Rules can be configured via a semantic UI for simple cases or raw PromQL/MetricQL expressions for complex scenarios. Supported conditions include count, rate, period‑over‑period, and thresholds. Multiple channels (SMS, WeChat Work, phone, robot) are available, with claim and suppression capabilities.
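One way a semantic rule from the UI could be translated into an expression is sketched below. The rule shape, the kind names, and the metric js_error_total are hypothetical; the generated expressions use only standard PromQL (increase, rate, sum, and the offset modifier), which MetricsQL also accepts.

```javascript
// Build a series selector such as js_error_total{app="shop"}.
// Label values are assumed to need no escaping in this sketch.
function selector(metric, labels = {}) {
  const pairs = Object.entries(labels)
    .map(([key, value]) => `${key}="${value}"`)
    .join(',');
  return pairs ? `${metric}{${pairs}}` : metric;
}

// Translate a semantic rule (hypothetical shape) into an alert expression.
function buildAlertExpr(rule) {
  const series = selector(rule.metric, rule.labels);
  const { window, operator, threshold } = rule;
  switch (rule.kind) {
    case 'count': // events within the window
      return `sum(increase(${series}[${window}])) ${operator} ${threshold}`;
    case 'rate': // events per second, averaged over the window
      return `sum(rate(${series}[${window}])) ${operator} ${threshold}`;
    case 'periodOverPeriod': {
      // Relative change against the same window one `offset` earlier.
      const current = `sum(rate(${series}[${window}]))`;
      const previous = `sum(rate(${series}[${window}] offset ${rule.offset}))`;
      return `(${current} - ${previous}) / ${previous} ${operator} ${threshold}`;
    }
    case 'threshold': // raw value compared against a fixed bound
      return `${series} ${operator} ${threshold}`;
    default:
      throw new Error(`Unsupported rule kind: ${rule.kind}`);
  }
}

console.log(
  buildAlertExpr({
    kind: 'count',
    metric: 'js_error_total',
    labels: { app: 'shop' },
    window: '5m',
    operator: '>',
    threshold: 100,
  })
);
```

Rules that do not fit one of these shapes would be written directly as raw expressions, as the text describes.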

In Kubernetes, alert rules are declared as VMRule custom resources, which the VictoriaMetrics operator delivers to VMAlert. Rules can therefore be synchronized in near real time without direct file access.
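A sketch of what such a rule resource could look like when built from an alert definition. The apiVersion and kind are those of the VictoriaMetrics operator's VMRule resource; the helper name, the resource names, the namespace, the labels, and the expression are hypothetical.

```javascript
// Build a VMRule manifest holding a single alerting rule (hypothetical helper).
function buildVmRule({ name, namespace, groupName, alert, expr, forDuration, severity, summary }) {
  return {
    apiVersion: 'operator.victoriametrics.com/v1beta1',
    kind: 'VMRule',
    metadata: { name, namespace },
    spec: {
      groups: [
        {
          name: groupName,
          rules: [
            {
              alert,
              expr,
              for: forDuration, // how long the condition must hold before firing
              labels: { severity },
              annotations: { summary },
            },
          ],
        },
      ],
    },
  };
}

const manifest = buildVmRule({
  name: 'shop-js-errors',
  namespace: 'apm',
  groupName: 'shop.frontend',
  alert: 'ShopJsErrorsHigh',
  expr: 'sum(increase(js_error_total{app="shop"}[5m])) > 100',
  forDuration: '5m',
  severity: 'warning',
  summary: 'JS errors above 100 in 5 minutes',
});

// Kubernetes accepts JSON manifests, so the object can be serialized as-is.
console.log(JSON.stringify(manifest, null, 2));
```

Creating, updating, or deleting such a resource through the Kubernetes API is what lets an alarm service manage rules without writing rule files itself.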

Practice Results

4.1 Data Display

Charts provide clear visibility of trends, anomalies, and performance bottlenecks, enabling continuous optimization of stability and user experience.

Examples include overview dashboards, page‑performance charts, request‑distribution charts, and Grafana visualizations.

4.2 Alarm Effectiveness

With dozens of custom rules in place, daily alarm volume dropped from hundreds to dozens, indicating improved front‑end quality.

Results & Advantages

Aggregation Performance : New APM offers flexible, drag‑and‑drop queries and leverages a dedicated metric store for faster chart rendering.

Aggregation Capability : Supports large‑scale aggregation suitable for dashboard monitoring.

Multi‑Application Types : Handles Node.js, H5, mini‑program, and future APP types with tailored views.

Chart Capability : Generates various chart types and integrates with third‑party tools like Grafana.

Alarm Dimensions : Allows user‑defined rules, multiple channels, claim, and suppression features.

The new APM dramatically improves performance, aggregation, type adaptation, charting, and alarm dimensions.

Project Integration

Integrating an application requires three simple steps:

Create the application (choose type, owner, default alarm contacts) and obtain a reporting token.

Insert initialization code at the entry point, e.g.:

import { init } from '@xxx/node-agent'; // internal only
global.tracer = init({
  service: 'your-app-name',
  token: 'application-key',
  ignorePath: ['/healthcheck']
});

View default charts in the APM console, then create custom charts and alarm rules as needed.

Future Plans

Goals include scaling message processing beyond 20k msg/s, extending metric retention for long‑term trend analysis, and turning the platform into a SaaS offering for third‑party developers.
