Rebuilding Our APM: Scalable Metrics & Alerts with VictoriaMetrics & VMAlert
This article details the complete redesign of our internal APM system, covering the motivations, architecture choices, metric collection pipeline, integration of VictoriaMetrics and VMAlert, metric and alert design principles, implementation steps, visualizations, performance gains, and future plans for scaling and SaaS‑ification.
Background
As the company grew, the original APM system showed many pain points: unsatisfactory performance statistics, single‑dimensional analysis, frequent trace loss, inflexible alarm configuration, and limited alarm rules. Dozens of open issues and feature requests could no longer be addressed within the old system, so a new APM‑v3 project was launched at the beginning of the year.
The author was responsible for metrics and alarm implementations, which are core to the APM system alongside trace data. The original alarm logic ran in a consumer, could not aggregate metrics, had rigid rule configuration and single‑channel notifications, and lacked claim and suppression capabilities. The new design adopts a Prometheus‑like solution, ultimately selecting VictoriaMetrics + VMAlert.
Architecture Design
After component selection, the overall architecture for the metric service was defined:
Front‑end applications report trace‑compatible data to various message queues (WEB, mini‑program, nodejs, APP) using the OpenTelemetry trace format, enabling both call‑chain recording and metric extraction.
An exporter service consumes queue data, computes metrics according to APM configuration, and registers with VictoriaMetrics service discovery.
VictoriaMetrics pulls the metrics at regular intervals and persists them.
The alarm workflow is shown at the bottom of the diagram.
Development Design
3.1 Understanding Metrics
Metrics are exposed as plain‑text lines containing a name, labels, and a value, for example:
<code># HELP metric1 This is metric 1
# TYPE metric1 counter
metric1{label1="value1", label2="value2"} 123
metric1{label1="value1", label2="value3"} 456</code>
Actual examples are shown in the following image.
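To make the line format concrete, here is a minimal sketch of a serializer that produces the same kind of output for a counter. The `Sample` type and the `renderCounter` and `escapeLabelValue` helpers are hypothetical names used only for illustration; in the real exporter this job is presumably done by the wrapped prom‑client library.

```typescript
// Hypothetical types for illustration; not part of any library.
type Labels = Record<string, string>;

interface Sample {
  labels: Labels;
  value: number;
}

// Escape backslash, double quote, and newline inside a label value.
// Backslashes are handled first so the escapes added later are not doubled.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one counter: a HELP line, a TYPE line, then one line per label set.
function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    const pairs = Object.keys(sample.labels)
      .map((key) => `${key}="${escapeLabelValue(sample.labels[key])}"`)
      .join(",");
    lines.push(pairs ? `${name}{${pairs}} ${sample.value}` : `${name} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}

const text = renderCounter("metric1", "This is metric 1", [
  { labels: { label1: "value1", label2: "value2" }, value: 123 },
  { labels: { label1: "value1", label2: "value3" }, value: 456 },
]);
```

Each distinct combination of label values becomes its own line, which is why the number of label combinations directly determines the size of the scraped payload.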
When metrics are pulled over HTTP, the payload size must be bounded to avoid scrape timeouts. The design principles are: limit label cardinality (e.g., convert IPs to cities), control the number of label combinations, and pull no more than 200k metric rows per 30‑second scrape.
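These principles can be checked with simple arithmetic: the worst‑case number of rows for one metric is the product of the distinct values of each label. The sketch below estimates that product against the 200k budget and shows one possible way to cap a label's values. `worstCaseSeries`, `LabelValueLimiter`, the label names, and the cardinalities are all assumptions for illustration, not the exporter's actual code.

```typescript
// Upper bound on rows per scrape, taken from the design principle above.
const MAX_ROWS_PER_SCRAPE = 200000;

// Worst case for one metric: the product of the number of distinct values
// that each label can take.
function worstCaseSeries(labelCardinalities: Record<string, number>): number {
  let total = 1;
  for (const key of Object.keys(labelCardinalities)) {
    total *= labelCardinalities[key];
  }
  return total;
}

// Hypothetical guard: keeps the first `limit` distinct values seen for a label
// and folds every later value into a single overflow bucket.
class LabelValueLimiter {
  private readonly seen = new Set<string>();
  private readonly limit: number;
  private readonly overflow: string;

  constructor(limit: number, overflow: string = "other") {
    this.limit = limit;
    this.overflow = overflow;
  }

  normalize(value: string): string {
    if (this.seen.has(value)) return value;
    if (this.seen.size < this.limit) {
      this.seen.add(value);
      return value;
    }
    return this.overflow;
  }
}

// Assumed cardinalities: a raw IP label versus the same traffic grouped by city.
const byIp = worstCaseSeries({ ip: 100000, page: 50, status: 5 });
const byCity = worstCaseSeries({ city: 300, page: 50, status: 5 });
const ipFits = byIp <= MAX_ROWS_PER_SCRAPE;
const cityFits = byCity <= MAX_ROWS_PER_SCRAPE;

// With a limit of 2, the third distinct city is folded into "other".
const cityLimiter = new LabelValueLimiter(2);
const cities = ["Shanghai", "Beijing", "Shenzhen", "Shanghai"].map((c) => cityLimiter.normalize(c));
```

Under these assumed numbers the raw IP label yields 25,000,000 potential rows, far over budget, while the city label yields 75,000 and fits.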
3.2 Application Design
The processing flow is:
Subscribe to queue data → Message dispatcher → Application‑level condition matching → Metric‑level condition matching → Metric generation.
Key components:
ApmConfigServer & MetricsConfigService : Listen to configuration changes and update in‑memory metric settings.
Message Dispatcher : Provides a unified subscription interface; supports flexible matching conditions (equals, not equals, contains, regex) and generates static matcher functions for high‑performance filtering.
ApmApp : Isolates metric data per application or per application‑business pair, each exposing a metric endpoint discovered by VictoriaMetrics.
Metric : Supports COUNTER, SUM, HISTOGRAM, SUMMARY; includes safeguards against label explosion by limiting label values and wrapping the prom‑client library.
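The "static matcher function" idea in the Message Dispatcher can be sketched as follows: each configured condition is compiled once into a closure, so per‑message filtering is a plain function call with no operator lookup and no repeated regex construction. The `Condition` shape, the operator names, and the field names are assumptions for illustration; the real configuration format is not shown in this article.

```typescript
// Hypothetical condition shape; the operators mirror the four listed above.
type Operator = "equals" | "notEquals" | "contains" | "regex";

interface Condition {
  field: string;
  op: Operator;
  value: string;
}

type Message = Record<string, string | undefined>;
type Matcher = (message: Message) => boolean;

// Compile one condition into a closure. The RegExp is built here, once,
// rather than on every message.
function compileCondition(condition: Condition): Matcher {
  const { field, op, value } = condition;
  switch (op) {
    case "equals":
      return (m) => m[field] === value;
    case "notEquals":
      return (m) => m[field] !== value;
    case "contains":
      return (m) => (m[field] ?? "").indexOf(value) !== -1;
    case "regex": {
      const pattern = new RegExp(value);
      return (m) => pattern.test(m[field] ?? "");
    }
    default:
      throw new Error(`Unknown operator: ${String(op)}`);
  }
}

// Combine conditions with logical AND: a message matches only if all hold.
function compileMatcher(conditions: Condition[]): Matcher {
  const compiled = conditions.map(compileCondition);
  return (m) => compiled.every((match) => match(m));
}

const isCheckoutTraffic = compileMatcher([
  { field: "app", op: "equals", value: "shop-h5" },
  { field: "path", op: "regex", value: "^/checkout" },
  { field: "env", op: "notEquals", value: "test" },
]);

const matched = isCheckoutTraffic({ app: "shop-h5", path: "/checkout/pay", env: "prod" });
```

The same compiled matcher can be reused at both the application level and the metric level of the flow, and recompiled whenever the configuration services push a change.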
3.2.4 Chart Presentation Design
Metrics are queried via VictoriaMetrics’ query (aggregate over a time range) and query_range (time‑series for charts) APIs, then visualized as line, bar, pie, or distribution charts.
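As a rough illustration of the two calls, the sketch below builds request URLs for the Prometheus‑compatible `/api/v1/query` and `/api/v1/query_range` endpoints that VictoriaMetrics exposes. The base URL, the metric name `page_load_total`, and the helper names are assumptions; a cluster deployment would use a different path prefix.

```typescript
// Encode a flat parameter object as a URL query string.
function toQueryString(params: Record<string, string | number>): string {
  return Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(params[key]))}`)
    .join("&");
}

// Instant query: evaluates the expression at one timestamp. Wrapping a range
// selector such as [1h] in the expression yields one aggregate for that window.
function instantQueryUrl(baseUrl: string, expr: string, timeSec: number): string {
  return `${baseUrl}/api/v1/query?${toQueryString({ query: expr, time: timeSec })}`;
}

// Range query: evaluates the expression at every `stepSec` between start and end,
// returning one point per step, which is what a line chart needs.
function rangeQueryUrl(
  baseUrl: string,
  expr: string,
  startSec: number,
  endSec: number,
  stepSec: number,
): string {
  const params = { query: expr, start: startSec, end: endSec, step: stepSec };
  return `${baseUrl}/api/v1/query_range?${toQueryString(params)}`;
}

// Assumed single-node address and hypothetical metric name.
const base = "http://victoria-metrics:8428";
const end = 1700003600;

// One number for a summary tile: total page loads in the last hour.
const totalUrl = instantQueryUrl(base, "sum(increase(page_load_total[1h]))", end);

// A series for a line chart: per-second load rate, one point per minute.
const chartUrl = rangeQueryUrl(base, "sum(rate(page_load_total[1m]))", end - 3600, end, 60);
```

Pie and bar charts typically map to the instant form (one value per label group), while line charts map to the range form.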
3.2.5 Alarm Design
Custom alarms are essential. Rules can be configured via a semantic UI for simple cases or as raw PromQL/MetricQL expressions for complex scenarios. Supported conditions include count, rate, period‑over‑period, and thresholds. Multiple notification channels (SMS, WeChat Work, phone call, chat bot) are available, and alarms can be claimed (acknowledged by an owner) and suppressed.
VMAlert rules are managed as Kubernetes CRDs (VMRule), allowing real‑time rule synchronization without direct file access.
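A minimal sketch of how a rule from the semantic UI could be turned into an expression and wrapped in a VMRule manifest is shown below. The `SemanticRule` shape, the `app` label, the metric name, and the group naming are assumptions for illustration; only the manifest layout (`apiVersion`, `kind`, `spec.groups[].rules[]`) follows the VictoriaMetrics operator's VMRule resource.

```typescript
// Hypothetical shape of a rule as captured by the semantic UI.
interface SemanticRule {
  name: string;
  metric: string; // counter to evaluate
  app: string; // value of the assumed `app` label
  kind: "count" | "rate" | "periodOverPeriod";
  window: string; // e.g. "5m"
  compareOffset: string; // used by periodOverPeriod only, e.g. "1d"
  comparator: ">" | "<";
  threshold: number;
  holdFor: string; // how long the condition must hold before firing
}

// Translate the semantic rule into a PromQL/MetricsQL expression.
function toExpr(rule: SemanticRule): string {
  const selector = `${rule.metric}{app="${rule.app}"}[${rule.window}]`;
  let left: string;
  if (rule.kind === "rate") {
    left = `sum(rate(${selector}))`;
  } else if (rule.kind === "count") {
    left = `sum(increase(${selector}))`;
  } else {
    // Ratio of the current window to the same window one offset earlier.
    left = `sum(increase(${selector})) / sum(increase(${selector} offset ${rule.compareOffset}))`;
  }
  return `${left} ${rule.comparator} ${rule.threshold}`;
}

// Wrap the expression in a VMRule object ready to be serialized and applied.
function toVmRule(rule: SemanticRule, namespace: string) {
  return {
    apiVersion: "operator.victoriametrics.com/v1beta1",
    kind: "VMRule",
    // Kubernetes object names must be lowercase.
    metadata: { name: rule.name.toLowerCase(), namespace },
    spec: {
      groups: [
        {
          name: `apm-${rule.app}`,
          rules: [
            {
              alert: rule.name,
              expr: toExpr(rule),
              for: rule.holdFor,
              labels: { app: rule.app },
            },
          ],
        },
      ],
    },
  };
}

const errorSpike: SemanticRule = {
  name: "JsErrorSpike",
  metric: "js_error_total",
  app: "shop-h5",
  kind: "count",
  window: "5m",
  compareOffset: "1d",
  comparator: ">",
  threshold: 100,
  holdFor: "2m",
};

const manifest = toVmRule(errorSpike, "monitoring");
```

For a period‑over‑period rule, the threshold is a ratio: a value of 2 with `compareOffset` set to `1d` would fire when the current window is more than double the same window a day earlier.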
Practice Results
4.1 Data Display
Charts provide clear visibility of trends, anomalies, and performance bottlenecks, enabling continuous optimization of stability and user experience.
Examples include overview dashboards, page‑performance charts, request‑distribution charts, and Grafana visualizations.
4.2 Alarm Effectiveness
With dozens of custom rules in place, daily alarm volume dropped from hundreds to dozens, indicating improved front‑end quality.
Results & Advantages
Aggregation Performance : New APM offers flexible, drag‑and‑drop queries and leverages a dedicated metric store for faster chart rendering.
Aggregation Capability : Supports large‑scale aggregation suitable for dashboard monitoring.
Multi‑Application Types : Handles node, H5, mini‑program, and future APP types with tailored views.
Chart Capability : Generates various chart types and integrates with third‑party tools like Grafana.
Alarm Dimensions : Allows user‑defined rules, multiple channels, claim, and suppression features.
The new APM dramatically improves performance, aggregation, type adaptation, charting, and alarm dimensions.
Project Integration
Integrating an application requires three simple steps:
Create the application (choose type, owner, default alarm contacts) and obtain a reporting token.
Insert initialization code at the entry point, e.g.:
<code>import { init } from '@xxx/node-agent'; // internal only
global.tracer = init({
service: 'your-app-name',
token: 'application-key',
ignorePath: ['/healthcheck']
});</code>
View default charts in the APM console, then create custom charts and alarm rules as needed.
Future Plans
Goals include scaling the message processing to >20 k msg/s, extending metric retention for long‑term trend analysis, and SaaS‑ifying the platform for third‑party developers.
Weimob Technology Center
Official platform of the Weimob Technology Center