
Why Use Prometheus and How It Guarantees Business System Stability

This article explains the motivations for adopting Prometheus, introduces its core components and metric types, and demonstrates how comprehensive monitoring of business‑critical data, failure events, QPS, latency, and underlying resources can improve system stability and accelerate fault response.

Zhuanzhuan Tech

1. Why Use Prometheus

The recycling platform interacts heavily with external channels and relies on numerous asynchronous MQ processes, making it essential to record interface calls and MQ consumption to detect issues early.

Current problems include insufficient proactive problem detection, limited notification coverage, lack of visual dashboards, and slow troubleshooting due to missing auxiliary tools.

2. Prometheus Overview

2.1 Core Components

Prometheus server : a time‑series database that scrapes, stores, and queries metrics, exposing a query API and built‑in UI for PromQL.

Exporter : an HTTP endpoint that exposes service‑specific metrics for the server to pull.

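Alertmanager : receives the alerts fired by alerting rules evaluated on the Prometheus server, then deduplicates, groups, routes, and silences them before notifying receivers.

Pushgateway : a separate component for short-lived jobs that cannot be scraped directly; they push metrics to the gateway, and the server pulls from it.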

Service Discovery : dynamically discovers targets for monitoring.
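
To make the pull model concrete, here is a minimal sketch of an exporter-style endpoint built only on the JDK's `com.sun.net.httpserver.HttpServer`. The class name `MiniExporter`, the metric name `demo_requests_total`, and port 9400 are assumptions made for illustration; a real service would normally rely on a Prometheus client library rather than formatting the text by hand.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical hand-rolled exporter; not the Prometheus client library.
class MiniExporter {
    static final AtomicLong REQUESTS = new AtomicLong();

    // Renders one counter in the Prometheus text exposition format.
    static String render() {
        return "# HELP demo_requests_total Requests handled (illustrative).\n"
             + "# TYPE demo_requests_total counter\n"
             + "demo_requests_total " + REQUESTS.get() + "\n";
    }

    // Starts an HTTP server whose /metrics path a Prometheus server could scrape.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        REQUESTS.incrementAndGet();
        start(9400); // the port is an arbitrary choice for this sketch
    }
}
```

The server only answers when asked; the application never pushes anything, which is why the scrape interval on the Prometheus side determines the resolution of the stored data.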

2.2 Metric Types

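Counter : a monotonically increasing value that resets to zero only when the process restarts, e.g., request count.

Gauge : a value that can go up or down, e.g., current connections.

Histogram : counts observations into configurable buckets and also tracks their sum and count, so distributions and quantiles can be computed at query time, e.g., request latency.

Summary : similar to a histogram, but computes the configured quantiles on the client side; these precomputed quantiles cannot be meaningfully aggregated across instances.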
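
The histogram is the least intuitive of the four types, so the following plain-Java sketch models its bookkeeping: per-bucket counts that are exposed cumulatively, plus a running sum and count. `HistogramSketch` and the bucket bounds are assumptions for illustration; this is not the client library's implementation.

```java
import java.util.Arrays;

// Illustrative model of a Prometheus-style histogram; not the client library.
class HistogramSketch {
    private final double[] upperBounds; // sorted upper bounds ("le"), in seconds
    private final long[] bucketCounts;  // one slot per bound, plus a final +Inf slot
    private double sum;
    private long count;

    HistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.bucketCounts = new long[upperBounds.length + 1];
    }

    // Records one observation in the first bucket whose bound is >= value.
    synchronized void observe(double value) {
        int i = 0;
        while (i < upperBounds.length && value > upperBounds[i]) {
            i++;
        }
        bucketCounts[i]++;
        sum += value;
        count++;
    }

    // Cumulative count up to bucket `index`; index == number of bounds is the +Inf bucket.
    synchronized long cumulativeCount(int index) {
        long total = 0;
        for (int i = 0; i <= index; i++) {
            total += bucketCounts[i];
        }
        return total;
    }

    synchronized long count() { return count; }

    synchronized double sum() { return sum; }
}
```

With bounds of 0.1, 0.5, and 1.0 seconds, observing 0.05, 0.3, and 2.0 yields cumulative counts of 1, 2, 2, and 3 (the last being the +Inf bucket). Because buckets are plain counts, they can be summed across instances before a quantile is estimated, which is the main practical difference from a summary.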

3. How Prometheus Ensures System Stability

3.1 Monitoring Core Business Data

Track order creation volume to assess channel effectiveness and trigger alerts on spikes or prolonged inactivity.

Monitor order creation failures to quickly identify and resolve issues before they affect users.
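
The spike and inactivity checks on order volume can be sketched as a comparison of two samples of an order-creation counter. In practice this logic usually lives in alerting rules on the Prometheus side rather than in application code; the class `OrderVolumeCheck` and the threshold value are assumptions used only to show the reasoning.

```java
// Illustrative check over two samples of an order-creation counter.
class OrderVolumeCheck {
    enum Status { OK, SPIKE, INACTIVE }

    // Increase between two counter samples; a drop is treated as a counter reset,
    // so the current value is taken as the increase since the restart.
    static double increase(double previous, double current) {
        return current >= previous ? current - previous : current;
    }

    // Classifies the window: no new orders is INACTIVE, more than the threshold is SPIKE.
    static Status evaluate(double previous, double current, double spikeThreshold) {
        double delta = increase(previous, current);
        if (delta == 0) {
            return Status.INACTIVE;
        }
        if (delta > spikeThreshold) {
            return Status.SPIKE;
        }
        return Status.OK;
    }
}
```

The length of the window between the two samples decides what "prolonged inactivity" means, so it should be chosen per channel: a window that is normal for a low-traffic channel would hide an outage on a busy one.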

3.2 Improving Fault Response Efficiency

Monitor external‑partner interface exceptions to pinpoint root causes (network, partner API, or internal logic) and accelerate remediation.
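
One way to make the root cause visible is to count failures per partner and per cause. The sketch below assumes a hypothetical `PartnerApiException` for partner-side error responses and treats JDK socket exceptions as network failures; the class names and the three categories are illustrative, not the platform's actual code.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical failure counter keyed by partner and cause.
class PartnerFailureCounter {
    enum Cause { NETWORK, PARTNER_API, INTERNAL }

    // Hypothetical exception a caller would throw for a partner-side error response.
    static class PartnerApiException extends RuntimeException {
        PartnerApiException(String message) { super(message); }
    }

    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Maps an exception to a coarse cause; anything unrecognized counts as internal.
    static Cause classify(Throwable t) {
        if (t instanceof SocketTimeoutException || t instanceof ConnectException) {
            return Cause.NETWORK;
        }
        if (t instanceof PartnerApiException) {
            return Cause.PARTNER_API;
        }
        return Cause.INTERNAL;
    }

    // Increments the counter for this partner and the classified cause.
    void record(String partner, Throwable t) {
        counts.computeIfAbsent(partner + "|" + classify(t), k -> new LongAdder()).increment();
    }

    long get(String partner, Cause cause) {
        LongAdder adder = counts.get(partner + "|" + cause);
        return adder == null ? 0 : adder.sum();
    }
}
```

Exported as a counter with partner and cause labels, these values let a dashboard show at a glance whether a burst of errors comes from the network, the partner, or internal logic.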

3.3 Differentiated Alerting Based on Metric Importance

Critical metrics such as order‑creation failures generate alerts after a single failure, whereas less critical metrics like partner‑interface errors require multiple occurrences before alerting.
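
The difference between the two policies comes down to a threshold within a time window. A minimal sketch, assuming a hypothetical `FailureAlertRule` class and illustrative threshold and window values:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sliding-window rule: fire once `threshold` failures fall inside `windowMillis`.
class FailureAlertRule {
    private final int threshold;
    private final long windowMillis;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureAlertRule(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    // Records a failure at `nowMillis` and returns true if the alert should fire.
    synchronized boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Drop failures that have aged out of the window.
        while (!failureTimes.isEmpty() && nowMillis - failureTimes.peekFirst() > windowMillis) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() >= threshold;
    }
}
```

A rule created as `new FailureAlertRule(1, 60_000)` fires on the first order-creation failure, while `new FailureAlertRule(5, 60_000)` stays quiet until a partner interface has failed five times within a minute. Keeping the threshold per metric avoids both missed critical failures and alert fatigue from transient partner errors.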

3.4 Monitoring Remote Call QPS and Latency

Latency monitoring guides architectural decisions, such as switching to asynchronous calls to avoid timeouts.

QPS monitoring triggers an alert when traffic reaches 80% of the configured limit, prompting capacity adjustments.
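
The 80% check reduces to simple arithmetic on a request counter: divide the increase by the elapsed time, then compare against a fraction of the limit. The class `QpsLimitCheck` and the sample numbers are assumptions for illustration.

```java
// Illustrative QPS check derived from two samples of a request counter.
class QpsLimitCheck {
    // Average requests per second between two counter samples.
    static double qps(long previousCount, long currentCount, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return (currentCount - previousCount) * 1000.0 / intervalMillis;
    }

    // True once the observed QPS reaches the given fraction (for example 0.8) of the limit.
    static boolean nearLimit(double qps, double configuredLimit, double fraction) {
        return qps >= configuredLimit * fraction;
    }
}
```

For example, 850 requests over 10 seconds is 85 QPS, which is above 80% of a 100 QPS limit and would raise the alert, while 50 QPS would not.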

3.5 Monitoring Underlying Dependencies

Frequent Full GC events can indicate a memory leak or an undersized heap.

Database connection‑pool metrics (commit time, rollback count, transaction count) help tune pool size and detect abnormal workloads.

Container metrics (disk usage, host resources) help catch infrastructure-level problems before they cause failures.
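
For the GC case, the JVM already exposes the raw numbers through the standard `java.lang.management` API, so a sketch of what an exporter would read looks like this. The class name `GcStats` is an assumption; which collector name corresponds to old-generation or full collections depends on the garbage collector in use, so the sketch reports counts per collector rather than guessing.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Reads GC totals from the JVM's management beans.
class GcStats {
    // Collection count per collector name; an undefined count (-1) is reported as 0.
    static Map<String, Long> collectionCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            counts.put(gc.getName(), Math.max(0L, gc.getCollectionCount()));
        }
        return counts;
    }

    // Accumulated GC time in milliseconds across all collectors.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0L, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        System.gc(); // only a hint to the JVM, used so the counters are likely non-zero
        System.out.println(collectionCounts() + " totalMillis=" + totalCollectionMillis());
    }
}
```

Both values only ever grow during the life of the process, so they behave like counters: the interesting signal is how fast they increase, not their absolute value.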

4. Summary

Comprehensive monitoring with Prometheus, covering business-level metrics, infrastructure health, and alerting policies, provides early problem detection, faster fault resolution, performance optimization, and security risk mitigation, which together help keep the recycling system stable and reliable.

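The following excerpt from the partner-interface call path shows the instrumentation: whatever the outcome, the finally block passes the action code and the elapsed time to MetricsMonitor.recordOne under a metric name built per price source.

public <T, R> R execute(String priceSource, String actionCode, T bodyParam, Class<R> responseDataClazz) {
    // Start timing before any work so the finally block can report the total elapsed time.
    long startTime = System.currentTimeMillis();
    try {
        // ... build request URL and other config
        try {
            // ... send request (assigns responseEntity)
        } catch (ResourceAccessException e) {
            // Network-level failure (timeout, connection error): signal that a retry is needed.
            log.error("Quote provider API request timed out or failed, will retry, exception:{}", ExceptionUtils.getStackTrace(e));
            throw new RetryException("Quote provider API request timed out or failed", e);
        }
        // A non-200 HTTP status is treated as retryable.
        if (HttpStatus.OK.value() != responseEntity.getStatusCode().value()) {
            log.info("priceSourceApi returned a non-200 httpCode, httpCode:{}", responseEntity.getStatusCode().value());
            throw new RetryException("priceSourceApi call failed: " + responseEntity.getStatusCode().value());
        }
        BaseResp baseResp = responseEntity.getBody();
        if (Objects.isNull(baseResp)) {
            throw new PriceSourceApiException("priceSourceApi returned an empty body");
        }
        // A business-level failure code from the partner is also treated as retryable.
        // priceSourceApiAction comes from the elided configuration step above.
        String respCode = baseResp.getCode();
        if (!priceSourceApiAction.getSuccessCode().equals(respCode)) {
            throw new RetryException("priceSourceApi call failed: " + baseResp.getMsg());
        }
        // o is the result object; its construction is not shown in the original excerpt.
        return o;
    } catch (Exception e) {
        log.error("execute Api PriceSource fail for priceSource: {}, action: {}", priceSource, actionCode, e);
        throw new HunterErrorException("Failed to call recycler API", e);
    } finally {
        // Record the call under a per-price-source metric name, on success and on failure alike.
        long usingTime = System.currentTimeMillis() - startTime;
        String metricsName = PrometheusMetricsEnum.OUTER_RECYCLER_HANDLE_INTER2_TOTAL.getName() + priceSource;
        MetricsMonitor.recordOne(metricsName, actionCode, usingTime);
    }
}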