Rapid Issue Localization and Alerting for B2C Backend Using Custom Log Agent and Prometheus
This article describes how the Zhuanzhuan B2C backend team built standardized logging, a custom Apollo‑based log agent, a Prometheus‑driven alerting service, and a first‑responsible‑person mechanism to quickly locate and resolve service timeouts, exceptions, and other production issues, even when working off‑site.
1 Why Write This Article
The Zhuanzhuan B2C tech team is responsible for core BFF pages such as product detail and list, where a single request may spawn parallel calls to more than twenty downstream services, making service stability and thread‑pool health critical concerns.
From a technical perspective the call chain shows three main characteristics: high CPU usage with extensive thread‑pool usage, many RPC calls whose stability depends on downstream services, and long business call chains that are hard to control precisely.
This article explains how to quickly locate common problems and outlines a further log‑governance plan.
2 Current Situation and Problems
All troubleshooting tools are shared across the company, so each alert requires using generic platforms such as the service‑governance console, Grafana, or the log platform, which do not fit the B2C team’s typical scenarios like upstream call exceptions or timeouts.
Online alerts cannot be triaged quickly. The built‑in alerts do not directly pinpoint the issue; new Prometheus PromQL queries must be added to display timeout and exception rankings.
There is no quick navigation into the company‑wide platforms: engineers must open each tool by hand and re‑enter service names, time ranges, and query statements.
When working away from the office, engineers cannot conveniently open all the platforms needed for issue localization.
3 Solution
The proposed solution addresses rapid problem identification, one‑click navigation to the appropriate platform, and off‑site troubleshooting.
The overall architecture is shown below:
Four steps: standardization → log collection → log alert & localization → alert perception.
3.1 Standardization and Adjustment
Log‑output standardization: clean up unreasonable log output. Optimize log levels, e.g., change "empty password input" from error to warn. Ensure logs are printed at key business nodes such as coupon claim, reservation, and flash‑sale success. Remove noisy logs that do not help troubleshooting, such as printing the entire product‑detail response.
Exception‑type and threshold standards: standardize exception types and adjust alert thresholds. Suppress stack traces that carry no diagnostic value, e.g., omit TimeoutException stacks. Set dynamic thresholds per business line to support the 99.99% stability target.
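As a minimal sketch of the log‑level and stack‑suppression adjustments above (the article does not name the team's logging framework; java.util.logging is used here purely for illustration, and the method and service names are hypothetical):

```java
import java.util.concurrent.TimeoutException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogStandardization {
    private static final Logger log = Logger.getLogger(LogStandardization.class.getName());

    // Before the cleanup this was logged at error level; it is a user-input
    // issue, not a service fault, so it now goes out as a warning.
    static void onEmptyPassword(String userId) {
        log.log(Level.WARNING, "empty password input, userId={0}", userId);
    }

    // Timeout stack traces carry no diagnostic value here, so only the
    // exception message is kept instead of the full stack.
    static void onDownstreamTimeout(String service, TimeoutException e) {
        log.log(Level.WARNING, "call to {0} timed out: {1}",
                new Object[]{service, e.getMessage()});
    }
}
```

The same idea applies in SLF4J/Logback setups: log `e.getMessage()` rather than passing the throwable, so grep‑friendly single‑line records replace multi‑line stacks.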
3.2 Custom Log Agent Interception Based on Apollo (Log Collection)
Log collection steps:
Apollo configuration controls exception and log‑printing dimensions.
Useless exceptions are filtered and reported to Prometheus, then visualized in Grafana.
A JavaAgent intercepts logs according to configured business, class, and method levels. Sample code:
public MethodVisitor visitMethod(int access, String name, String descriptor, String signature, String[] exceptions) {
    MethodVisitor methodVisitor = super.visitMethod(access, name, descriptor, signature, exceptions);
    if (STR_V.equals(descriptor) && infoLevel.contains(name)) {
        // Log-interception handling logic omitted here
        return new LogMethodInsnVisitor(methodVisitor, className, name);
    } else {
        return methodVisitor;
    }
}
3.3 Custom B2C Business Alert Service (Log Alert & Localization)
Write custom PromQL in Prometheus to collect service timeouts and exceptions.
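A hedged example of such a query (the metric and label names here are hypothetical; the team's actual metrics are not shown in the article):

```promql
# Top 5 downstream services by timeout rate over the last 5 minutes
topk(5,
  sum by (downstream_service) (rate(b2c_rpc_timeout_total[5m]))
    /
  sum by (downstream_service) (rate(b2c_rpc_request_total[5m]))
)
```

An analogous query over an exception counter produces the service‑exception ranking.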
Prometheus aggregates alert logs and calls the B2C alert service API.
The API pushes alert messages via MQ; the B2C service consumes them, formats the data, and notifies through an enterprise‑WeChat bot.
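A sketch of the formatting step, building the markdown payload for an enterprise‑WeChat group bot webhook (the `msgtype`/`markdown` field names follow the public bot API as commonly documented; the class, content layout, and parameters are assumptions for illustration):

```java
public class WeComAlertMessage {
    /** Minimal JSON escaping for the interpolated fields. */
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds the JSON body posted to the bot's webhook URL. */
    static String buildPayload(String service, double timeoutRatePct, String dashboardUrl) {
        String content = "**B2C alert**\n"
                + "> service: " + service + "\n"
                + "> timeout rate: " + timeoutRatePct + "%\n"
                + "> [open dashboard](" + dashboardUrl + ")";
        return "{\"msgtype\":\"markdown\",\"markdown\":{\"content\":\"" + esc(content) + "\"}}";
    }
}
```

The link embedded in the message is what makes each alert clickable, jumping straight to the localization dashboard described next.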
Clicking an alert opens a dashboard showing service‑timeout ranking, service‑exception ranking, global exception list, and a toolbox for quick navigation.
The toolbox uses a custom grep‑based time‑range filter because the company log platform only supports hour‑level filtering. For example, the range 16:44–17:14 becomes the regex:
\(16:\(4[4-9]\|5[0-9]\):\)\|\(17:\(0[0-9]\|1[0-4]\):\)
For frequently used links, placeholders such as {serverName}, {startTimeStamp}, {endTimeStamp}, and {ip} are substituted to generate quick‑jump URLs.
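Generating such a regex by hand is error‑prone, so it helps to derive it programmatically. The following sketch (class and method names are my own; the article does not show this code) produces a grep basic‑regex for an HH:MM range by splitting each hour's minutes into tens‑digit groups:

```java
import java.util.ArrayList;
import java.util.List;

public class GrepTimeRange {
    /** Minutes from..to (inclusive, within one hour) as BRE alternatives, e.g. "4[4-9]\|5[0-9]". */
    static String minutePattern(int from, int to) {
        List<String> parts = new ArrayList<>();
        for (int tens = from / 10; tens <= to / 10; tens++) {
            int lo = (tens == from / 10) ? from % 10 : 0; // first group may start mid-decade
            int hi = (tens == to / 10) ? to % 10 : 9;     // last group may end mid-decade
            parts.add(tens + (lo == hi ? String.valueOf(lo) : "[" + lo + "-" + hi + "]"));
        }
        return String.join("\\|", parts);
    }

    /** Basic regex for grep matching timestamps between start and end (same day, end after start). */
    static String timeRangePattern(int startH, int startM, int endH, int endM) {
        List<String> alts = new ArrayList<>();
        for (int h = startH; h <= endH; h++) {
            int from = (h == startH) ? startM : 0;
            int to = (h == endH) ? endM : 59;
            alts.add(String.format("\\(%02d:\\(%s\\):\\)", h, minutePattern(from, to)));
        }
        return String.join("\\|", alts);
    }
}
```

For 16:44–17:14 this reproduces the regex shown above exactly.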
3.4 First‑Responsible‑Person Mechanism (Alert Perception)
Bind the first responsible person and their leader’s phone number to each service.
If a specific type of alert (e.g., downstream service exception) exceeds a time threshold without response, an IVR call notifies the first responsible person.
If still unattended, an IVR call notifies the leader.
When someone takes over, the alert is marked as "handler xxx following up".
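The escalation decision above can be sketched as a small state check. The timeout values here are assumptions for illustration; the article only says a configurable time threshold is used:

```java
public class AlertEscalation {
    enum Action { NONE, CALL_OWNER, CALL_LEADER }

    // Assumed thresholds; the real values live in the team's alert configuration.
    static final long OWNER_TIMEOUT_SEC = 300;   // IVR-call the first responsible person after 5 min
    static final long LEADER_TIMEOUT_SEC = 600;  // IVR-call their leader after 10 min

    /** Decides the next IVR action for an alert that may still be unacknowledged. */
    static Action nextAction(long secondsSinceAlert, boolean acknowledged, boolean ownerCalled) {
        if (acknowledged) return Action.NONE; // marked "handler xxx following up"
        if (ownerCalled && secondsSinceAlert >= LEADER_TIMEOUT_SEC) return Action.CALL_LEADER;
        if (!ownerCalled && secondsSinceAlert >= OWNER_TIMEOUT_SEC) return Action.CALL_OWNER;
        return Action.NONE;
    }
}
```

A scheduler would evaluate this per open alert; acknowledging an alert short‑circuits any further calls.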
4 Application and Effect
4.1 Fast Jump to Localization via Custom Alerts
Enterprise‑WeChat bot delivers alerts.
4.2 Quick Localization of Timeout/Exception Services and the Global Exception List
Clicking an alert jumps to a page showing the relevant metrics.
4.3 One‑Click Jump to Service‑Specific Issue‑Location Platform
5 Summary
In summary, the approach starts from rapid exception localization, uses custom Prometheus PromQL to focus on business‑critical anomalies, and builds an H5 monitoring dashboard, making it possible to troubleshoot issues even off‑site, on non‑working days, without office equipment.
6 Acknowledgements
Thanks to teammates Liu Kuangdi, Li Zhenggan, architecture team members Zhao Hao, Xiao Hang, and front‑end engineer Li Jianpeng for their contributions.
About the author
Li Dishan, backend engineer in the Zhuanzhuan B2C tech department, passionate about cutting‑edge technology, sharing, and coding.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.