
ERROR Log Governance and Monitoring Alerting Practice at Youzan

Youzan’s log‑governance guide uses a car‑dashboard analogy to show why precise ERROR logs and sensible alerts matter, defines INFO/WARN/ERROR levels, sets daily reduction targets, leverages top‑error analysis and water‑level monitoring, and ultimately cut daily ERROR entries from thousands to about one hundred while catching issues before incidents.

Youzan Coder

Dec 30, 2020

This article shares Youzan's practical experience with ERROR log governance and monitoring alerts. The author uses a car dashboard analogy to show why monitoring matters: a car without a dashboard is no better than a bicycle that anyone can ride; a car with a dashboard but no warning lights can be driven, but its faults stay hidden; and a car whose dashboard shows every warning as an ERROR would frighten any driver away.

The article addresses two core problems: too many ERROR-level logs, which are also imprecise, unclear, and incomplete; and too many alerts, which overwhelm the people receiving them and end up ignored. The author emphasizes that logging serves several purposes: recording user behavior, speeding up problem diagnosis, tracing program execution, tracking data changes, enabling statistical analysis, and collecting runtime environment data.

For log level selection, the author provides clear guidelines: INFO for normal operational states; WARN for unreasonable but non-critical situations; ERROR for system errors that prevent goal completion and require manual intervention. The key principle is: if a situation requires human intervention, log it as ERROR; otherwise, use WARN.
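The level rule above can be sketched with Python's standard `logging` module. This is a minimal illustration, not Youzan's implementation: the function `choose_level`, its two boolean parameters, and the logger name `order-service` are hypothetical names introduced here.

```python
import logging


def choose_level(abnormal: bool, needs_human: bool) -> int:
    """Pick a log level following the article's rule of thumb.

    Both parameters are illustrative assumptions, not from the article:
    - needs_human: someone must step in manually to resolve the situation
    - abnormal: the situation is unreasonable but the goal still completes
    """
    if needs_human:
        return logging.ERROR    # requires manual intervention
    if abnormal:
        return logging.WARNING  # unreasonable, but non-critical
    return logging.INFO         # normal operational state


logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("order-service")  # hypothetical service name

logger.log(choose_level(abnormal=False, needs_human=False), "order created")
logger.log(choose_level(abnormal=True, needs_human=False), "retry succeeded on second attempt")
logger.log(choose_level(abnormal=True, needs_human=True), "payment callback could not be persisted")
```

Checking `needs_human` first keeps ERROR reserved for cases where someone actually has to act, which is what lets an ERROR-based alert stay meaningful.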

The governance approach involves: setting daily ERROR log reduction targets, using operations platforms to identify Top 10 error types, analyzing disk I/O impact, and continuously addressing production errors until targets are met.
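As a rough sketch of the Top 10 analysis, the snippet below counts ERROR entries by type and compares the daily total against a target. The log line format, the regular expression, and the names `top_error_types` and `over_daily_target` are assumptions for illustration; the article does not describe the actual operations platform or its log format.

```python
import re
from collections import Counter

# Assumed line shape: "<date> <time> ERROR <ExceptionType>: <message>"
ERROR_LINE = re.compile(r"\bERROR\s+(?P<kind>[\w.]+)")


def top_error_types(lines, n=10):
    """Return the n most frequent ERROR types as (type, count) pairs."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group("kind")] += 1
    return counts.most_common(n)


def over_daily_target(lines, target):
    """True if the number of ERROR lines exceeds the daily target."""
    total = sum(1 for line in lines if ERROR_LINE.search(line))
    return total > target


# Hypothetical sample data, not real Youzan logs.
sample = [
    "2020-12-30 10:00:01 ERROR OrderTimeoutException: upstream call timed out",
    "2020-12-30 10:00:02 WARN SlowQuery: query took longer than expected",
    "2020-12-30 10:00:03 ERROR NullPointerException: missing buyer id",
    "2020-12-30 10:00:04 ERROR OrderTimeoutException: upstream call timed out",
]

for kind, count in top_error_types(sample):
    print(kind, count)
```

Ranking by type rather than by raw line makes it easier to see which few errors account for most of the volume, so the largest ones can be fixed first.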

Governance also brought side benefits: uncovering hidden bugs, identifying MySQL timeout and index issues at scale, tuning upstream and downstream timeout configurations, cleaning up obsolete code and interfaces, and pushing the owners of dependencies to make improvements.

The article also discusses water-level monitoring for WARN logs that normally stay at low levels but spike when problems occur, as well as business data monitoring (e.g., daily transaction statistics) to detect anomalies.
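One simple way to picture water-level monitoring is to compare the current WARN count for a time window against a baseline built from recent windows. The sketch below is an assumption about how such a check could look; the function name `exceeds_water_level`, the multiplier `factor`, and the minimum threshold `floor` are hypothetical and not taken from the article.

```python
from statistics import mean


def exceeds_water_level(history, current, factor=3.0, floor=10):
    """Return True when the current count rises well above its usual level.

    history: WARN counts from recent, comparable windows (for example, per minute)
    current: WARN count in the window being checked
    factor:  how many times the baseline counts as a spike (assumed value)
    floor:   minimum threshold, so a near-zero baseline does not alert on noise
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    threshold = max(baseline * factor, floor)
    return current > threshold


# Hypothetical per-minute WARN counts.
recent = [4, 5, 6, 5]
print(exceeds_water_level(recent, current=6))
print(exceeds_water_level(recent, current=40))
```

The same comparison could in principle be applied to a business metric such as daily transaction counts, although a drop below the baseline may matter as much as a spike there.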

Results achieved: most online issues are now caught by monitoring alerts before they escalate into incidents; the number of ERROR log types fell from hundreds to single digits; and daily ERROR log volume dropped from a peak of thousands to approximately 100.
