
ERROR Log Governance and Monitoring Alerting Practice at Youzan

Youzan’s log‑governance guide uses a car‑dashboard analogy to show why precise ERROR logs and sensible alerts matter, defines INFO/WARN/ERROR levels, sets daily reduction targets, leverages top‑error analysis and water‑level monitoring, and ultimately cut daily ERROR entries from thousands to about one hundred while catching issues before incidents.

Youzan Coder

This article shares Youzan's practical experience with ERROR log governance and monitoring alerting. The author uses a car dashboard analogy to show why monitoring matters: a car without a dashboard is like a bicycle, something anyone can ride; a car with a dashboard but no warning lights can be driven but may hide dangers; and a car whose dashboard shows nothing but ERROR warnings would scare any driver away.

The article addresses two core problems: too many ERROR-level logs, which are imprecise, unclear, and incomplete; and too many alerts, which overwhelm people and end up ignored. The author emphasizes that logging serves several purposes: recording user behavior, speeding up problem diagnosis, tracing program execution, tracking data changes, enabling statistical analysis, and collecting runtime environment data.

For log level selection, the author provides clear guidelines: INFO for normal operational states; WARN for unreasonable but non-critical situations; ERROR for system errors that prevent the goal from being completed and require manual intervention. The key principle: if a situation requires human intervention, log it as ERROR; otherwise, use WARN.
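The rule of thumb above can be sketched with Python's standard `logging` module. This is only an illustration under assumptions: the function name `choose_level`, its two flags, and the logger name `order-service` are hypothetical and do not come from the original article.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("order-service")  # hypothetical service name


def choose_level(abnormal: bool, needs_human: bool) -> int:
    """Pick a log level following the INFO / WARN / ERROR guideline.

    abnormal:    something unreasonable happened (assumed flag)
    needs_human: someone has to step in to fix it (assumed flag)
    """
    if needs_human:
        return logging.ERROR    # goal not met, manual intervention required
    if abnormal:
        return logging.WARNING  # unreasonable, but the system copes on its own
    return logging.INFO         # normal operational state


# Example: a retry that eventually succeeded is abnormal but needs nobody.
log.log(choose_level(abnormal=True, needs_human=False), "payment retried once")
```

Keeping the decision in one place makes it easier to review whether an ERROR line really deserves an alert.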

The governance approach involves setting daily ERROR log reduction targets, using operations platforms to identify the Top 10 error types, analyzing disk I/O impact, and continuing to fix production errors until the targets are met.
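As a rough sketch of the "Top 10 error types" step, the snippet below counts ERROR lines by exception type. The article does not describe Youzan's operations platform or its log format, so the line layout (`<date> <time> <LEVEL> <Type>: <message>`), the regex, and the sample lines are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout: "<date> <time> <LEVEL> <ExceptionType>: <message>"
LINE_RE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<kind>[\w.]+): ")


def top_error_types(lines, n=10):
    """Return the n most frequent ERROR types as (type, count) pairs."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") == "ERROR":
            counts[m.group("kind")] += 1
    return counts.most_common(n)


# Made-up sample lines, not real Youzan logs.
sample = [
    "2024-01-01 10:00:00 ERROR SocketTimeoutException: read timed out",
    "2024-01-01 10:00:01 WARN RateLimited: retry later",
    "2024-01-01 10:00:02 ERROR SocketTimeoutException: read timed out",
    "2024-01-01 10:00:03 ERROR NullPointerException: order is null",
]

for kind, count in top_error_types(sample):
    print(f"{count:5d}  {kind}")
```

Working down a ranked list like this targets the few error types that produce most of the volume first.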

Governance also brought side benefits: uncovering hidden bugs, identifying MySQL timeout and index issues at scale, tuning upstream/downstream timeout configurations, cleaning up obsolete code and interfaces, and pushing the owners of dependencies to make improvements.

The article also discusses water-level monitoring for WARN logs that normally stay at low levels but spike when problems occur, as well as business data monitoring (e.g., daily transaction statistics) to detect anomalies.
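One way to picture water-level monitoring is a check that compares the current WARN count against a recent baseline. The article does not give Youzan's actual thresholds or algorithm, so the function name `exceeds_water_level` and the `factor` and `floor` parameters below are hypothetical tuning knobs, shown only as a minimal sketch.

```python
from statistics import mean


def exceeds_water_level(history, current, factor=3.0, floor=10):
    """Return True when `current` is well above the recent baseline.

    history: WARN counts from recent intervals (e.g. per minute)
    factor:  assumed multiplier over the average that counts as a spike
    floor:   assumed minimum threshold, so tiny baselines do not cause noise
    """
    baseline = mean(history) if history else 0.0
    threshold = max(baseline * factor, floor)
    return current > threshold


recent = [4, 5, 6, 5]  # made-up per-minute WARN counts
if exceeds_water_level(recent, current=40):
    print("WARN volume is above its usual water level, raise an alert")
```

The floor keeps a quiet service from alerting when its WARN count moves from one to four, while the multiplier still catches a real spike.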

Results achieved: most online issues are now detected by monitoring alerts before they escalate into incidents; the number of ERROR log types fell from hundreds to single digits; and daily ERROR log volume dropped from a peak in the thousands to approximately 100.

Tags: Monitoring, operations, alerting, system reliability, Error Handling, log management
Written by Youzan Coder, the official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
