ERROR Log Governance and Monitoring Alerting Practice at Youzan
Youzan’s log‑governance guide uses a car‑dashboard analogy to show why precise ERROR logs and sensible alerts matter, defines INFO/WARN/ERROR levels, sets daily reduction targets, leverages top‑error analysis and water‑level monitoring, and ultimately cut daily ERROR entries from thousands to about one hundred while catching issues before incidents.
This article shares Youzan's practical experience with ERROR log governance and monitoring alerts. The author uses a car dashboard analogy to illustrate why monitoring matters: a car without a dashboard is like a bicycle that anyone can ride; a car with a dashboard but no warning lights can be driven but may hide dangers; a car whose every warning light shows ERROR would frighten any driver away.
The core problems addressed include: excessive ERROR-level logs that are imprecise, unclear, and incomplete; excessive alerts that become overwhelming and get ignored. The author emphasizes that logging serves multiple purposes: recording user behavior, facilitating rapid problem diagnosis, tracing program execution, tracking data changes, enabling statistical analysis, and collecting runtime environment data.
For log level selection, the author provides clear guidelines: INFO for normal operational states; WARN for unreasonable but non-critical situations; ERROR for system errors that prevent the operation from completing and require manual intervention. The key principle is: if a problem requires human intervention, log it as ERROR; otherwise, use WARN.
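The rule of thumb above can be sketched as a small helper. This is only an illustration, not Youzan's actual code: the function name `choose_level`, its two boolean parameters, and the logger name are assumptions made for the example, and Python's standard `logging` module stands in for whatever logging framework is used in production.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("order-service")


def choose_level(abnormal: bool, needs_human: bool) -> int:
    """Pick a log level following the article's guideline.

    - normal operation                     -> INFO
    - abnormal, no human action needed     -> WARNING
    - abnormal and someone must intervene  -> ERROR
    """
    if not abnormal:
        return logging.INFO
    return logging.ERROR if needs_human else logging.WARNING


def report(message: str, abnormal: bool, needs_human: bool) -> None:
    """Log the message at the level chosen by the guideline."""
    logger.log(choose_level(abnormal, needs_human), message)
```

Under this sketch, a retry that eventually succeeds would be logged as WARN, while a failure that leaves data for someone to repair by hand would be logged as ERROR.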
The governance approach involves: setting daily ERROR log reduction targets, using operations platforms to identify Top 10 error types, analyzing disk I/O impact, and continuously addressing production errors until targets are met.
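A Top 10 analysis of this kind can be approximated by grouping ERROR messages by a normalized signature and counting them. The sketch below is an assumption about how such a ranking might work, not a description of Youzan's operations platform: the `(level, message)` input shape and the digit-masking rule in `signature` are hypothetical choices for the example.

```python
import re
from collections import Counter


def signature(message: str) -> str:
    """Mask runs of digits so messages differing only in ids or durations group together."""
    return re.sub(r"\d+", "<n>", message)


def top_errors(entries, n=10):
    """Return the n most frequent ERROR signatures as (signature, count) pairs.

    `entries` is an iterable of (level, message) tuples; this shape is an
    assumption made for the example.
    """
    counts = Counter(
        signature(message)
        for level, message in entries
        if level == "ERROR"
    )
    return counts.most_common(n)
```

Ranking by signature rather than raw message keeps one noisy error with varying ids from appearing as many distinct low-count entries.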
Additional benefits of the governance effort include: uncovering hidden bugs, identifying MySQL timeout and index issues at scale, optimizing upstream/downstream timeout configurations, cleaning up obsolete code and interfaces, and pushing the owners of dependencies to make improvements.
The article also discusses water-level monitoring for WARN logs that normally stay at low levels but spike when problems occur, as well as business data monitoring (e.g., daily transaction statistics) to detect anomalies.
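One way to picture water-level monitoring is a check that compares the current WARN count against a recent baseline. The following is a minimal sketch under stated assumptions: the function name, the `factor` and `floor` parameters, and the use of a simple mean as the baseline are all hypothetical, and the article does not specify the thresholds Youzan actually uses.

```python
from statistics import mean


def water_level_alert(history, current, factor=3.0, floor=10):
    """Return True when the current WARN count rises well above its usual level.

    history: WARN counts from recent comparable windows (e.g. per minute)
    current: WARN count in the window being checked
    factor:  how many times the baseline counts as a spike (assumed value)
    floor:   minimum count before alerting, to ignore tiny fluctuations (assumed value)
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return current >= floor and current > factor * baseline
```

The same comparison could be applied to a business metric such as a daily transaction count, though a drop below the baseline would usually matter there as much as a spike.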
Results achieved: most online issues are now caught by monitoring alerts before they escalate into incidents; the number of distinct ERROR log types fell from hundreds to single digits; and daily ERROR log volume dropped from a peak in the thousands to approximately 100.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
