Mastering Log Management: 17 Rules to Boost System Reliability
This article presents a set of logging best-practice rules, covering log levels and classification, RequestIDs, monitoring and alerts, and log size management, aimed at improving system reliability, troubleshooting speed, and operational efficiency.
Preface
Logging records user actions, system status, and other events. It is a crucial component of any system, yet it is often neglected until an incident occurs and exposes how inadequate the logs are.
Good logging speeds up problem location and can reveal risks before incidents.
We analyzed and optimized logs during the development and operation of NOS (Netease Object Storage) and summarize the experience.
Collected Experience
1. About Log Levels
Common log libraries (e.g., log4j) define levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL. In practice, deciding which events belong to which level requires discussion.
Project log level definitions should be clear and followed by all developers.
Even TRACE/DEBUG logs need a standard format so that developers, testers, and operators can locate issues.
Guidelines for each level:
FATAL – unrecoverable system errors that abort the application.
ERROR – errors that affect users and require immediate handling.
WARN – potentially harmful situations that may become errors if not addressed.
INFO – normal operation messages; should not exceed 10% of TRACE volume.
DEBUG or TRACE – detailed tracing for precise diagnosis.
Rule 1: The whole team (including operations) must have clear definitions for log levels and handling procedures.
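One way to make the level definitions explicit is to keep them in code, next to the handling procedure agreed for each level. The sketch below uses Python's standard logging module. The TRACE level number, the LEVEL_POLICY table, its procedure strings, and the handling_for helper are all illustrative assumptions, not the NOS team's actual setup.

```python
import logging

# Assumption: Python's logging module has no built-in TRACE level, so one is
# registered below DEBUG (10). logging.FATAL is an alias of logging.CRITICAL.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

# Hypothetical handling procedures; each team should agree on its own.
LEVEL_POLICY = {
    logging.FATAL: "page the on-call engineer immediately",
    logging.ERROR: "alert and handle immediately",
    logging.WARNING: "review before it turns into an error",
    logging.INFO: "no action; normal operation",
    logging.DEBUG: "no action; used for diagnosis",
    TRACE: "no action; used for diagnosis",
}


def handling_for(levelno: int) -> str:
    """Return the procedure of the highest defined level not above levelno."""
    for level in sorted(LEVEL_POLICY, reverse=True):
        if levelno >= level:
            return LEVEL_POLICY[level]
    return LEVEL_POLICY[TRACE]
```

Keeping the table in one place gives developers, testers, and operators a single definition to review and update.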
2. Keep Log Content Updated
DEBUG/TRACE logs are crucial for troubleshooting; they must be complete, non‑redundant, and uniformly formatted. Good practices include:
Define a team‑wide standard for DEBUG/TRACE logs.
Regularly review logged content across development, operations, and testing.
Developers should take part in operations and debugging work so that they experience the logs first-hand and improve them.
Operations and testing must promptly report observed issues to developers.
Rule 2: Periodically optimize log content to enable fast and accurate problem location.
3. Log Classification
Logs can be categorized by purpose: diagnostic, statistical, audit.
Diagnostic logs include request entry/exit, external service calls, resource usage, fault‑tolerance actions, exceptions, background tasks, startup/shutdown, etc.
Statistical logs include user access statistics and billing information.
Audit logs record administrative actions.
Rule 3: Clearly define the purpose of each log type and classify accordingly.
4. Avoid Useless Logs
In many applications, Fuse requests generate many NoSuchKey errors, flooding logs with meaningless entries.
Rule 4: Never print useless logs that drown out important information.
Solution: Add a User‑Agent header for Fuse requests and suppress NoSuchKey logs when detected.
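The suppression described above could be sketched as a logging filter. This is a minimal example under stated assumptions: the record attributes user_agent and error_code, and the rule that a Fuse client is recognized by "fuse" appearing in its User-Agent, are hypothetical and not taken from NOS.

```python
import logging


class FuseNoSuchKeyFilter(logging.Filter):
    """Drop NoSuchKey records that were caused by Fuse client requests."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Both attributes are assumed to be attached by the caller via `extra=`.
        user_agent = getattr(record, "user_agent", "")
        error_code = getattr(record, "error_code", "")
        is_fuse = "fuse" in user_agent.lower()
        # Returning False tells the logging framework to discard the record.
        return not (is_fuse and error_code == "NoSuchKey")
```

A caller would attach the filter with logger.addFilter(FuseNoSuchKeyFilter()) and log with extra={"user_agent": ..., "error_code": ...}. NoSuchKey errors from other clients still reach the log.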
5. Log Information Must Be Complete
Example: a multipart upload goes through InitMultiUpload, UploadPart, and CompleteMultiUpload. Because the logs omitted the UploadID, the steps of a single upload could not be tied together, which made troubleshooting hard.
Recommended log items include system initialization parameters, all errors, all warnings, before/after values for persisted data, inter‑module request/response, important state changes, and long‑running task progress.
Do not log function entry points (unless they represent significant events), the content of large message bodies, or benign errors.
Rule 5: Log information should be accurate and comprehensive enough to locate problems solely from logs.
6. Test Logs
Test logs should contain environment, initial state, detailed steps, interaction info, expected and actual results.
Rule 6: Apply the same strict standards to test program logs.
7. Improve Logs from Issues
When an incident takes a long time to locate, refine the logging that would have shortened the search, and use logs to anticipate future problems.
Rule 7: Log optimization is a continuous effort that learns from errors.
8. RequestID
RequestID is generated from request time, server IP, and a random number, enabling identification of the handling machine.
For example, decoding a RequestID recovers the request time, the server IP, and the random number:
./decode.sh 4b2c009a0a7800000142789f42b8ca96
Thu Nov 21 11:06:12 CST 2013
10.120.202.150
4b2c009a
Rule 8: Encode as much information as possible into RequestID.
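The following sketch shows how such an ID could be built and decoded. The byte layout (4 random bytes, the first two IP bytes, an 8-byte millisecond timestamp, then the last two IP bytes) is inferred from the single sample ID in this section and is an assumption, not a documented NOS format. The function names are hypothetical.

```python
import ipaddress
import secrets
import time
from datetime import datetime, timezone


def encode_request_id(ip, timestamp_ms=None, rand=None):
    """Pack random number, IPv4 address, and time into a 32-char hex ID."""
    ip_bytes = ipaddress.IPv4Address(ip).packed
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    if rand is None:
        rand = secrets.randbits(32)
    # Assumed layout: rand(4) | ip[0:2] | timestamp_ms(8) | ip[2:4]
    raw = (
        rand.to_bytes(4, "big")
        + ip_bytes[:2]
        + timestamp_ms.to_bytes(8, "big")
        + ip_bytes[2:]
    )
    return raw.hex()


def decode_request_id(request_id):
    """Return (UTC time truncated to seconds, server IP, random part as hex)."""
    raw = bytes.fromhex(request_id)
    if len(raw) != 16:
        raise ValueError("expected a 32-character hex RequestID")
    rand = raw[0:4].hex()
    ip = str(ipaddress.IPv4Address(raw[4:6] + raw[14:16]))
    timestamp_ms = int.from_bytes(raw[6:14], "big")
    when = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return when, ip, rand
```

Under this layout, decoding the sample ID gives 2013-11-21 03:06:12 UTC (11:06:12 in UTC+8), 10.120.202.150, and 4b2c009a, which matches the decode.sh output.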
9. Associate Entire Request Flow with RequestID
Error stack traces that lacked a RequestID were hard to correlate with the request that caused them; downstream services (video, image) also need to log the same RequestID.
Rule 9: Link the whole processing flow of a request with a unique RequestID.
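One way to attach the RequestID to every log line, including stack traces, is to keep it in a context variable and inject it with a logging filter. This is a sketch using Python's standard library; the names request_id_var, RequestIdFilter, build_logger, and outbound_headers, and the X-Request-Id header, are assumptions chosen for illustration.

```python
import contextvars
import logging

# Holds the RequestID of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current RequestID onto every record so the format can use it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(name, stream=None):
    """Create a logger whose lines all carry the RequestID in brackets."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def outbound_headers():
    """Headers to send to downstream services (header name is hypothetical)."""
    return {"X-Request-Id": request_id_var.get()}
```

The request handler sets request_id_var once at entry. Every later log call in that request, including logger.exception, then carries the ID, and downstream calls can forward it through outbound_headers().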
10. Log Level on Production Machines
DEBUG logs are too verbose for all machines; keep INFO on most, enable DEBUG on one machine for troubleshooting.
Rule 10: Keep one machine with DEBUG level enabled.
11. Post‑deployment Log Observation
After a new feature rollout, observe logs (e.g., bucket cache operations) on the DEBUG‑enabled machine before scaling.
Rule 11: Observe logs after a new release goes live to verify feature correctness.
12. Slow‑Operation Logs
Record the start and end time of each request, and promote the log entry to WARN when the duration exceeds a threshold. Apply the same escalation to calls to external dependencies.
Rule 12: Use log level escalation to discover potential problems.
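The escalation can be sketched as a small context manager that times an operation and chooses the level from the elapsed time. The helper name timed and the default threshold of one second are assumptions; a real threshold would be tuned per operation.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timed(logger, operation, warn_after_s=1.0):
    """Log how long `operation` took; escalate to WARNING when it is slow."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        level = logging.WARNING if elapsed > warn_after_s else logging.INFO
        logger.log(
            level,
            "%s took %.3fs (threshold %.3fs)",
            operation,
            elapsed,
            warn_after_s,
        )
```

Wrapping both the request handler and each external call, for example with timed(logger, "metadata lookup", warn_after_s=0.2), makes slow dependencies show up as WARN entries before they turn into errors.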
13. Log Alerts
Error count monitoring triggers alerts; keyword alerts (e.g., “Quota Warning”) inform users of capacity issues.
Rule 13: Monitor logs with alerts to detect issues before customers notice.
Rule 14: Use keywords in logs to determine system health.
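Both kinds of alert can be sketched as one scan over a batch of log lines. The function name scan_for_alerts, the error threshold, and the assumption that the level appears as a standalone word (ERROR or FATAL) in each line are illustrative; a production setup would normally rely on a monitoring system instead of a script like this.

```python
import re
from collections import Counter

# Assumption: the level name appears as a standalone word in each log line.
ERROR_RE = re.compile(r"\b(?:ERROR|FATAL)\b")


def scan_for_alerts(lines, keywords=("Quota Warning",), max_errors=10):
    """Return alert messages for excessive errors and for watched keywords."""
    counts = Counter()
    for line in lines:
        if ERROR_RE.search(line):
            counts["errors"] += 1
        for keyword in keywords:
            if keyword in line:
                counts[keyword] += 1

    alerts = []
    if counts["errors"] > max_errors:
        alerts.append(f"error count {counts['errors']} exceeds {max_errors}")
    for keyword in keywords:
        if counts[keyword]:
            alerts.append(f"keyword '{keyword}' seen {counts[keyword]} time(s)")
    return alerts
```

Running the scan on each fixed time window of log lines gives an error-rate alert and a keyword alert from the same pass.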
14. Log Format
Inconsistent formats hinder automation; use a common logging function to standardize.
Rule 15: Keep log format uniform and standardized.
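A common logging function might look like the sketch below, which renders every event as sorted key=value pairs so that scripts can parse any line the same way. The names format_event and log_event and the key=value layout are assumptions, not the format used by NOS.

```python
import logging


def format_event(event, **fields):
    """Render an event name plus fields as 'event=... key=value ...'."""
    parts = [f"event={event}"]
    # Sorting the keys keeps the field order stable across call sites.
    parts += [f"{key}={fields[key]!r}" for key in sorted(fields)]
    return " ".join(parts)


def log_event(logger, level, event, **fields):
    """Single entry point that all modules use to write structured log lines."""
    logger.log(level, format_event(event, **fields))
```

For example, log_event(logger, logging.INFO, "UploadPart", upload_id="u-1", part=3) writes the message event=UploadPart part=3 upload_id='u-1'.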
15. Separate Error Log Files
During high concurrency, isolate error logs into a dedicated file for easier analysis.
Rule 16: Output error logs to a separate file for analysis.
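With Python's standard logging module, this separation can be sketched with two file handlers on the same logger, where the second one accepts only ERROR and above. The function name and the file paths passed to it are assumptions.

```python
import logging


def configure_file_logging(logger, all_path, error_path):
    """Write every record to all_path and ERROR-or-higher also to error_path."""
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")

    all_handler = logging.FileHandler(all_path, encoding="utf-8")
    all_handler.setFormatter(fmt)

    error_handler = logging.FileHandler(error_path, encoding="utf-8")
    error_handler.setLevel(logging.ERROR)  # filters out INFO and WARNING
    error_handler.setFormatter(fmt)

    logger.addHandler(all_handler)
    logger.addHandler(error_handler)
    logger.setLevel(logging.INFO)
    return all_handler, error_handler
```

Errors still appear in the main file, so the surrounding context is preserved, while the error file stays small enough to scan quickly under heavy traffic.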
16. Log File Size Management
Split logs by day or by hour, depending on volume. Collect logs centrally, and regularly delete old local files (e.g., those older than 60 days).
Rule 17: Establish policies for log size, rotation, and deletion.
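Rotation and deletion can be sketched with TimedRotatingFileHandler from Python's standard library. The helper name and its parameters are assumptions. Central collection is outside the scope of this example.

```python
from logging.handlers import TimedRotatingFileHandler


def make_rotating_handler(path, hourly=False, keep_days=60):
    """Rotate the log daily (or hourly) and keep roughly keep_days of history."""
    when = "H" if hourly else "midnight"
    # backupCount limits the number of rotated files, not their age, so the
    # retention period is converted into a file count.
    backups = keep_days * 24 if hourly else keep_days
    return TimedRotatingFileHandler(
        path, when=when, backupCount=backups, encoding="utf-8"
    )
```

A high-volume service would pass hourly=True so that each file stays small. Older files are deleted automatically once the count exceeds backupCount.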
Summary of Experience
The whole team must have clear log‑level definitions and handling procedures.
Periodically optimize log content for fast, accurate issue location.
Define log purposes and classify accordingly.
Avoid useless logs that drown out important information.
Ensure logs are accurate and complete enough to locate problems.
Apply strict standards to test program logs.
Log optimization is an ongoing effort learning from errors.
Encode as much information as possible into RequestID.
Associate the entire request flow with a unique RequestID.
Enable DEBUG on one machine.
Observe logs after a new release goes live to verify functionality.
Use log‑level escalation to discover potential problems.
Monitor logs with alerts to detect issues before customers notice.
Use keywords in logs to assess system status.
Maintain a uniform log format.
Separate error logs for analysis.
Establish policies for log size, rotation, and deletion.
References
[1] “Optimal Logging” Anthony Vallone, Google
