
Mastering Log Management: 16 Rules to Boost System Reliability

This article presents a comprehensive set of logging best‑practice rules—from defining log levels and classifications to using RequestIDs, monitoring alerts, and managing log size—aimed at improving system reliability, troubleshooting speed, and operational efficiency.

21CTO

Preface

Logging records user actions, system status, and other events. It is a crucial part of any system, yet it is often neglected until an incident exposes its gaps.

Good logging speeds up problem location and can reveal risks before they turn into incidents.

We analyzed and optimized logs while developing and operating NOS (Netease Object Storage); this article summarizes that experience.

Collected Experience

1. About Log Levels

Common log libraries (e.g., log4j) define levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL. In practice, deciding which events belong to which level requires discussion.

Project log level definitions should be clear and followed by all developers.

Even TRACE/DEBUG logs need a standard format so that developers, testers, and operators can locate issues.

Guidelines for each level:

FATAL – unrecoverable system errors that abort the application.

ERROR – errors that affect users and require immediate handling.

WARN – potentially harmful situations that may become errors if not addressed.

INFO – normal operation messages; as a rule of thumb, INFO volume should stay below 10% of TRACE volume.

DEBUG or TRACE – detailed tracing for precise diagnosis.

Rule 1: The whole team (including operations) must have clear definitions for log levels and handling procedures.

2. Keep Log Content Updated

DEBUG/TRACE logs are crucial for troubleshooting; they must be complete, non‑redundant, and uniformly formatted. Good practices include:

Define a team‑wide standard for DEBUG/TRACE logs.

Regularly review logged content across development, operations, and testing.

Developers should take part in operations work and use what they learn while troubleshooting to improve the logs.

Operations and testing must promptly report observed issues to developers.

Rule 2: Periodically optimize log content to enable fast and accurate problem location.

3. Log Classification

Logs can be categorized by purpose: diagnostic, statistical, audit.

Diagnostic logs include request entry/exit, external service calls, resource usage, fault‑tolerance actions, exceptions, background tasks, startup/shutdown, etc.

Statistical logs include user access statistics and billing information.

Audit logs record administrative actions.

Rule 3: Clearly define the purpose of each log type and classify accordingly.

4. Avoid Useless Logs

In many applications, Fuse requests generate many NoSuchKey errors, flooding logs with meaningless entries.

Rule 4: Never print useless logs that drown out important information.

Solution: Add a User‑Agent header for Fuse requests and suppress NoSuchKey logs when detected.
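
The suppression step could be sketched as a logging filter. The record attributes `user_agent` and `error_code`, the `fuse` marker in the User-Agent, and the logger name are assumptions for illustration, not NOS's actual field names.

```python
import logging


class FuseNoSuchKeyFilter(logging.Filter):
    """Drop NoSuchKey records caused by Fuse clients (hypothetical field names)."""

    def filter(self, record: logging.LogRecord) -> bool:
        # `user_agent` and `error_code` are expected to be passed via `extra=`.
        user_agent = getattr(record, "user_agent", "") or ""
        error_code = getattr(record, "error_code", "")
        is_fuse = "fuse" in user_agent.lower()
        # Returning False tells the logging module to discard the record.
        return not (is_fuse and error_code == "NoSuchKey")


logger = logging.getLogger("nos.access")  # hypothetical logger name
logger.addFilter(FuseNoSuchKeyFilter())

# This record would be discarded by the filter:
logger.error(
    "object not found",
    extra={"user_agent": "nos-fuse/1.0", "error_code": "NoSuchKey"},
)
```

NoSuchKey errors from other clients still pass through, so genuine missing-object problems stay visible.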

5. Log Information Must Be Complete

Example: a multipart upload consists of InitMultiUpload, UploadPart, and CompleteMultiUpload. Because the logs did not record the UploadID, the steps of a single upload could not be tied together, which made troubleshooting hard.

Recommended log items include system initialization parameters, all errors, all warnings, before/after values for persisted data, inter‑module request/response, important state changes, and long‑running task progress.

Do not log function entry points (unless they represent significant events), large message bodies, or benign errors.

Rule 5: Log information should be accurate and comprehensive enough to locate problems solely from logs.

6. Test Logs

Test logs should contain environment, initial state, detailed steps, interaction info, expected and actual results.

Rule 6: Apply the same strict standards to test program logs.

7. Improve Logs from Issues

When an incident takes a long time to locate, treat that as a sign that the logs are lacking and refine them. Also use logs to anticipate future problems.

Rule 7: Log optimization is a continuous effort that learns from errors.

8. RequestID

The RequestID is generated from the request time, the server IP, and a random number, so the ID alone identifies when a request arrived and which machine handled it.

./decode.sh 4b2c009a0a7800000142789f42b8ca96
Thu Nov 21 11:06:12 CST 2013
10.120.202.150
4b2c009a

Rule 8: Encode as much information as possible into RequestID.
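
The article does not show how decode.sh works, so the following is only a sketch. It assumes a byte layout inferred from the sample ID above: 4 random bytes, the first two bytes of the IPv4 address, an 8-byte millisecond timestamp, then the last two bytes of the address. The function names and the fixed UTC+8 zone are illustrative assumptions, not NOS's actual implementation.

```python
import ipaddress
from datetime import datetime, timedelta, timezone
from typing import Tuple

# China Standard Time (UTC+8), matching the zone printed in the sample output.
CST = timezone(timedelta(hours=8))


def encode_request_id(random_part: int, ip: str, millis: int) -> str:
    """Assumed layout: random(4B) + ip[0:2] + time_ms(8B) + ip[2:4], hex-encoded."""
    ip_bytes = ipaddress.IPv4Address(ip).packed
    raw = (
        random_part.to_bytes(4, "big")
        + ip_bytes[:2]
        + millis.to_bytes(8, "big")
        + ip_bytes[2:]
    )
    return raw.hex()


def decode_request_id(request_id: str) -> Tuple[datetime, str, str]:
    """Return (request time, server IP, random part) under the assumed layout."""
    raw = bytes.fromhex(request_id)
    if len(raw) != 16:
        raise ValueError("expected 32 hex digits")
    random_part = raw[:4].hex()
    # The IP address is split around the timestamp, so rejoin its two halves.
    ip = str(ipaddress.IPv4Address(raw[4:6] + raw[14:16]))
    millis = int.from_bytes(raw[6:14], "big")
    when = datetime.fromtimestamp(millis / 1000, tz=CST)
    return when, ip, random_part


when, ip, rnd = decode_request_id("4b2c009a0a7800000142789f42b8ca96")
print(when.strftime("%Y-%m-%d %H:%M:%S"), ip, rnd)
```

Under this assumed layout, decoding the sample ID yields the same time, IP address, and random value as the decode.sh output above.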

9. Associate Entire Request Flow with RequestID

Error stack traces that lacked the RequestID were hard to correlate with the request that caused them. Downstream services (video, image) must also log the same RequestID.

Rule 9: Link the whole processing flow of a request with a unique RequestID.
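
One possible way to attach the RequestID to every log line, including stack traces, is a context variable plus a logging filter. This is a minimal sketch using Python's standard library; the logger name, the log format, and the `X-Request-Id` header used to pass the ID downstream are assumptions for illustration.

```python
import contextvars
import io
import logging

# Holds the RequestID of the request currently being processed.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current RequestID onto every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(request_id)s] %(message)s")
    )
    logger = logging.getLogger("nos.request")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, request_id: str) -> dict:
    token = request_id_var.set(request_id)
    try:
        logger.info("request received")
        try:
            raise KeyError("missing object")  # stand-in for a real failure
        except KeyError:
            # The stack trace is logged under the same RequestID.
            logger.exception("lookup failed")
        # Headers for a downstream call; the header name is an assumption.
        return {"X-Request-Id": request_id_var.get()}
    finally:
        request_id_var.reset(token)


buffer = io.StringIO()
headers = handle_request(build_logger(buffer), "4b2c009a0a7800000142789f42b8ca96")
print(buffer.getvalue())
```

A downstream service that reads the forwarded ID and sets it in its own context would then log under the same RequestID.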

10. Log Level on Production Machines

DEBUG logs are too verbose to enable on every machine. Keep most machines at INFO and enable DEBUG on one machine for troubleshooting.

Rule 10: Keep one machine with DEBUG level enabled.
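
As a rough sketch, the log level can be chosen per machine at startup. The host name `nos-web-01` and the `DEBUG_HOSTS` set are hypothetical; in practice the list would likely come from deployment configuration.

```python
import logging
import socket

# Hypothetical set of machines allowed to run at DEBUG.
DEBUG_HOSTS = {"nos-web-01"}


def level_for_host(hostname: str, debug_hosts=DEBUG_HOSTS) -> int:
    """DEBUG on the designated machine(s), INFO everywhere else."""
    return logging.DEBUG if hostname in debug_hosts else logging.INFO


# Apply the level once at startup, based on the local host name.
logging.getLogger("nos").setLevel(level_for_host(socket.gethostname()))
```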

11. Post‑deployment Log Observation

After a new feature rollout, observe logs (e.g., bucket cache operations) on the DEBUG‑enabled machine before scaling.

Rule 11: Observe logs after new servers go live to verify feature correctness.

12. Slow‑Operation Logs

Record the start and end time of each request, and log at WARN when the duration exceeds a threshold. Apply the same rule to calls to external dependencies.

Rule 12: Use log level escalation to discover potential problems.
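
The escalation described here could be sketched as a small timing wrapper. The thresholds, the logger name, and the operation name `metadata_db.query` are illustrative assumptions.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("nos.timing")  # hypothetical logger name


def level_for_duration(elapsed_s: float, threshold_s: float) -> int:
    """INFO for normal calls, WARNING once the threshold is exceeded."""
    return logging.WARNING if elapsed_s > threshold_s else logging.INFO


@contextmanager
def timed(operation: str, threshold_s: float = 0.5):
    """Log how long `operation` took, escalating slow calls to WARNING."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        logger.log(
            level_for_duration(elapsed, threshold_s),
            "%s took %.3fs (threshold %.3fs)",
            operation, elapsed, threshold_s,
        )


with timed("metadata_db.query", threshold_s=0.2):
    time.sleep(0.01)  # stand-in for a call to an external dependency
```

Because slow calls surface at WARN, they remain visible on machines that run at INFO or above, without enabling DEBUG.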

13. Log Alerts

Monitor the number of ERROR entries and raise an alert when it climbs. Alert on keywords as well (e.g., “Quota Warning”) so that users can be told about capacity issues.

Rule 13: Monitor logs with alerts to detect issues before customers notice.

Rule 14: Use keywords in logs to determine system health.
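
Both kinds of alert can be sketched in a few lines. The class below fires when too many ERROR entries arrive within a sliding time window, and the helper reports which keywords a line contains. The window size, the limit, and the sample log line are assumptions; a real deployment would more likely rely on an existing monitoring system.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when more than `max_errors` ERROR entries arrive within `window_s` seconds."""

    def __init__(self, max_errors: int, window_s: float):
        self.max_errors = max_errors
        self.window_s = window_s
        self._times = deque()

    def observe(self, timestamp: float, level: str) -> bool:
        if level == "ERROR":
            self._times.append(timestamp)
        # Drop errors that have fallen out of the window.
        while self._times and timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.max_errors


def matched_keywords(line: str, keywords=("Quota Warning",)) -> list:
    """Return the alert keywords that appear in a log line."""
    return [keyword for keyword in keywords if keyword in line]


alert = ErrorRateAlert(max_errors=2, window_s=60)
fired = [alert.observe(t, "ERROR") for t in (0, 1, 2)]
print(fired)
print(matched_keywords("WARN Quota Warning: bucket usage at 95%"))
```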

14. Log Format

Inconsistent formats hinder automation; use a common logging function to standardize.

Rule 15: Keep log format uniform and standardized.
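
A common logging function might look like the sketch below, which renders every entry as an event name followed by sorted key=value pairs. The key=value layout and the field names are assumptions; the point is only that all modules go through one function.

```python
import logging


def format_fields(event: str, **fields) -> str:
    """Render an event plus key=value pairs in a fixed, sorted order."""
    pairs = " ".join(f"{key}={fields[key]}" for key in sorted(fields))
    return f"event={event} {pairs}".rstrip()


def log_event(logger: logging.Logger, level: int, event: str, **fields) -> None:
    """Single entry point so that every module emits the same layout."""
    logger.log(level, format_fields(event, **fields))


print(format_fields("upload_part", upload_id="abc123", part=3))
```

With a fixed layout, a script can extract fields such as `upload_id` from any module's output using the same pattern.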

15. Separate Error Log Files

Under high concurrency, error entries are buried among other log output; writing them to a dedicated file makes them easier to analyze.

Rule 16: Output error logs to a separate file for analysis.
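
With Python's standard logging module, this can be sketched by attaching two file handlers with different levels. The file names and logger name are assumptions, and the demo writes to a temporary directory rather than a real log directory.

```python
import logging
import os
import tempfile


def configure_file_logging(logger: logging.Logger, all_path: str, error_path: str) -> None:
    """Send every record to `all_path` and ERROR-or-above records also to `error_path`."""
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    all_handler = logging.FileHandler(all_path)
    all_handler.setLevel(logging.DEBUG)
    all_handler.setFormatter(fmt)

    error_handler = logging.FileHandler(error_path)
    error_handler.setLevel(logging.ERROR)  # filters out anything below ERROR
    error_handler.setFormatter(fmt)

    logger.setLevel(logging.DEBUG)
    logger.addHandler(all_handler)
    logger.addHandler(error_handler)


log_dir = tempfile.mkdtemp()  # demo only; use the real log directory in practice
app_log = os.path.join(log_dir, "app.log")
error_log = os.path.join(log_dir, "error.log")

demo_logger = logging.getLogger("nos.demo.split")
configure_file_logging(demo_logger, app_log, error_log)
demo_logger.info("request served")
demo_logger.error("storage backend unreachable")
```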

16. Log File Size Management

Split logs by day or by hour, depending on volume. Collect logs centrally, and regularly delete old local logs (e.g., those older than 60 days).

Rule 17: Establish policies for log size, rotation, and deletion.
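
A rotation and retention policy like this can be sketched with `TimedRotatingFileHandler` from Python's standard library. The helper name and its parameters are assumptions. Note that `backupCount` limits the number of rotated files rather than their age, so it only approximates "60 days" when rotation runs on schedule.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def rotating_handler(path: str, hourly: bool = False, keep_days: int = 60) -> TimedRotatingFileHandler:
    """Rotate daily (or hourly for high-volume logs) and keep about `keep_days` of files."""
    when = "H" if hourly else "midnight"
    backups = keep_days * 24 if hourly else keep_days
    # delay=True postpones opening the file until the first record is written.
    handler = TimedRotatingFileHandler(path, when=when, backupCount=backups, delay=True)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler


daily = rotating_handler("app.log")
hourly = rotating_handler("access.log", hourly=True)
```

Central collection is outside the scope of this sketch; the handler only manages files on the local machine.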

Summary of Experience

The whole team must have clear log‑level definitions and handling procedures.

Periodically optimize log content for fast, accurate issue location.

Define log purposes and classify accordingly.

Avoid useless logs that drown out important information.

Ensure logs are accurate and complete enough to locate problems.

Apply strict standards to test program logs.

Log optimization is an ongoing effort learning from errors.

Encode as much information as possible into RequestID.

Associate the entire request flow with a unique RequestID.

Enable DEBUG on one machine.

Observe logs after new servers are deployed to verify functionality.

Use log‑level escalation to discover potential problems.

Monitor logs with alerts to detect issues before customers notice.

Use keywords in logs to assess system status.

Maintain a uniform log format.

Separate error logs for analysis.

Establish policies for log size, rotation, and deletion.

References

[1] “Optimal Logging” Anthony Vallone, Google
