Operations 19 min read

Deep Dive into Logging Operations and Observability in Distributed Systems

The article examines the role of logging in distributed systems: what logs are, how severity levels are used, and why logs matter for debugging, performance tuning, security, and auditing. It then covers the challenges of inconsistent formats and request traceability, the three pillars of observability, the ELK stack and tracing tools, and practical implementation guidelines.

Baidu Geek Talk

This article provides a comprehensive exploration of logging systems in software development, covering their importance, implementation, and operational challenges in distributed architectures.

What is Logging: A log is a time-ordered record of system events that captures what happened and when, which makes it possible to locate errors precisely and analyze their root cause. Logs are categorized into five severity levels: FATAL, WARNING, NOTICE, DEBUG, and TRACE.
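As a rough illustration of how the five severity levels act as a filter, the sketch below maps them onto Python's standard `logging` module. The numeric values chosen for TRACE and NOTICE, the logger name, and the sample messages are assumptions made for this example; the article does not specify them.

```python
import io
import logging

# Assumed numeric values: TRACE and NOTICE are not built into Python's logging,
# so 5 and 25 are illustrative choices. logging.FATAL is a stdlib alias of CRITICAL.
LEVELS = {
    "TRACE": 5,
    "DEBUG": logging.DEBUG,      # 10
    "NOTICE": 25,
    "WARNING": logging.WARNING,  # 30
    "FATAL": logging.FATAL,      # 50
}
for _name, _number in LEVELS.items():
    logging.addLevelName(_number, _name)


def build_logger(name: str, threshold: str, stream: io.StringIO) -> logging.Logger:
    """Return a logger that drops every record below the given threshold."""
    logger = logging.getLogger(name)
    logger.setLevel(LEVELS[threshold])
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.handlers = [handler]
    return logger


def demo() -> list:
    stream = io.StringIO()
    logger = build_logger("demo.levels", "NOTICE", stream)
    logger.log(LEVELS["TRACE"], "entering handler")            # filtered out
    logger.log(LEVELS["DEBUG"], "cache miss for key %s", "k1")  # filtered out
    logger.log(LEVELS["NOTICE"], "request served in %d ms", 18)
    logger.log(LEVELS["WARNING"], "upstream retry %d of %d", 1, 3)
    logger.log(LEVELS["FATAL"], "cannot open config file")
    return stream.getvalue().splitlines()
```

With the threshold set to NOTICE, only the NOTICE, WARNING, and FATAL records reach the output, which is the usual production trade-off between detail and volume.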

Value of Logging: In large-scale web systems, logs record system behavior and are essential for troubleshooting, performance optimization, business decisions based on user behavior analysis, security monitoring (for example, detecting login errors and abnormal access), and audit tracking.

Distributed Architecture Challenges: As microservices proliferate, three problems emerge. Log formats become inconsistent because teams work in different programming languages; rapid iteration leads to missing logs and incorrect severity levels; and container instances are spread across thousands of servers, which makes it hard to trace a single request along its call chain.
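One common way to reduce format drift across teams is to agree on a single structured record, for example one JSON object per line. The following is a minimal sketch of that idea using Python's standard `logging` module; the field names (`ts`, `level`, `service`, `trace_id`, `msg`) are hypothetical and not taken from the article.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            # "service" and "trace_id" are assumed fields passed via `extra=`.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", ""),
            "msg": record.getMessage(),
        }
        return json.dumps(payload, ensure_ascii=False)


def demo() -> dict:
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("demo.json")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers = [handler]
    logger.warning(
        "slow query: %d ms", 850,
        extra={"service": "search", "trace_id": "abc123"},
    )
    # Parse the emitted line back to show it is machine-readable.
    return json.loads(stream.getvalue())
```

Because every service emits the same keys, a collector can index the lines without per-team parsing rules, whatever language produced them.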

APM and Observability: APM (Application Performance Management) is an approach for observing and analyzing distributed architectures. The three pillars of observability are:

Logging: Records discrete events during application execution, providing detailed system state information but requiring significant storage and query resources.

Metrics: Aggregatable numerical data (CPU usage, memory, response times, QPS, GC counts) stored in time-series databases like Prometheus.

Tracing: Request-scoped information that traces the call sequence across services, helping identify abnormal points and performance bottlenecks.
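The difference between the three pillars can be seen by deriving a metric and a trace from the same set of logged events. The sketch below uses made-up events and hypothetical field names: the metric aggregates across requests, while the trace groups events within one request.

```python
from collections import defaultdict

# Hypothetical logged events; in practice these would come from log lines.
EVENTS = [
    {"trace_id": "t1", "service": "nginx", "start_ms": 0, "duration_ms": 120},
    {"trace_id": "t1", "service": "php", "start_ms": 5, "duration_ms": 110},
    {"trace_id": "t1", "service": "go", "start_ms": 20, "duration_ms": 90},
    {"trace_id": "t2", "service": "nginx", "start_ms": 0, "duration_ms": 15},
    {"trace_id": "t2", "service": "php", "start_ms": 3, "duration_ms": 10},
]


def mean_duration_by_service(events: list) -> dict:
    """Metric view: aggregate a number across all requests, per service."""
    totals = defaultdict(lambda: [0, 0])  # service -> [sum_ms, count]
    for event in events:
        totals[event["service"]][0] += event["duration_ms"]
        totals[event["service"]][1] += 1
    return {service: total / count for service, (total, count) in totals.items()}


def call_sequence(events: list, trace_id: str) -> list:
    """Trace view: order the services one request passed through."""
    spans = [e for e in events if e["trace_id"] == trace_id]
    return [e["service"] for e in sorted(spans, key=lambda e: e["start_ms"])]
```

Here `mean_duration_by_service(EVENTS)` answers "how slow is each service on average", while `call_sequence(EVENTS, "t1")` answers "where did this particular request go".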

Logging Tools - ELK Stack: Elasticsearch (distributed search engine), Logstash (data collection and transformation), and Kibana (visualization platform) together form a widely used solution for log management, enabling distributed search and multi-dimensional analysis.
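To make "multi-dimensional analysis" concrete, the sketch below builds an Elasticsearch query DSL body that filters logs along three dimensions at once. The field names `trace_id` and `level` are assumptions about how the logs were indexed (and assume keyword-mapped fields); `@timestamp` is the conventional timestamp field in Logstash output. The function only builds the request body and does not contact a cluster.

```python
import json


def build_log_query(trace_id: str, level: str = "WARNING", since: str = "now-15m") -> dict:
    """Build a bool/filter query: one trace, one severity, a recent time window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"trace_id": trace_id}},        # assumed field name
                    {"term": {"level": level}},              # assumed field name
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
        "sort": [{"@timestamp": {"order": "asc"}}],
        "size": 100,
    }


def as_request_body(query: dict) -> str:
    """Serialize the query as it would be sent in a search request body."""
    return json.dumps(query)
```

The same body could be pasted into Kibana's developer console or sent through any HTTP client to an index's `_search` endpoint.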

Tracing Tools: OpenTracing provides vendor-neutral distributed tracing APIs. Apache SkyWalking is an open-source APM tool that originated in China and supports multiple languages and storage backends.

Practical Implementation: The article shares Baidu Wenku's logging practices, including aggregated monitoring via Argus and TSDB, batch log query tools that use SSH for parallel execution, and full-link (end-to-end) tracing with trace ID propagation across nginx, nodejs, php, and go services.
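Trace ID propagation generally means three things: reuse the ID from the incoming request if one exists, attach it to every log line, and forward it on outgoing calls. The sketch below shows that pattern in Python with `contextvars`. The header name `X-Trace-Id` and all function names are hypothetical; the article does not describe how Baidu Wenku implements this.

```python
import contextvars
import io
import logging
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name
_trace_id = contextvars.ContextVar("trace_id", default="-")


def begin_request(headers: dict) -> str:
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict:
    """Headers to attach to downstream calls so the trace continues."""
    return {TRACE_HEADER: _trace_id.get()}


class TraceIdFilter(logging.Filter):
    """Stamp the current trace ID onto every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = _trace_id.get()
        return True


def demo() -> tuple:
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("demo.trace")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers = [handler]

    begin_request({TRACE_HEADER: "abc123"})
    logger.info("fetching document")
    return stream.getvalue().strip(), outgoing_headers()
```

Once every service in the chain logs the same ID, searching the aggregated logs for that ID reconstructs the request path across hosts.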

Common Logging Issues: Unclear information, non-standard formats, insufficient logs, redundant logs, incorrect severity levels, string concatenation instead of placeholders, logging in loops, unsanitized sensitive data, missing log rotation, and lack of global trace context propagation.
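Several of these issues have direct counterparts in Python's standard `logging` module. The sketch below shows placeholders instead of string concatenation, masking of sensitive data, a guard around expensive debug output, and size-based log rotation. The phone-number pattern, file sizes, and logger name are illustrative assumptions.

```python
import logging
import logging.handlers
import os
import re
import tempfile

# Assumed sensitive-data pattern: an 11-digit phone number.
_PHONE = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")


def mask_phone(text: str) -> str:
    """Keep the first three and last four digits, mask the middle."""
    return _PHONE.sub(r"\1****\2", text)


class MaskingFilter(logging.Filter):
    """Sanitize the fully formatted message before it is written."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = mask_phone(record.getMessage())
        record.args = ()
        return True


def build_logger(path: str) -> logging.Logger:
    logger = logging.getLogger("demo.rotating")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Rotate at 10 MiB and keep 5 old files (illustrative limits).
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=10 * 1024 * 1024, backupCount=5, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    handler.addFilter(MaskingFilter())
    logger.handlers = [handler]
    return logger


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.log")
        logger = build_logger(path)
        # Placeholders: formatting is deferred until the record is emitted.
        logger.info("login failed for %s", "13812345678")
        # Guard costly debug output instead of building it unconditionally.
        if logger.isEnabledFor(logging.DEBUG):
            logger.debug("state dump: %s", "expensive to compute")
        for handler in logger.handlers:
            handler.close()
        with open(path, encoding="utf-8") as fh:
            return fh.read().strip()
```

Placeholders also keep the message template stable, which makes identical events easier to group during aggregation.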

Tags: Distributed Systems, monitoring, APM, observability, devops, Logging, Prometheus, tracing, ELK, SkyWalking
Written by

Baidu Geek Talk
