
Log Management, Observability, and APM Practices in Distributed Systems

This article explains what logs are, when to record them, their value in large‑scale architectures, and how to build effective logging, metrics, and tracing platforms using tools such as ELK, Prometheus, and SkyWalking, while also presenting good and bad logging practices and sample batch‑log retrieval code.


Logs are time‑ordered records that capture events, enabling precise system tracing, error diagnosis, and performance analysis. They are classified by severity (FATAL, WARNING, NOTICE, DEBUG, TRACE) and are essential for debugging, performance optimization, security monitoring, and audit trails in large‑scale distributed architectures.

Recording logs at appropriate moments—such as system start‑up, error occurrences, and critical business flows—provides the raw data needed for troubleshooting and operational insight.

The value of logs lies in their ability to capture every system behavior, support incident investigation, guide product decisions through user‑behavior analysis, and reveal security threats like login failures or abnormal accesses.

In distributed environments, manual log access via SSH and grep becomes inefficient; centralized log collection, processing, storage, and visualization platforms are required. These platforms enable fast multi‑dimensional queries, reduce storage costs, and support real‑time alerting.

Observability combines three pillars—Logging, Metrics, and Tracing. Metrics (e.g., CPU, memory, request latency) are aggregated numeric data stored in time‑series databases like Prometheus. Tracing records request‑level call chains, allowing pinpointing of latency bottlenecks. Together they provide a comprehensive view of system health.

Prometheus, an open‑source monitoring solution, scrapes metrics, stores them in a TSDB, and integrates with Grafana for visualization and Alertmanager for notifications.
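To make the scrape model concrete, here is a dependency-free sketch that renders one counter sample in Prometheus's text exposition format — the format a `/metrics` endpoint serves when Prometheus scrapes it. In real code you would use the official `client_golang` library; the metric name and labels here are invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderCounter formats one sample in the Prometheus text exposition format,
// e.g. http_requests_total{method="GET"} 42 — the line shape Prometheus
// parses when it scrapes a /metrics endpoint.
func renderCounter(name string, labels map[string]string, value int64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %d", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println("# TYPE http_requests_total counter")
	fmt.Println(renderCounter("http_requests_total",
		map[string]string{"method": "GET", "code": "200"}, 42))
}
```

Because samples are just labeled numbers, Prometheus can aggregate them cheaply in its TSDB — the key difference from storing raw log lines.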

ELK (Elasticsearch, Logstash, Kibana) offers a powerful log search and analysis stack. Elasticsearch provides distributed full‑text search, Logstash processes and enriches log streams, and Kibana visualizes the results.
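For instance, a typical search over collected logs might use Elasticsearch's query DSL like the sketch below (the field names `level`, `message`, and `@timestamp` follow common Logstash conventions, but are assumptions here — adjust them to your own index mapping):

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "match_phrase": { "message": "connection refused" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [ { "@timestamp": "desc" } ],
  "size": 50
}
```

The same query can be issued from Kibana's console or saved as a visualization, which is what turns raw log lines into the fast multi-dimensional queries mentioned earlier.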

OpenTracing defines a vendor‑neutral API for distributed tracing. Implementations such as Zipkin, Jaeger, and Apache SkyWalking (an Apache top‑level project) enable end‑to‑end request tracing across heterogeneous microservices.

Combining metrics, logging, and tracing yields richer insights: aggregated event statistics, detailed request‑level logs, and call‑chain performance data, facilitating rapid fault isolation and system optimization.

Good logging practices include clear messages, standardized formats, appropriate severity levels, avoidance of redundant logs, masking sensitive data, and hourly log rotation. Bad practices involve ambiguous messages, inconsistent formats, excessive or insufficient logging, misuse of severity levels, and lack of trace propagation.
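As one concrete example of the masking practice, a small helper that redacts the middle digits of a phone number before it reaches a log line (the keep-3/keep-4 format here is an assumption; adapt it to your own PII policy):

```go
package main

import "fmt"

// maskPhone keeps the first 3 and last 4 characters and redacts the middle,
// so the number stays recognizable in logs without being fully exposed.
func maskPhone(s string) string {
	if len(s) < 8 {
		return "****" // too short to mask safely; hide it entirely
	}
	return s[:3] + "****" + s[len(s)-4:]
}

func main() {
	fmt.Println(maskPhone("13812345678")) // 138****5678
}
```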

For batch log retrieval, a Go program can concurrently SSH into multiple instances and grep for specific log patterns. The code is shown below:

package main

import (
  "fmt"
  "log"
  "os/exec"
  "runtime"
  "sync"
)

var wg sync.WaitGroup

func main() {
  runtime.GOMAXPROCS(runtime.NumCPU())
  instancesHost := getInstances()
  wg.Add(len(instancesHost))
  for _, host := range instancesHost {
    go sshCmd(host)
  }
  wg.Wait()
  fmt.Println("over!")
}

func sshCmd(host string) {
  defer wg.Done()
  logPath := "/xx/xx/xx/"
  logShell := "grep 'FATAL' xx.log.20230207"
  // ssh options must be passed with -o; everything after the host is joined
  // into a single remote command line.
  cmd := exec.Command("ssh",
    "-o", "PasswordAuthentication=no",
    "-o", "ConnectTimeout=1",
    "-l", "root", host,
    "cd", logPath, "&&", logShell)
  out, err := cmd.CombinedOutput()
  fmt.Printf("exec: %s\n", cmd)
  if err != nil {
    // Don't log.Fatalf here: that would terminate every other goroutine.
    // Note that grep also exits non-zero when it simply finds no matches.
    log.Printf("cmd on %s failed: %v", host, err)
  }
  fmt.Printf("combined out:\n%s\n", string(out))
}

func getInstances() []string {
  return []string{
    "x.x.x.x",
    "x.x.x.x",
    "x.x.x.x",
  }
}

Deploying this tool on a control machine enables fast, concurrent log extraction without requiring a dedicated agent, simplifying operational workflows.

In conclusion, building a suitable log‑ops platform that integrates collection, analysis, storage, and alerting transforms passive log files into active observability assets, greatly aiding debugging, performance tuning, and system reliability.

Tags: distributed systems, APM, observability, Metrics, logging, Prometheus, tracing, ELK
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
