Why Most Logging and Metrics Strategies Fail – and How to Fix Them
The author reflects on the shortcomings of current logging, metrics, and tracing practices, explains why they become costly and unscalable, and offers concrete recommendations—including log level discipline, structured logging, metric aggregation, and the use of tools like Prometheus, Cortex, and Thanos—to build a more efficient observability stack.
Logs
Uncontrolled log volume quickly becomes unscalable and expensive. Log‑level monitoring and the proliferation of formats (JSON, Windows Event Log, GELF, Nginx, etc.) create fragile pipelines and high storage costs. Key recommendations:
Question the necessity of storing every log line. If logs are not required for compliance, consider dropping them entirely.
Separate audit logs from operational logs. A common pattern is to write audit events to DynamoDB via an SQS → Lambda pipeline, using TTL for automatic expiration.
Set a deliberately low SLA for non‑critical logs (e.g., a 99% SLA still permits roughly 7 h 14 min of downtime per month), so the logging pipeline does not have to be engineered for high availability.
Apply sampling. OpenTelemetry provides an alpha priority‑based log sampling feature that lets you lower the sampling rate as services mature.
Standardize log structure (e.g., JSON) and validate it at build time to avoid format drift.
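The sampling and structure-validation recommendations above can be sketched with the Python standard library alone. This is a minimal illustration, not the OpenTelemetry feature mentioned in the list: the required field names, the `validate_record` helper, and the `SamplingFilter` class are all hypothetical, and the "keep warnings, sample the rest" rule is only one possible priority policy.

```python
import json
import logging
import zlib

# Hypothetical schema: the fields every log line is expected to carry.
REQUIRED_FIELDS = {"timestamp", "level", "service", "message"}


def validate_record(line: str) -> list[str]:
    """Return the problems found in one JSON log line (empty list if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    problems = [
        f"missing field: {name}"
        for name in sorted(REQUIRED_FIELDS - set(record))
    ]
    # Flag nested values, since nested structures are what tends to break parsers.
    problems += [
        f"nested value in field: {key}"
        for key, value in record.items()
        if isinstance(value, (dict, list))
    ]
    return problems


class SamplingFilter(logging.Filter):
    """Keep every record at WARNING or above and a fixed fraction of the rest."""

    def __init__(self, rate: float) -> None:
        super().__init__()
        self.rate = rate  # fraction of low-priority records to keep, 0.0 to 1.0

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        # Hash the message so the same message always gets the same decision.
        bucket = zlib.crc32(record.getMessage().encode("utf-8")) % 10_000
        return bucket < self.rate * 10_000
```

A build step could run `validate_record` over sample output from each service and fail on any non-empty result, while `SamplingFilter(0.1)` attached to a handler would drop roughly nine in ten distinct low-priority messages. The rate can then be lowered as a service matures.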
Typical log formats discussed:
JSON – easy to parse but nested structures can break parsers.
Windows Event Log – high volume, limited standardization.
GELF – UDP‑friendly, used by large companies.
Common Log Format / Nginx combined format – classic Apache‑style lines.
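To illustrate how a classic line-oriented format can be normalized into the standardized JSON structure recommended earlier, here is a small sketch that parses one line in Nginx's default `combined` format. The regular expression assumes the stock format (remote address, user, timestamp, request, status, bytes, referer, user agent); a customized `log_format` would need a different pattern, and the function name is hypothetical.

```python
import json
import re
from typing import Optional

# Pattern for the default "combined" layout; custom log_format settings will not match.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def combined_to_json(line: str) -> Optional[str]:
    """Convert one combined-format access log line to a flat JSON object."""
    match = COMBINED.match(line)
    if match is None:
        return None  # caller decides whether to drop or quarantine the line
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    size = fields["body_bytes_sent"]
    # Apache-style logs may write "-" when no body was sent.
    fields["body_bytes_sent"] = 0 if size == "-" else int(size)
    return json.dumps(fields)
```

Returning `None` for unparseable lines keeps the failure explicit, which matters when several formats share one pipeline.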
Metrics
Prometheus‑compatible metrics are easy to add, but scaling Prometheus, handling high cardinality, and ensuring high availability require careful design. Three common scaling patterns are:
Hierarchical federation : a top‑level Prometheus scrapes lower‑level instances.
Cross‑service federation : one service's Prometheus scrapes selected series from another service's Prometheus, so that both datasets can be queried and alerted on within a single server.
Long‑term storage : offload historic data to systems such as Cortex or Thanos.
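In both federation patterns, the scraping server pulls series from another Prometheus through its `/federate` endpoint, selecting them with one or more `match[]` parameters. The sketch below only builds such a URL so the shape of the request is visible; the host name and selectors are made-up examples, and in practice the scrape is normally configured in the Prometheus scrape configuration rather than issued by hand.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, matchers: list[str]) -> str:
    """Build a /federate URL that selects series using match[] parameters."""
    # Each selector becomes its own match[] parameter, URL-encoded.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical lower-level server and selectors, for illustration only.
    print(federate_url("http://prom-eu.example:9090", ['{job="api"}']))
```

Keeping the selectors narrow (for example, only pre-aggregated recording rules) is what stops the top-level server from inheriting the full cardinality of the servers beneath it.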
Retention should be bounded (e.g., 30‑day retention on a single Prometheus instance). For longer retention, consider:
Cortex
Cortex uses a push‑based architecture (Prometheus servers send samples to it via remote write) with the following components:
Distributor
Ingester
Querier
Compactor
Store gateway
Alertmanager (optional)
Configs API (optional)
Overrides exporter (optional)
Query frontend (optional)
Query scheduler (optional)
Ruler (optional)
Running Cortex requires a key‑value store (such as Consul or etcd) for shared state like the hash ring, plus careful configuration of each service.
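In Cortex, the key-value store holds a hash ring that distributors consult to decide which ingesters receive a given series. The following is a heavily simplified sketch of that idea (consistent hashing with several tokens per node and a replication factor), not Cortex's actual implementation: the `Ring` class, the hash function, the token count, and the ingester names are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash32(value: str) -> int:
    """Map a string to a 32-bit token (illustrative choice of hash)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:4], "big")


class Ring:
    """Toy consistent-hash ring: each ingester owns several tokens."""

    def __init__(self, ingesters: list[str], tokens_per_ingester: int = 64) -> None:
        pairs = sorted(
            (_hash32(f"{name}-{index}"), name)
            for name in ingesters
            for index in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in pairs]
        self._owners = [owner for _, owner in pairs]

    def owners(self, series_key: str, replication_factor: int = 3) -> list[str]:
        """Walk clockwise from the series hash, collecting distinct ingesters."""
        start = bisect.bisect_right(self._tokens, _hash32(series_key))
        found: list[str] = []
        for offset in range(len(self._tokens)):
            owner = self._owners[(start + offset) % len(self._tokens)]
            if owner not in found:
                found.append(owner)
                if len(found) == replication_factor:
                    break
        return found
```

Because every distributor must see the same ring, the store that holds it becomes a hard dependency, which is part of the operational cost of running Cortex.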
Thanos
Thanos is typically deployed alongside existing Prometheus servers, most often through a sidecar, and consists of:
Sidecar – attaches to Prometheus, uploads blocks to object storage.
Store Gateway – serves blocks from object storage.
Compactor – deduplicates and down‑samples data.
Receiver – accepts remote‑write traffic.
Ruler – evaluates recording and alerting rules.
Querier – provides Prometheus‑compatible query API.
Query Frontend – caches queries and splits large requests.
Both Cortex and Thanos add operational overhead; choose based on required scale, query latency, and cost.
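The request splitting performed by a query frontend can be pictured as cutting one long range query into interval-aligned pieces that can be cached and executed independently. The sketch below shows only that splitting step under simple assumptions (UTC timestamps, a 24-hour interval, alignment to the Unix epoch); the function name is hypothetical and this is not code from Thanos or Cortex.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_range(
    start: datetime,
    end: datetime,
    interval: timedelta = timedelta(hours=24),
) -> list[tuple[datetime, datetime]]:
    """Split [start, end) into pieces that end on multiples of `interval`.

    Both datetimes must be timezone-aware.
    """
    pieces: list[tuple[datetime, datetime]] = []
    cursor = start
    while cursor < end:
        # First interval boundary strictly after the cursor.
        next_boundary = EPOCH + ((cursor - EPOCH) // interval + 1) * interval
        upper = min(next_boundary, end)
        pieces.append((cursor, upper))
        cursor = upper
    return pieces
```

Aligning pieces to fixed boundaries means that two dashboards asking for overlapping ranges produce identical sub-queries for the days they share, so cached results can be reused.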
Tracing
Distributed tracing provides high‑resolution request flow visibility with built‑in sampling, making it cheaper than exhaustive logging. SaaS solutions (e.g., Google Cloud Trace) are easy to adopt and cost‑effective, but adoption is often low. Key points:
Tracing is primarily a debugging tool, not a compliance or business‑intelligence source.
Sampling reduces data volume; you can configure the sampling rate from 0 % to 100 %.
When using a SaaS provider, monitor usage to avoid unexpected charges.
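A configurable sampling rate is often implemented as a deterministic decision on the trace ID, so that every service handling the same request reaches the same verdict. The sketch below is modeled loosely on ratio-based trace-ID sampling and uses only the standard library; it is not the API of any tracing SDK, and the function names are hypothetical.

```python
import secrets

# Only the low 64 bits of the 128-bit trace ID take part in the decision.
TRACE_ID_MASK = (1 << 64) - 1


def new_trace_id() -> int:
    """Generate a random 128-bit trace ID."""
    return secrets.randbits(128)


def should_sample(trace_id: int, ratio: float) -> bool:
    """Return True if this trace should be kept at the given ratio (0.0 to 1.0)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Traces whose low 64 bits fall below the bound are sampled.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound
```

With a ratio of 0.0 nothing is kept and with 1.0 everything is; values in between keep approximately that fraction of traces, which also makes the sampling rate a direct lever on a SaaS bill.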
Conclusion
Effective observability requires disciplined handling of each signal:
Logs : store only what is needed, use separate audit pipelines, and apply sampling.
Metrics : keep retention short, offload historic data to Cortex or Thanos, and monitor cardinality.
Tracing : leverage built‑in sampling and consider SaaS providers for simplicity.
Balancing cost, scalability, and operational overhead is essential to avoid over‑engineered, expensive monitoring stacks.
dbaplus Community
