Why Most Logging and Metrics Strategies Fail – and How to Fix Them

The author reflects on the shortcomings of current logging, metrics, and tracing practices, explains why they become costly and unscalable, and offers concrete recommendations—including log level discipline, structured logging, metric aggregation, and the use of tools like Prometheus, Cortex, and Thanos—to build a more efficient observability stack.

dbaplus Community

Jul 10, 2023

Logs

Uncontrolled log volume quickly becomes unscalable and expensive. Log‑level monitoring and the proliferation of formats (JSON, Windows Event Log, GELF, Nginx, etc.) create fragile pipelines and high storage costs. Key recommendations:

Question the necessity of storing every log line. If logs are not required for compliance, consider dropping them entirely.

Separate audit logs from operational logs. A common pattern is to write audit events to DynamoDB via an SQS → Lambda pipeline, using TTL for automatic expiration.

Accept a lower SLA for non-critical logs. For example, a 99% availability target allows roughly 7 h 12 min of downtime in a 30-day month, which leaves room for a simpler, cheaper pipeline.

Apply sampling. OpenTelemetry provides an alpha priority‑based log sampling feature that lets you lower the sampling rate as services mature.

Standardize log structure (e.g., JSON) and validate it at build time to avoid format drift.
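The build-time validation idea can be sketched as a small check run against sample log lines in CI. This is a minimal sketch, not a standard: the field names in REQUIRED_FIELDS and the level names in ALLOWED_LEVELS are assumptions chosen for illustration, and a real project would substitute its own schema.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {
    "timestamp": str,
    "level": str,
    "service": str,
    "message": str,
}
# Assumed set of level names; adjust to whatever the logging library emits.
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def validate_log_line(line: str) -> list[str]:
    """Return the problems found in one JSON log line (empty list if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    level = record.get("level")
    if isinstance(level, str) and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level}")
    return problems
```

A test suite could feed each service's example log output through validate_log_line and fail the build when any line returns a non-empty list, which catches format drift before it reaches the pipeline.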

Typical log formats discussed:

JSON – easy to parse but nested structures can break parsers.

Windows Event Log – high volume, limited standardization.

GELF (Graylog Extended Log Format) – a JSON-based format designed for transport over UDP, used by large companies.

Common Log Format / Nginx combined format – classic Apache‑style lines.
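One way to tame format proliferation is to convert classic access-log lines into structured records at the edge of the pipeline. The sketch below parses a line in the Nginx "combined" layout into a dictionary. It assumes the default field order and does not handle escaped quotes inside quoted fields; parse_combined and the sample line are illustrative names, not part of any library.

```python
import re
from typing import Optional

# Default "combined" layout:
# remote_addr - remote_user [time_local] "request" status bytes "referer" "user_agent"
COMBINED_RE = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_combined(line: str) -> Optional[dict]:
    """Parse one combined-format line; return None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so downstream consumers get real integers.
    record["status"] = int(record["status"])
    size = record["body_bytes_sent"]
    record["body_bytes_sent"] = 0 if size == "-" else int(size)
    return record


sample = (
    '127.0.0.1 - frank [10/Jul/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "http://example.com/" "Mozilla/5.0"'
)
parsed = parse_combined(sample)
```

Lines that return None can be counted and dropped or routed to a dead-letter location, rather than silently breaking a downstream parser.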

Metrics

Prometheus‑compatible metrics are easy to add, but scaling Prometheus, handling high cardinality, and ensuring high availability require careful design. Three common scaling patterns are:

Hierarchical federation : a top‑level Prometheus scrapes lower‑level instances.

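Cross-service federation : one service's Prometheus server scrapes selected series from another service's server, so both datasets can be queried and alerted on in one place. A related pattern has multiple Prometheus servers write to a shared remote storage that a central query layer reads from.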

Long‑term storage : offload historic data to systems such as Cortex or Thanos.
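Federation works by having one Prometheus server scrape another server's /federate endpoint, passing one or more match[] selectors that choose which series to pull. The following sketch only builds such a URL so the selectors can be inspected or tested. The host name prom-low:9090 and the selectors are hypothetical, and in practice this lives in the scrape configuration of the federating server rather than in application code.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    # urlencode percent-encodes both the "match[]" key and each selector.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical lower-level server and selector, for illustration only.
url = federate_url("http://prom-low:9090", ['{job="api"}'])
```

Keeping the selector list narrow, for example to pre-aggregated recording rules, is what stops the top-level server from inheriting the full cardinality of every instance below it.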

Retention should be bounded (e.g., 30‑day retention on a single Prometheus instance). For longer retention, consider:

Cortex

Cortex uses a push-based architecture: Prometheus servers send samples to it via remote write. It consists of the following components:

Distributor

Ingester

Querier

Compactor

Store gateway

Alertmanager (optional)

Configs API (optional)

Overrides exporter (optional)

Query frontend (optional)

Query scheduler (optional)

Ruler (optional)

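Running Cortex requires a key-value store (such as Consul, etcd, or memberlist gossip) for coordination between components, object storage for the data itself, and careful configuration of each service.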

Thanos

Thanos usually runs alongside existing Prometheus servers, either through a sidecar or by receiving remote-write traffic, and consists of:

Sidecar – attaches to Prometheus, uploads blocks to object storage.

Store Gateway – serves blocks from object storage.

Compactor – deduplicates and down‑samples data.

Receiver – accepts remote‑write traffic.

Ruler – evaluates alerting rules.

Querier – provides Prometheus‑compatible query API.

Query Frontend – caches queries and splits large requests.

Both Cortex and Thanos add operational overhead; choose based on required scale, query latency, and cost.
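Before adopting either system, it helps to check whether a single Prometheus instance with bounded retention would be enough. The Prometheus storage documentation gives a rule of thumb: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, with about 1 to 2 bytes per sample on average. The sketch below applies that formula; the ingestion rate in the example is an assumed figure, not a measurement.

```python
def estimate_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte average
) -> float:
    """Rough disk estimate for local Prometheus storage."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 30-day retention at 100,000 samples per second.
needed = estimate_disk_bytes(retention_days=30, samples_per_second=100_000)
needed_gb = needed / 1e9  # 518.4 GB under these assumptions
```

If the estimate fits comfortably on one disk and the query load is modest, the operational overhead of a long-term storage layer may not be justified yet.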

Tracing

Distributed tracing provides high‑resolution request flow visibility with built‑in sampling, making it cheaper than exhaustive logging. SaaS solutions (e.g., Google Cloud Trace) are easy to adopt and cost‑effective, but adoption is often low. Key points:

Tracing is primarily a debugging tool, not a compliance or business‑intelligence source.

Sampling reduces data volume; you can configure the sampling rate from 0 % to 100 %.

When using a SaaS provider, monitor usage to avoid unexpected charges.
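The sampling rate described above is often applied deterministically from the trace ID, so every service that sees the same trace makes the same keep-or-drop decision. The following is a minimal sketch of that idea in plain Python. It is similar in spirit to ratio-based samplers found in tracing SDKs but is not any library's actual implementation; should_sample and new_trace_id are hypothetical helper names.

```python
import secrets

# Decisions are made on the low 64 bits of the trace ID (an assumption of this sketch).
ID_SPACE = 1 << 64


def new_trace_id() -> int:
    """Generate a random 128-bit trace ID."""
    return secrets.randbits(128)


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if its ID falls in the lowest `ratio` share of the ID space."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = int(ratio * ID_SPACE)
    return (trace_id & (ID_SPACE - 1)) < bound


# The same trace ID always yields the same decision for a given ratio.
trace_id = new_trace_id()
decision = should_sample(trace_id, 0.10)
```

A ratio of 0.0 drops every trace and 1.0 keeps every trace. Lowering the ratio is also the most direct lever for controlling the bill when a SaaS provider charges per ingested span.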

Conclusion

Effective observability requires disciplined handling of each signal:

Logs : store only what is needed, use separate audit pipelines, and apply sampling.

Metrics : keep retention short, offload historic data to Cortex or Thanos, and monitor cardinality.

Tracing : leverage built‑in sampling and consider SaaS providers for simplicity.

Balancing cost, scalability, and operational overhead is essential to avoid over‑engineered, expensive monitoring stacks.
