Mastering OpenTelemetry: From Setup to Advanced Sampling and Production‑Ready Practices

This guide walks through the fundamentals of OpenTelemetry: component architecture, environment setup, and SDK and Collector configuration for Java, Go, and Kubernetes. It then dives into common pitfalls, performance tuning, security hardening, high‑availability deployment, and advanced tail‑based sampling strategies.


Overview

OpenTelemetry (OTel) is the CNCF‑backed standard for distributed tracing and observability. Its flexibility, however, introduces configuration complexity, especially around defaults that can cause performance surprises.

Technical components

SDK – instrumentation library that creates spans.

Collector – independent pipeline that receives, processes and exports telemetry.

Exporter – sends data to back‑ends such as Jaeger, Zipkin or Tempo.

Propagator – carries trace context across service boundaries.
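
A minimal sketch of how these four pieces wire together in the Java SDK; the service name and Collector endpoint match the examples later in this guide, and everything else is illustrative:

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// Exporter: send spans to the Collector, not directly to the back-end
OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
    .setEndpoint("http://otel-collector:4317")
    .build();

// SDK + Propagator: batch spans for export, propagate W3C traceparent
OpenTelemetry otel = OpenTelemetrySdk.builder()
    .setTracerProvider(SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .build())
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .build();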

Typical environment

OpenTelemetry SDK 1.32.0 (Java/Go/Python)

OpenTelemetry Collector 0.96.0 (contrib build)

Jaeger 1.54.0 with Elasticsearch 8.x

Kubernetes 1.28 (deployment platform)

Getting started and common pitfalls

Sampling strategy

Development / testing – 100 %.

Staging – 10‑50 %.

Production – 0.1‑1 % with dynamic or tail‑based sampling.
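
As a sketch, the production rate above maps to a parent‑based ratio sampler in the Java SDK (0.01 = 1 %); parent‑based sampling makes child services honor the root span's decision:

import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

SdkTracerProvider provider = SdkTracerProvider.builder()
    // sample 1 % of new traces; otherwise inherit the parent's decision
    .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.01)))
    .build();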

Collector deployment modes

# Mode 1: Sidecar (one Collector per Pod)
# Mode 2: DaemonSet (one Collector per Node)
# Mode 3: Deployment (stand‑alone Collector cluster)

Production often combines a DaemonSet for ingestion and a Deployment for aggregation and export.
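
One hedged sketch of wiring the two tiers together, assuming a Service named otel-gateway in front of the aggregation Deployment: the contrib load‑balancing exporter routes by trace ID, so all spans of a trace land on one gateway instance, which matters for the tail‑based sampling covered later.

# agent-tier (DaemonSet) exporter: route by trace ID to the gateway
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway.monitoring.svc.cluster.local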

SDK configuration (Java Spring Boot example)

# application.yml
# property names follow the OpenTelemetry Spring Boot starter:
# OTLP settings live under otel.exporter.otlp, not otel.traces
otel:
  service:
    name: order-service
  traces:
    exporter: otlp
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
  resource:
    attributes:
      "deployment.environment": production
      "service.version": 1.2.3

Pitfall 1: Using the HTTP port (4318) with the gRPC protocol prevents any data from being sent.

Pitfall 2: Missing key resource attributes (service.name, service.version, k8s.pod.name) makes debugging much harder.
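
A sketch of one way to supply the Kubernetes attributes, using the Downward API; the POD_NAME variable name is illustrative, and the SDK reads OTEL_RESOURCE_ATTRIBUTES automatically:

# pod spec fragment: inject k8s.pod.name into the SDK resource
env:
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: OTEL_RESOURCE_ATTRIBUTES
  value: "k8s.pod.name=$(POD_NAME),deployment.environment=production"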

Collector configuration (YAML)

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 8192
    send_batch_max_size: 16384
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: false
      cert_file: /etc/ssl/certs/collector.crt
      key_file: /etc/ssl/private/collector.key
extensions:
  health_check:  # serves the liveness/readiness endpoint used by the DaemonSet below
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]

Pitfall 3: Processor order matters; memory_limiter must be first in the chain.

Pitfall 4: A send_batch_size that is too small causes excessive network calls, while one that is too large drives up memory usage.

Pitfall 5: Under‑provisioned memory (the default limit is 512 MiB) causes OOM kills; estimate roughly 200 MiB per 1,000 spans/s.
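
As a sanity check, that rule of thumb puts a node handling 10,000 spans/s at roughly 2 GiB, which is why the example above sets limit_mib: 2048 rather than relying on the default.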

Kubernetes DaemonSet deployment

# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: collector
        image: otel/opentelemetry-collector-contrib:0.96.0
        args: [--config=/conf/otel-collector-config.yaml]
        ports:
        - containerPort: 4317
          hostPort: 4317
          protocol: TCP
        - containerPort: 4318
          hostPort: 4318
          protocol: TCP
        - containerPort: 13133
          protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: "/"
            port: 13133
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: "/"
            port: 13133
          initialDelaySeconds: 5
          periodSeconds: 5
        volumeMounts:
        - name: config
          mountPath: /conf
      volumes:
      - name: config
        configMap:
          name: otel-collector-config
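
The mounted config comes from a ConfigMap, which can be created from the YAML shown earlier, for example:

kubectl create configmap otel-collector-config \
  --from-file=otel-collector-config.yaml -n monitoring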

Pitfall 6: Tail‑based sampling buffers entire traces in memory; at 100,000 buffered traces of ~10 KB each, memory can reach 1 GB. Size num_traces and the Collector's memory allocation together.

Advanced collector features – Tail‑based sampling

# otel-collector-tail-sampling.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  groupbytrace:
    wait_duration: 10s
    num_traces: 100000
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
    - name: composite-policy
      type: composite
      composite:
        max_total_spans_per_second: 10000
        policy_order: [errors-policy, latency-policy, debug-policy, probabilistic-policy]
        # sub-policies must be nested under composite_sub_policy;
        # defining them as separate top-level policies would sample
        # matching traces unconditionally, bypassing rate allocation
        composite_sub_policy:
        - name: errors-policy
          type: status_code
          status_code:
            status_codes: [ERROR]
        - name: latency-policy
          type: latency
          latency:
            threshold_ms: 2000
        - name: debug-policy
          type: string_attribute
          string_attribute:
            key: debug
            values: ["true"]
        - name: probabilistic-policy
          type: probabilistic
          probabilistic:
            sampling_percentage: 1
        rate_allocation:
        - policy: errors-policy
          percent: 30
        - policy: latency-policy
          percent: 30
        - policy: debug-policy
          percent: 10
        - policy: probabilistic-policy
          percent: 30
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [groupbytrace, tail_sampling]
      exporters: [otlp/jaeger]
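
Note that tail‑based sampling can only decide correctly when every span of a trace reaches the same Collector instance. When the aggregation tier runs multiple replicas, route by trace ID in front of it, for example with the load‑balancing exporter sketched earlier.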

Best practices

Performance optimisation

Reduce span count – aggregate loop iterations into a single span with attributes instead of creating a span per iteration.
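
A sketch of that pattern, reusing the otel instance from the earlier wiring example; Order, orders, and process() are hypothetical placeholders:

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

Tracer tracer = otel.getTracer("order-service");
Span span = tracer.spanBuilder("process-batch").startSpan();  // one span for the whole loop
try (Scope scope = span.makeCurrent()) {
    int failures = 0;
    for (Order order : orders) {          // Order/orders are illustrative
        if (!process(order)) failures++;
    }
    span.setAttribute("batch.size", orders.size());
    span.setAttribute("batch.failures", failures);
} finally {
    span.end();
}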

Asynchronous export – use BatchSpanProcessor with tuned queue and batch sizes:

import java.util.concurrent.TimeUnit;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

BatchSpanProcessor processor = BatchSpanProcessor.builder(exporter)
    .setMaxQueueSize(10000)                 // spans buffered before new ones are dropped
    .setMaxExportBatchSize(512)             // spans sent per export call
    .setScheduleDelay(5, TimeUnit.SECONDS)  // maximum wait between exports
    .build();

Context propagation overhead – send only essential headers (e.g., traceparent) and prefer binary propagation for gRPC.
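
For manual instrumentation, injecting the current context into an outgoing request looks roughly like this; the target URI is illustrative and otel is the instance from the earlier sketch:

import java.net.URI;
import java.net.http.HttpRequest;
import io.opentelemetry.context.Context;

HttpRequest.Builder requestBuilder =
    HttpRequest.newBuilder(URI.create("http://inventory-service/check"));
// writes the traceparent (and any baggage) headers onto the request
otel.getPropagators().getTextMapPropagator()
    .inject(Context.current(), requestBuilder,
            (builder, key, value) -> builder.header(key, value));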

Security hardening

Redact or hash sensitive attributes in the Collector before export:

processors:
  attributes:
    actions:
    - key: http.request.header.authorization
      action: delete
    - key: user.email
      action: hash
  # the attributes processor has no truncate action; cap long values
  # such as db.statement with the transform processor (OTTL) instead
  transform:
    trace_statements:
    - context: span
      statements:
      - set(attributes["db.statement"], Substring(attributes["db.statement"], 0, 1000)) where Len(attributes["db.statement"]) > 1000

Enable TLS between Collector and back‑ends, or use a service‑mesh (Istio/Linkerd) for mTLS inside Kubernetes.

High availability

Deploy the Collector as a Deployment with multiple replicas and an HPA for scaling (a sketch follows below).

Run Jaeger + Elasticsearch as a clustered service with replica settings.
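
A hedged sketch of the HPA mentioned above, targeting the aggregation Deployment; replica counts and the CPU threshold are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70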

Common errors

Broken traces – context not propagated; ensure the HTTP client injects the trace headers.

Span loss – Collector queue full; increase maxQueueSize or add replicas.

High latency – synchronous export blocking requests; switch to BatchSpanProcessor.

OOM – insufficient memory; raise the Collector's memory limit or lower num_traces for tail sampling.

Inconsistent data – sampling decisions differ across services; use parent‑based sampling everywhere.

Troubleshooting and monitoring

Log inspection

Java SDK – set java.util.logging level to FINE.

Go SDK – set environment variable OTEL_LOG_LEVEL=debug.

Collector – enable debug logging in service.telemetry.logs.level.
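
In Collector config that is a small addition under service.telemetry; a sketch, with internal metrics on the default :8888 address:

# collector self-diagnostics: debug logs plus internal metrics
service:
  telemetry:
    logs:
      level: debug
    metrics:
      address: 0.0.0.0:8888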

Key Collector metrics

otelcol_processor_batch_batch_send_size – distribution of batch sizes.

otelcol_exporter_sent_spans – spans successfully exported.

otelcol_exporter_send_failed_spans – spans that failed to export.

otelcol_processor_dropped_spans – spans dropped in processing.

Prometheus alerts (example)

# Alert when Collector memory >90%
- alert: OTelCollectorHighMemory
  expr: container_memory_usage_bytes{container="otel-collector"} / container_spec_memory_limit_bytes{container="otel-collector"} > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OTel Collector memory usage above 90%"

# Alert when spans are dropped
- alert: OTelCollectorSpansDropped
  expr: rate(otelcol_processor_dropped_spans[5m]) > 100
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "OTel Collector dropping spans"

Backup & restore for Elasticsearch back‑ends

# Create snapshot repository
curl -X PUT "localhost:9200/_snapshot/jaeger_backup" -H 'Content-Type: application/json' -d '{"type": "fs", "settings": {"location": "/mnt/backups/jaeger"}}'
# Create snapshot
curl -X PUT "localhost:9200/_snapshot/jaeger_backup/snapshot_$(date +%Y%m%d)"
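
Restoring is the reverse operation; a sketch, assuming the snapshot name follows the date pattern above and the target Jaeger indices are closed or deleted first:

# Restore Jaeger indices from a snapshot
curl -X POST "localhost:9200/_snapshot/jaeger_backup/snapshot_20240101/_restore" -H 'Content-Type: application/json' -d '{"indices": "jaeger-*"}'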

Conclusion

Start with a low production sampling rate and use tail‑based sampling for error or latency‑critical traces.

Allocate sufficient memory for the Collector; memory_limiter must be first in the processor chain.

Processor order is critical: memory_limiter → batch → other processors.

Unified context propagation prevents trace breaks.

Redact sensitive data at the Collector level.

Monitor OTel components themselves to avoid blind spots.
