How OpenTelemetry and Jaeger Power Cloud‑Native Tracing
This article explains cloud‑native observability, defines its three pillars—metrics, tracing, and logging—details the OpenTelemetry tracing data model and Span structure, reviews industry implementations such as Jaeger and Alibaba Eagle Eye, and shares practical challenges and solutions from real‑world production use.
Concept Introduction
Observability in cloud‑native systems is the ability to infer internal state from external outputs. The three foundational pillars are Metrics, Tracing, and Logging.
Metrics : Aggregatable atomic values such as counters or histograms (e.g., number of incoming HTTP requests).
Tracing : Captures request‑scoped data and metadata (e.g., the actual SQL query sent to a database).
Logging : Handles discrete events such as debug or error messages, typically written to rotated log files that are then shipped to a cluster for centralized processing.
Metrics consume the least resources because they are highly compressible; logging can dominate traffic volume, while tracing falls between the two in overhead.
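To make the distinction concrete, here is a minimal sketch in which one hypothetical request handler emits all three signal types. The names `handle_request`, `request_count`, and `finished_spans` are illustrative assumptions and do not come from any library; a real service would use an observability SDK instead of plain dictionaries.

```python
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

request_count = Counter()  # metric: one aggregatable counter per route
finished_spans = []        # tracing: one request-scoped record per request


def handle_request(route: str, sql: str) -> dict:
    """Handle one hypothetical request and emit a metric, a span, and a log line."""
    request_count[route] += 1                  # Metrics: only the count survives
    span = {
        "trace_id": uuid.uuid4().hex,
        "name": route,
        "start": time.time(),
        "attributes": {"db.statement": sql},   # Tracing: the actual query is kept
    }
    log.info("handling %s", route)             # Logging: a discrete event
    span["end"] = time.time()
    finished_spans.append(span)
    return span


handle_request("/orders", "SELECT * FROM orders WHERE id = 1")
handle_request("/orders", "SELECT * FROM orders WHERE id = 2")
print(request_count["/orders"], len(finished_spans))
```

The counter stays a single number no matter how many requests arrive, whereas the span list and the log stream grow with every request, which is why metrics are the cheapest signal to keep.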
Tracing Data Model (OpenTelemetry Example)
OpenTelemetry defines a trace as a directed acyclic graph (DAG) of Span objects. Each Span encapsulates the following state:
Name : A human‑readable label for the operation the span represents.
Start and End Timestamps : When the operation began and when it finished.
Span Context : Carries the Trace ID (identifies the overall trace) and the Span ID (uniquely identifies the span within the trace), along with trace flags and trace state.
Attributes : Key‑value metadata that annotates the span with additional information about the operation.
Span Events : Structured log‑like messages representing meaningful points in time within the span.
Span Links : Associations to one or more other spans, describing upstream/downstream relationships; useful for asynchronous workflows.
Span Status : Status code indicating the outcome of the operation (Unset, Ok, or Error).
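The span state listed above can be sketched as a pair of plain data classes. This is an illustrative model only, not the OpenTelemetry SDK: the class layout, the `start_span` helper, and the `parent_span_id` field are assumptions made for this example. The ID sizes (16 bytes for a trace ID, 8 bytes for a span ID) follow the OpenTelemetry specification.

```python
import secrets
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class StatusCode(Enum):
    UNSET = 0
    OK = 1
    ERROR = 2


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars (16 bytes), shared by every span in the trace
    span_id: str   # 16 hex chars (8 bytes), unique to this span


@dataclass
class Span:
    name: str
    context: SpanContext
    parent_span_id: Optional[str] = None
    start_time_ns: int = field(default_factory=time.time_ns)
    end_time_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # (timestamp_ns, name, attributes)
    links: list = field(default_factory=list)   # SpanContext of related spans
    status: StatusCode = StatusCode.UNSET

    def add_event(self, name: str, attributes: Optional[dict] = None) -> None:
        self.events.append((time.time_ns(), name, attributes or {}))

    def end(self) -> None:
        self.end_time_ns = time.time_ns()


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Hypothetical helper: a child reuses its parent's trace ID, a root gets a new one."""
    trace_id = parent.context.trace_id if parent else secrets.token_hex(16)
    return Span(
        name=name,
        context=SpanContext(trace_id, secrets.token_hex(8)),
        parent_span_id=parent.context.span_id if parent else None,
    )


root = start_span("GET /orders")
child = start_span("SELECT orders", parent=root)
child.attributes["db.system"] = "mysql"
child.add_event("rows fetched", {"count": 3})
child.status = StatusCode.OK
child.end()
root.end()
```

Because every child records its parent's span ID and shares the trace ID, a back end can rebuild the trace graph from spans that arrive independently and out of order.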
Industry Tracing Implementations
Uber Jaeger
Jaeger is an open‑source, cloud‑native tracing platform created at Uber. It joined the CNCF in 2017 and graduated in 2019. Jaeger was originally built around the OpenTracing API and now accepts OpenTelemetry data as well.
Jaeger’s architecture consists of the following components:
jaeger-client : SDK that creates and reports spans; it supports dynamic sampling, so the sampling rate can respond to storage pressure.
jaeger-agent : Local daemon that receives spans from the client, batches them, and forwards them to the collector; it also relays sampling policies to the client.
jaeger-collector : Aggregates, processes, and stores tracing data.
jaeger-query and jaeger-ui : Provide query capabilities and a user interface.
Jaeger integrates with middleware instrumentation, supports multiple protocols (e.g., HTTP), and can store data in Cassandra, Elasticsearch, or other open‑source back‑ends.
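Sampling is what keeps the data volume of a tracing pipeline like this manageable. The sketch below shows one common approach, a deterministic ratio sampler keyed on the trace ID, so that every service reaches the same keep-or-drop decision for a given trace. It is not Jaeger's actual sampler code; the function name `should_sample` and the use of the low 64 bits of the ID are assumptions made for illustration.

```python
import secrets


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < int(ratio * (1 << 64))


# Sample roughly 10% of 10,000 randomly generated trace IDs.
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID and the ratio, no coordination between services is needed, and lowering the ratio under load reduces volume without producing partially recorded traces.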
Official site: https://www.jaegertracing.io/
Alibaba Eagle Eye
Eagle Eye is Alibaba’s log‑based distributed tracing system built for high‑traffic events such as Double‑11. It addresses fault localization, capacity estimation, and resource waste by providing real‑time link analysis and visualized monitoring.
Key characteristics:
Lightweight architecture with real‑time streaming data presentation.
Visualized monitoring pipelines that lower integration cost for developers.
Selective sampling based on analysis scenarios to reduce data volume.
The platform supports HTTP/TCP protocols, middleware or bytecode‑enhanced instrumentation, and stores data in HDFS, HBase, HStore, or MPP databases.
Practical Challenges and Solutions (Baidu Experience)
Large‑scale tracing in production faces several difficulties:
High data volume : Requires high‑performance SDKs, efficient sampling strategies, optimized encoding/mapping algorithms, and tiered storage based on data type and usage.
Low integration cost : SDKs must be easy to adopt, with simple APIs and minimal developer effort; automatic instrumentation should cover most use‑cases, while custom hooks remain straightforward.
Stability requirements : Buffer spans in local persistent storage, send tracing data from background tasks rather than on the request path, and retry failed uploads robustly.
Advanced feature demands : Include confidence analysis of metrics, real‑time multi‑window aggregation for short‑term and long‑term trends, and concise visualizations that convey maximum information with minimal indicators.
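The buffering and retry ideas from the stability item can be sketched as follows. This is a simplified illustration, not the production design described above: the class `BufferedExporter`, its parameters, and the `send` callback are hypothetical, and an in-memory bounded queue stands in for local persistence.

```python
import time
from collections import deque


class BufferedExporter:
    """Hypothetical exporter: bounded buffer plus retry with exponential backoff."""

    def __init__(self, send, max_buffer=1000, max_retries=3, base_delay=0.01):
        self._send = send                        # callable that uploads a list of spans
        self._buffer = deque(maxlen=max_buffer)  # oldest spans are dropped when full
        self._max_retries = max_retries
        self._base_delay = base_delay

    def enqueue(self, span) -> None:
        self._buffer.append(span)

    def pending(self) -> int:
        return len(self._buffer)

    def flush(self, batch_size=100) -> int:
        """Try to deliver one batch; return the number of spans delivered."""
        count = min(batch_size, len(self._buffer))
        batch = [self._buffer.popleft() for _ in range(count)]
        if not batch:
            return 0
        for attempt in range(self._max_retries):
            try:
                self._send(batch)
                return len(batch)
            except ConnectionError:
                time.sleep(self._base_delay * (2 ** attempt))
        # Give up for now: put the batch back at the front, preserving order.
        self._buffer.extendleft(reversed(batch))
        return 0


attempts = []


def flaky_send(batch):
    """Stand-in back end that fails on the first call only."""
    attempts.append(len(batch))
    if len(attempts) == 1:
        raise ConnectionError("back end unavailable")


exporter = BufferedExporter(flaky_send)
for i in range(5):
    exporter.enqueue({"span_id": i})
delivered = exporter.flush()
print(delivered, exporter.pending())
```

In a real service, `flush` would typically run on a background thread or timer so that a slow or unavailable back end never blocks request handling, and the bounded buffer caps memory use by discarding the oldest spans first.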
References
OpenTelemetry – Spans: https://opentelemetry.io/docs/concepts/signals/traces/#spans-in-opentelemetry
Benjamin H. Sigelman, Luiz André Barroso et al., “Dapper: A Large‑Scale Distributed Systems Tracing Infrastructure”, 2010.
Uber Jaeger engineering blog: https://eng.uber.com/distributed-tracing/