How Uber Built Jaeger: From In‑House Tracing to a Cloud‑Native Open‑Source Platform

Uber’s engineering team chronicles the evolution of its distributed tracing system—from the early Merckx pull‑based solution and TChannel integration to the open‑source Jaeger platform—detailing architectural shifts, sampling strategies, multi‑language client libraries, and the move toward a fully cloud‑native, end‑to‑end observability stack.

ITFLY8 Architecture Home

From Monolith to Microservice Architecture

As Uber’s business grew rapidly, the number of microservices expanded from about 500 in late 2015 to over 2,000 by early 2017, increasing system complexity and reducing visibility across services. Traditional monitoring tools (metrics, logs) could not provide cross‑service insight, prompting the adoption of distributed tracing.

Uber’s First Tracing System: Merckx

Merckx, named after the world-record cyclist, was Uber’s initial pull-based tracing system for its Python monolith. It stored trace data in a tree-like structure, exposed a Kafka-backed command-line query interface, and offered a web UI for predefined summaries. Its design assumed a monolith, however: Merckx had no concept of distributed context propagation, so it could not follow a request across service boundaries. It also kept the current request’s state in global thread-local storage, which breaks down under asynchronous frameworks like Tornado, where many concurrent requests share a single thread.
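The thread-local limitation can be reproduced in a few lines. The sketch below is illustrative only and is not Merckx code: it uses asyncio from the Python standard library as a stand-in for Tornado, and the names handle, run_two_requests, and demo are hypothetical. Two requests run concurrently on one thread; the thread-local slot is overwritten by whichever request wrote last, while a context variable keeps one value per task.

```python
import asyncio
import contextvars
import threading

# Request state kept two ways: per thread, and per asyncio task.
_thread_state = threading.local()
_ctx_state = contextvars.ContextVar("request_id", default=None)


async def handle(request_id, seen_thread_local, seen_contextvar):
    """Record the request id, yield to the event loop, then read it back."""
    _thread_state.request_id = request_id
    _ctx_state.set(request_id)
    # Yield control so the other request runs on the same thread.
    await asyncio.sleep(0)
    seen_thread_local[request_id] = _thread_state.request_id
    seen_contextvar[request_id] = _ctx_state.get()


async def run_two_requests():
    seen_thread_local, seen_contextvar = {}, {}
    await asyncio.gather(
        handle("req-A", seen_thread_local, seen_contextvar),
        handle("req-B", seen_thread_local, seen_contextvar),
    )
    return seen_thread_local, seen_contextvar


def demo():
    return asyncio.run(run_two_requests())


if __name__ == "__main__":
    thread_local_view, contextvar_view = demo()
    print("thread-local:", thread_local_view)  # req-A sees req-B's id
    print("contextvar:  ", contextvar_view)    # each request sees its own id
```

Any tracer that stores "the current request" per thread hits the same problem once requests interleave on one thread, which is why per-request context has to travel with the request itself.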

Switching to TChannel for Tracing

In early 2015 Uber developed TChannel, an RPC multiplexing protocol that embedded tracing fields directly in its binary format (e.g., spanid:8 parentid:8 traceid:8 traceflags:1, where each number is the field width in bytes). Open-source client libraries were released for multiple languages, allowing request context to flow from an inbound server request to the outbound calls it makes downstream. The libraries generated spans that used Zipkin-style annotations such as cs (Client Send) and cr (Client Receive), and used a Reporter interface to push spans to the backend via Thrift.
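To make the field layout concrete, here is a minimal sketch that packs and unpacks the four tracing fields listed above into 25 bytes (8 + 8 + 8 + 1). It is not taken from any TChannel library: the names TracingFields, encode, decode, and child_of are hypothetical, and the big-endian byte order is an assumption made for the example.

```python
import struct
from typing import NamedTuple

# spanid:8 parentid:8 traceid:8 traceflags:1 -> three unsigned 64-bit
# integers and one unsigned byte. Big-endian is assumed here.
_TRACING = struct.Struct(">QQQB")


class TracingFields(NamedTuple):
    span_id: int
    parent_id: int
    trace_id: int
    flags: int


def encode(fields: TracingFields) -> bytes:
    """Serialize the tracing fields into their fixed 25-byte form."""
    return _TRACING.pack(*fields)


def decode(payload: bytes) -> TracingFields:
    """Parse 25 bytes back into tracing fields."""
    return TracingFields(*_TRACING.unpack(payload))


def child_of(parent: TracingFields, new_span_id: int) -> TracingFields:
    """Fields for a downstream call: same trace, parent set to the caller's span."""
    return TracingFields(new_span_id, parent.span_id, parent.trace_id, parent.flags)


if __name__ == "__main__":
    inbound = TracingFields(span_id=1, parent_id=0, trace_id=1, flags=1)
    outbound = child_of(inbound, new_span_id=2)
    print(len(encode(outbound)), decode(encode(outbound)))
```

The child_of helper shows the propagation step: the trace ID and flags are carried over unchanged, and the caller's span ID becomes the parent ID of the outbound call.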

Building Jaeger in New York

The New York observability team, formed in 2015, created a dedicated distributed-tracing group and named the project “Jaeger” (German for hunter). Leveraging existing Cassandra expertise, they replaced the Riak/Solr prototype with a Go-based collector that stored spans in Cassandra using Zipkin’s Thrift format. The collector also supported a configurable multiplication factor that repeated inbound traffic a fixed number of times, which let the team stress-test the backend.

To support services not using TChannel, Uber built client libraries for Go, Java, Python, and Node.js that implemented the OpenTracing API. These libraries added a novel feature: polling the backend for dynamic sampling strategies (e.g., always‑sample, probabilistic, rate‑limited) to adapt to traffic spikes without overwhelming the tracing system.
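The three strategy types named above can be sketched as small sampler classes. This is a simplified illustration, not the Jaeger client implementation: the class names, the is_sampled method, and the {"type": ..., "param": ...} strategy dictionary are hypothetical stand-ins for whatever the backend actually returns, and the probabilistic sampler assumes trace IDs are uniformly random 64-bit integers.

```python
import time

_ID_SPACE = 1 << 64  # assumes 64-bit trace IDs


class ConstSampler:
    """Makes the same decision for every trace (always or never sample)."""

    def __init__(self, decision=True):
        self.decision = decision

    def is_sampled(self, trace_id):
        return self.decision


class ProbabilisticSampler:
    """Samples a fixed fraction of traces, decided from the trace ID."""

    def __init__(self, rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self._boundary = int(rate * _ID_SPACE)

    def is_sampled(self, trace_id):
        return (trace_id % _ID_SPACE) < self._boundary


class RateLimitingSampler:
    """Token bucket: at most `traces_per_second` sampled traces on average."""

    def __init__(self, traces_per_second, clock=time.monotonic):
        self._rate = traces_per_second
        self._max_balance = max(traces_per_second, 1.0)
        self._balance = self._max_balance
        self._clock = clock
        self._last = clock()

    def is_sampled(self, trace_id):
        now = self._clock()
        # Refill credits for the elapsed time, capped at the bucket size.
        refill = (now - self._last) * self._rate
        self._balance = min(self._max_balance, self._balance + refill)
        self._last = now
        if self._balance >= 1.0:
            self._balance -= 1.0
            return True
        return False


def sampler_from_strategy(strategy, clock=time.monotonic):
    """Build a sampler from a polled strategy (hypothetical dict shape)."""
    kind, param = strategy["type"], strategy["param"]
    if kind == "const":
        return ConstSampler(bool(param))
    if kind == "probabilistic":
        return ProbabilisticSampler(float(param))
    if kind == "ratelimiting":
        return RateLimitingSampler(float(param), clock)
    raise ValueError(f"unknown sampling strategy: {kind}")
```

A client would call sampler_from_strategy each time a poll returns a new strategy and swap the active sampler, so a service can be moved from always-sample to a low probability during a traffic spike without a redeploy.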

Jaeger Sidecar Agent and Dynamic Sampling

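To eliminate dependencies on routing/discovery services, Uber introduced a Jaeger-agent sidecar deployed alongside metric agents on every host. The agent receives spans from client libraries over UDP, buffers them in memory, and forwards them in batches to the collectors. It also acts as a local proxy for sampling policies: client libraries poll the agent on their own host, and the agent relays the strategies defined by the central backend.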
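The buffering role of the agent can be sketched without any networking. The example below is an assumption-laden illustration, not jaeger-agent code: AgentBuffer, receive, flush, and the forward callback are hypothetical names, and a real agent would read spans from a UDP socket and flush on a timer as well as on batch size.

```python
from typing import Callable, List


class AgentBuffer:
    """Collects spans in memory and hands them on in batches."""

    def __init__(self, forward: Callable[[List[object]], None], batch_size: int = 100):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._forward = forward      # e.g. a function that submits to a collector
        self._batch_size = batch_size
        self._pending: List[object] = []

    def receive(self, span: object) -> None:
        """Called for each span that arrives from a local client."""
        self._pending.append(span)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Forward whatever is buffered; does nothing when the buffer is empty."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._forward(batch)


if __name__ == "__main__":
    agent = AgentBuffer(forward=lambda batch: print("forwarding", batch), batch_size=2)
    for name in ("span-1", "span-2", "span-3"):
        agent.receive(name)
    agent.flush()  # push the remaining partial batch
```

Because the client only has to reach a process on its own host, it needs no service discovery, and batching keeps the number of calls to the collectors well below the number of spans produced.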

Unified Distributed Tracing Architecture

While Jaeger initially relied on Zipkin UI and storage, Uber redesigned the data model to natively support key-value logs and span references, and to avoid repeating the same process-level tags in every span. A new Go-based query service and a React front-end replaced Zipkin UI, offering multiple visualizations (histograms, DAGs, critical-path graphs) and embeddable components.
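One way to picture the redesigned data model is as a handful of record types. The sketch below is a simplified illustration and not Jaeger’s actual schema: the class and field names (Log, SpanRef, Span, Process, Batch) are assumptions chosen for the example. It shows the three ideas from the paragraph above: logs as key-value fields, explicit references between spans, and process-level tags stored once per batch instead of once per span.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Log:
    """A timestamped event inside a span, described by key-value fields."""
    timestamp_us: int
    fields: Dict[str, str]


@dataclass
class SpanRef:
    """A link from one span to another, e.g. "child_of" or "follows_from"."""
    ref_type: str
    trace_id: int
    span_id: int


@dataclass
class Span:
    trace_id: int
    span_id: int
    operation_name: str
    references: List[SpanRef] = field(default_factory=list)
    tags: Dict[str, str] = field(default_factory=dict)   # span-specific tags only
    logs: List[Log] = field(default_factory=list)


@dataclass
class Process:
    """Describes the emitting process; its tags are shared by all its spans."""
    service_name: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Batch:
    """Spans reported together by one process."""
    process: Process
    spans: List[Span] = field(default_factory=list)


def example_batch() -> Batch:
    process = Process("checkout", tags={"hostname": "host-1", "client": "python"})
    root = Span(trace_id=7, span_id=1, operation_name="POST /order")
    child = Span(
        trace_id=7,
        span_id=2,
        operation_name="charge-card",
        references=[SpanRef("child_of", trace_id=7, span_id=1)],
        logs=[Log(timestamp_us=1_000, fields={"event": "retry", "attempt": "2"})],
    )
    return Batch(process=process, spans=[root, child])
```

In example_batch the hostname and client tags appear once on the Process, however many spans the batch carries, which is the duplication the redesign removes.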

The updated architecture comprises Go backend components, language‑specific OpenTracing client libraries, a React UI, and an Apache Spark pipeline for post‑processing and aggregation.

Open‑Source Release and Future Work

All Jaeger client libraries (Go, Java, Node.js, Python) are open source, and the backend and UI code are being migrated to GitHub for full public release. Uber continues to evolve its tracing platform, aiming for broader language support, improved scalability, and tighter integration with its microservice ecosystem.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Cloud Native, Microservices, Observability, Distributed Tracing, open source, jaeger
Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
