How to Build a Low‑Cost Distributed Tracing System for Microservices

This article traces the move from a monolithic architecture to microservices, outlines the pain points that follow (harder fault isolation, hidden performance bottlenecks and inefficient call patterns), and presents a practical, low‑cost distributed tracing solution built on unified frameworks, unified component wrappers, configuration management, lightweight data collection and visualization.

dbaplus Community

Feb 1, 2021

1. Architecture before microservices

In a monolithic deployment, a single site application accesses caches and databases directly, and those backends are often clustered for high availability. Debugging means adding log statements in the application layer and measuring the execution time of a few steps.

2. Pain points after adopting microservices

Fault isolation : With many services, clusters and network layers involved, locating a fault means SSHing into many nodes, checking logs on each and coordinating across teams.

Performance bottleneck identification : An HTTP request traverses many services, databases and caches, making it hard to pinpoint the slowest component.

Inefficient call patterns : Remote calls placed inside loops multiply network round trips, which inflates latency and complicates capacity planning.

3. Desired characteristics of a distributed tracing system

Full‑link visibility : Show the complete call chain from the entry HTTP request through every service, database and cache.

Cross‑process tracking : Propagate identifiers across machines and processes.

Full‑traffic collection : Capture every request, not just a sampled subset.

Additional metadata : Record request IDs, timestamps, call depth, SQL statements and cache keys alongside each call.

4. Core tracing challenges

Cross‑process tracing requires three custom fields in the RPC protocol:

Request ID – a globally unique identifier for the whole trace.

Sequence ID – a logical ordering number that does not depend on synchronized clocks.

Depth ID – records how deep in the call chain a call sits, which helps differentiate nested calls from parallel branches.

Because the RPC framework is self‑developed, these fields can be added directly to the protocol header.
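As a rough illustration of how the three fields might travel with each call, here is a minimal Python sketch. The class name, header keys and numbering scheme are assumptions made for this example: the article does not say how sequence numbers are allocated, so the sketch simply lets each process number its own outgoing calls.

```python
import uuid
from dataclasses import dataclass, field
from itertools import count


@dataclass
class TraceContext:
    """Tracing state held by one process while it handles one request (hypothetical)."""
    request_id: str          # globally unique, shared by the whole trace
    depth: int = 0           # hops between the entry request and this process
    _seq: count = field(default_factory=lambda: count(1), repr=False)

    @classmethod
    def new(cls) -> "TraceContext":
        """Called at the entry point, where no upstream header exists."""
        return cls(request_id=uuid.uuid4().hex)

    @classmethod
    def from_header(cls, header: dict) -> "TraceContext":
        """Called on the server side of an RPC to continue the caller's trace."""
        return cls(request_id=header["request_id"], depth=header["depth_id"])

    def next_header(self) -> dict:
        """Build the header fields for one outgoing call."""
        return {
            "request_id": self.request_id,
            "sequence_id": next(self._seq),  # logical order, no clock involved
            "depth_id": self.depth + 1,      # the callee is one level deeper
        }


entry = TraceContext.new()
first_call = entry.next_header()
second_call = entry.next_header()
downstream = TraceContext.from_header(first_call)
```

With a per-caller counter like this one, two calls at the same depth made by different parents can share a sequence number, so a production design would likely need extra information (for example the caller's own sequence ID) to rebuild the full tree.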

5. Practical implementation

Unified framework : Instrument the entry and exit points of both the site framework and the service (RPC) framework to record timestamps and parameters.

Unified component wrappers : Wrap database and cache clients (e.g., Redis, Memcached) so that a single modification can emit execution time, SQL statements and cache keys.

Unified configuration management :

Stage 1 – Centralized configuration files (e.g., global.conf) to avoid per‑service duplication.

Stage 2 – A shared configuration market that reduces redundancy.

Stage 3 – A full configuration centre that registers services, notifies dependents and drives dynamic connection management.
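The Stage 3 behaviour (register, notify dependents, adjust connections) can be sketched with an in-memory stand-in. Everything here is hypothetical, including the class names, the callback signature and the addresses; a real configuration centre would be a networked service with persistence.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Listener = Callable[[str, List[str]], None]


class ConfigCentre:
    """In-memory stand-in for a configuration centre (hypothetical API)."""

    def __init__(self) -> None:
        self._addresses: Dict[str, List[str]] = {}
        self._listeners: Dict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, service: str, listener: Listener) -> None:
        """A dependent asks to be told whenever `service` changes its addresses."""
        self._listeners[service].append(listener)
        if service in self._addresses:
            listener(service, list(self._addresses[service]))

    def register(self, service: str, addresses: List[str]) -> None:
        """A service publishes its current address list; dependents are notified."""
        self._addresses[service] = list(addresses)
        for listener in self._listeners[service]:
            listener(service, list(addresses))


class ConnectionPool:
    """Tracks which upstream addresses a caller should be connected to."""

    def __init__(self) -> None:
        self.active: Dict[str, List[str]] = {}

    def on_change(self, service: str, addresses: List[str]) -> None:
        # A real pool would open and close sockets here; this one only records state.
        self.active[service] = addresses


centre = ConfigCentre()
pool = ConnectionPool()
centre.subscribe("user-service", pool.on_change)
centre.register("user-service", ["10.0.0.1:8000", "10.0.0.2:8000"])
centre.register("user-service", ["10.0.0.2:8000"])  # one node taken out of service
```

After the second `register` call the pool holds only the remaining address, without the caller's own configuration being edited.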

Data collection :

UDP SDK – a low‑latency fire‑and‑forget reporter that sends trace data to a UDP collector, later persisted to Elasticsearch.

Asynchronous file logging – write locally first, then batch‑push to the collector, minimizing impact on request latency.
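A fire-and-forget UDP reporter of the kind described above might look like the following sketch. The class name, default host and port, and the JSON record layout are assumptions for illustration; the article does not describe the SDK's wire format.

```python
import json
import socket


class UdpReporter:
    """Sends one trace record per datagram and never waits for a reply."""

    def __init__(self, host: str = "127.0.0.1", port: int = 9999) -> None:
        self._addr = (host, port)  # collector address (hypothetical default)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.setblocking(False)  # a full send buffer must not stall the request

    def report(self, record: dict) -> bool:
        """Return True if the datagram was handed to the OS, False if it was dropped."""
        payload = json.dumps(record, separators=(",", ":")).encode("utf-8")
        try:
            self._sock.sendto(payload, self._addr)
            return True
        except OSError:
            return False  # losing a trace record is acceptable; failing the request is not

    def close(self) -> None:
        self._sock.close()
```

The trade-off is deliberate: UDP gives no delivery guarantee, so some records may be lost under load, which is where the file-based path offers a more reliable alternative.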

Only about ten instrumentation points are needed: request entry/exit in the site and RPC frameworks, send/receive in cache and DB clients, and RPC client/server boundaries.
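One such instrumentation point, a wrapped cache client, could be sketched as follows. The names `traced`, `TracedCache` and `DictCache` are hypothetical, and `DictCache` stands in for a real Redis or Memcached client so that the example runs on its own.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def traced(point: str, emit: Callable[[dict], None], **detail) -> Iterator[None]:
    """Time the enclosed block and emit one record, even if the block raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        cost_ms = (time.perf_counter() - start) * 1000
        emit({"point": point, "cost_ms": cost_ms, **detail})


class DictCache:
    """Stand-in backend; a real deployment would use an actual cache client."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value) -> None:
        self._data[key] = value


class TracedCache:
    """Same get/set interface as the backend, so business code needs no changes."""

    def __init__(self, backend, emit: Callable[[dict], None]) -> None:
        self._backend = backend
        self._emit = emit

    def get(self, key):
        with traced("cache.get", self._emit, cache_key=key):
            return self._backend.get(key)

    def set(self, key, value) -> None:
        with traced("cache.set", self._emit, cache_key=key):
            self._backend.set(key, value)


records = []  # in practice `emit` would be the reporter's send function
cache = TracedCache(DictCache(), records.append)
cache.set("user:42", "alice")
value = cache.get("user:42")
```

Because the timing lives in the wrapper, every service that uses the shared cache client gains tracing from a single change, which is the point of unifying components.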

6. Visualization

The backend renders a timeline view that shows total request time, per‑service breakdown, parameters, SQL statements and cache keys. Heat‑maps highlight the longest‑running nodes, enabling rapid diagnosis of failures, performance hot‑spots and unreasonable call patterns.
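To give a feel for the timeline view, the sketch below renders one request as text rows. It assumes the collector has already put the records of a single request into call order and that each record carries `name`, `depth_id` and `cost_ms` keys; those key names and the sample data are invented for this example.

```python
from typing import List


def render_timeline(records: List[dict], width: int = 40) -> str:
    """One row per record: indentation shows depth, bar length shows relative cost."""
    if not records:
        return ""
    longest = max(r["cost_ms"] for r in records)
    rows = []
    for r in records:
        if longest > 0:
            bar = "#" * max(1, round(width * r["cost_ms"] / longest))
        else:
            bar = "#"
        label = "  " * r["depth_id"] + r["name"]
        rows.append(f"{label:<30} {r['cost_ms']:>7.1f} ms {bar}")
    return "\n".join(rows)


trace = [
    {"name": "GET /order", "depth_id": 0, "cost_ms": 120.0},
    {"name": "order-service.get", "depth_id": 1, "cost_ms": 95.0},
    {"name": "SELECT orders", "depth_id": 2, "cost_ms": 80.0},
    {"name": "cache.get user:42", "depth_id": 1, "cost_ms": 2.0},
]
print(render_timeline(trace))
```

Even in this crude form, the long bar under the database row makes it clear where most of the request time went.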

7. Benefits

Fast discovery of online issues.

Quick pinpointing of performance bottlenecks.

Immediate identification of unreasonable service calls (e.g., calls hidden inside loops).

Low‑cost implementation using a unified framework, component wrappers and lightweight data collection, suitable for small teams or startups.

