vivo Distributed Tracing System Agent Technology Principles and Practical Experience

Started in 2017, the vivo distributed tracing system uses a JavaAgent-based micro-kernel architecture. It relies on ByteBuddy for non-intrusive bytecode instrumentation, a Disruptor lock-free queue for buffering, and Kafka for transport to capture Trace/Span data, including context propagated across threads. Sampling, degradation, and JVM metrics collection keep the Agent stable; the system has reached a 94% adoption rate.

vivo Internet Technology

In 2017, the vivo Internet R&D team initiated the development of a distributed tracing system based on Google's Dapper paper, drawing inspiration from systems like SkyWalking, Zipkin, and PinPoint. This article provides an in-depth technical analysis of the Agent technology principles and practical experience.

Core Concepts: The system is built around two fundamental concepts. A Trace represents the complete call chain across the distributed system for a single business request, while a Span represents a single local call within that chain. The TraceId (30-character string) serves as the key identifier and contains the Linux PID (4 chars), the IPv4 address in hex (8 chars), an environment identifier (1 char), a millisecond timestamp (12 chars), and an atomic auto-increment ID (4 chars).
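
To make the TraceId layout concrete, here is a minimal sketch that concatenates the listed fields in the listed order. It is not vivo's implementation: the class name TraceIdSketch, the hex encoding of the PID, timestamp, and counter, and the truncation of the PID to its low 16 bits are all assumptions made for illustration. The listed field widths add up to 29 characters, so the sketch produces 29 characters rather than guessing where the remaining one goes. ProcessHandle requires Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the TraceId layout described in the text.
final class TraceIdSketch {
    private static final AtomicInteger SEQ = new AtomicInteger();

    /** Concatenates PID, IPv4, environment flag, timestamp, and sequence number. */
    static String build(long pid, byte[] ipv4, char env, long epochMillis, int seq) {
        if (ipv4.length != 4) {
            throw new IllegalArgumentException("expected a 4-byte IPv4 address");
        }
        StringBuilder sb = new StringBuilder(29);
        // PID, 4 chars (assumption: hex, low 16 bits only)
        sb.append(String.format("%04x", pid & 0xFFFF));
        // IPv4 in hex, 2 chars per octet, 8 chars total
        for (byte b : ipv4) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        // Environment identifier, 1 char
        sb.append(env);
        // Millisecond timestamp, 12 chars (assumption: zero-padded hex)
        sb.append(String.format("%012x", epochMillis & 0xFFFFFFFFFFFFL));
        // Atomic auto-increment ID, 4 chars (assumption: hex, wraps at 65536)
        sb.append(String.format("%04x", seq & 0xFFFF));
        return sb.toString();
    }

    /** Builds an ID for the current process and time. */
    static String next(byte[] ipv4, char env) {
        return build(ProcessHandle.current().pid(), ipv4, env,
                System.currentTimeMillis(), SEQ.getAndIncrement());
    }
}
```

Because the host and process fields are fixed for the life of a JVM, only the timestamp and counter change between calls, which keeps ID generation cheap on the request path.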

Technical Implementation: The Agent uses JavaAgent technology for non-intrusive bytecode instrumentation. Key technical components include:

JavaAgent configuration with premain method and Instrumentation mechanism

Bytecode manipulation using ByteBuddy for method logic modification, field addition, and interface implementation

Cross-thread data transmission through custom ThreadPoolExecutor bytecode modification (addressing InheritableThreadLocal limitations in thread pool scenarios)

Disruptor-based lock-free queue for high-performance data buffering
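
The thread pool limitation mentioned in the list can be illustrated without any bytecode tooling. The sketch below is an assumption-laden stand-in: it wraps each task by hand, whereas the Agent is described as achieving the same effect by modifying ThreadPoolExecutor bytecode. The names TraceContextPropagation, TRACE_ID, and wrap are hypothetical. An InheritableThreadLocal is copied only when a thread is created, so a pooled worker that already exists never sees a trace ID set later; capturing the value at submit time and restoring it in the worker closes that gap.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: explicit capture-and-restore of trace context.
final class TraceContextPropagation {
    static final InheritableThreadLocal<String> TRACE_ID = new InheritableThreadLocal<>();

    /** Captures the submitter's trace ID now and restores it around the task later. */
    static Runnable wrap(Runnable task) {
        final String captured = TRACE_ID.get(); // read on the submitting thread
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                // Leave the pooled thread as we found it.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    /** Returns what a pooled worker observes: unwrapped, then wrapped twice. */
    static String[] demo() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        String[] seen = new String[3];
        try {
            // Warm-up: the worker thread is created before any trace ID is set.
            pool.submit(() -> { }).get();
            TRACE_ID.set("trace-A");
            // Unwrapped task: the existing worker inherited nothing.
            pool.submit(() -> { seen[0] = TRACE_ID.get(); }).get();
            // Wrapped tasks: the submitter's current value is carried over.
            pool.submit(wrap(() -> seen[1] = TRACE_ID.get())).get();
            TRACE_ID.set("trace-B");
            pool.submit(wrap(() -> seen[2] = TRACE_ID.get())).get();
            return seen;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }
}
```

Doing this through instrumentation rather than an explicit wrap call is what keeps the approach non-intrusive: application code keeps submitting plain Runnable instances.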

Data Collection: The Agent implements AOP-style instrumentation for RPC calls and slow SQL monitoring. Span data flows from a ThreadLocal cache to Disruptor and then to Kafka, with comprehensive governance strategies including sampling control, degradation, exception flow control, and JVM metrics collection.
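
As one way to picture sampling control and degradation working together, the following sketch keeps one trace out of every N and stops sampling entirely while a degradation switch is on. The article does not say which sampling algorithm the Agent uses, so the counting strategy and the names CountingSampler, shouldSample, and setDegraded are assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: count-based sampling with a degradation switch.
final class CountingSampler {
    private final int sampleEvery;
    private final AtomicLong counter = new AtomicLong();
    private volatile boolean degraded;

    CountingSampler(int sampleEvery) {
        if (sampleEvery < 1) {
            throw new IllegalArgumentException("sampleEvery must be >= 1");
        }
        this.sampleEvery = sampleEvery;
    }

    /** Decides at trace start whether this trace should be collected. */
    boolean shouldSample() {
        if (degraded) {
            return false; // degradation: collect nothing
        }
        return counter.getAndIncrement() % sampleEvery == 0;
    }

    /** Flipped by a governance component, for example when errors spike. */
    void setDegraded(boolean degraded) {
        this.degraded = degraded;
    }
}
```

Making the decision once per trace, before any Span is cached, means unsampled requests pay only for one atomic increment.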

Stability Assurance: The system has reached a 94% adoption rate by following non-blocking design principles: buffering logs through Disruptor, optimizing reflection usage, wrapping Agent logic in try-catch throughout, and providing graceful degradation mechanisms.
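
The non-blocking principle can be sketched with the JDK alone. The example below deliberately swaps in java.util.concurrent.ArrayBlockingQueue for the Disruptor ring buffer, so it is lock-based rather than lock-free; what it does show is the contract that matters to business threads: publishing never waits and never throws, and a full buffer results in a dropped Span rather than a stalled request. The names NonBlockingSpanBuffer, publish, and drain are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a bounded buffer that drops instead of blocking.
// ArrayBlockingQueue is a stand-in here; the article's Agent uses Disruptor.
final class NonBlockingSpanBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    NonBlockingSpanBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called on business threads: never blocks, never throws. */
    boolean publish(String span) {
        try {
            if (queue.offer(span)) { // offer returns false immediately when full
                return true;
            }
        } catch (Throwable t) {
            // Swallow everything: tracing must not break the application.
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Called by a background sender thread; removes up to max buffered spans. */
    List<String> drain(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Counting drops instead of ignoring them gives the governance side a signal for exception flow control or for switching to degraded mode.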

Architecture: The Agent follows a micro-kernel architecture with isolated modules for logging, monitoring, strategy control, bytecode transformation, and class loading. Controlling class loading is critical to avoid ClassNotFoundException in runtime environments such as Tomcat, where application classes are loaded by container-specific class loaders.
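
Why class loading needs deliberate control can be seen in a few lines. The sketch below is an illustration only, not the Agent's loader: it creates a URLClassLoader with a null parent, which delegates to the bootstrap loader alone. Such a loader sees core JDK classes but not application classes, which is the kind of visibility gap that surfaces as ClassNotFoundException when agent code and application code live in different loaders. The names AgentClassLoaderSketch, isolated, and canLoad are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: an isolated loader and a visibility probe.
final class AgentClassLoaderSketch {

    /** Creates a loader that sees only the given jars and bootstrap classes. */
    static ClassLoader isolated(URL... agentJars) {
        // A null parent means delegation goes straight to the bootstrap loader.
        return new URLClassLoader(agentJars, null);
    }

    /** Reports whether the loader can resolve the named class. */
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

Isolation in this direction keeps the Agent's dependencies from clashing with the application's; the reverse direction, where instrumented application code must reach Agent classes, is what the Agent has to arrange explicitly for each container.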
