Full-Link Tracing System: Architecture, Java Agent Integration, Multi-language Support, and Data Processing
Youzan’s full‑link tracing system combines a multi‑language SDK, Java Agent dynamic attachment, transparent upgrades, asynchronous context propagation, and a Spark‑based data pipeline that indexes traces in Elasticsearch and stores them in HBase, enabling real‑time diagnostics, log correlation, and future container‑level tracing expansion.
In the context of increasingly complex enterprise‑level business systems, microservice architecture has become a standard for many medium and large enterprises. It splits a monolithic application into multiple subsystems and shared components, bringing benefits such as simplified isolation, reusable modules, faster iteration, flexible scalability, and suitability for cloud environments.
However, microservice architectures also introduce new challenges: a single user request may need to invoke dozens of subsystems, making error diagnosis and latency analysis difficult. A full‑link tracing system is designed to address these problems. The system typically consists of four major parts:
Client SDK for data collection and reporting.
Real‑time data processing system for indexing and storage.
User interaction system that provides UI for developers, testers, and operators.
Offline analysis system for statistical analysis and problem discovery.
Multiple programming languages are supported. Youzan currently supports Java, Node.js, and PHP, using the Cat protocol, whose model aligns with the industry‑standard OpenTracing specification. Both share the concepts of Trace (identifies a request) and Span (identifies a node within the trace). The trace ID is generated at the entry point and propagated unchanged across all downstream calls, while each node creates its own span ID.
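The Trace/Span relationship can be illustrated with a minimal sketch. The class and method names (SpanContext, newRoot, newChild) and the UUID‑based ID format are assumptions made for illustration; they are not taken from the Cat protocol or Youzan's SDK.

```java
import java.util.UUID;

/** Hypothetical span context; names and ID format are illustrative only. */
final class SpanContext {
    final String traceId;      // shared by every node that handles the request
    final String spanId;       // unique to this node
    final String parentSpanId; // null at the entry point

    private SpanContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    /** Entry point: there is no upstream context, so a fresh trace ID is generated. */
    static SpanContext newRoot() {
        return new SpanContext(newId(), newId(), null);
    }

    /** Downstream node: the trace ID is reused unchanged, the span ID is new. */
    SpanContext newChild() {
        return new SpanContext(traceId, newId(), spanId);
    }

    private static String newId() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}
```

A downstream call would typically serialize traceId and spanId into the request metadata so that the receiving node can build its own child context from them.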
Java Agent and Attach API
The Java Agent is typically added to the JVM startup parameters with -javaagent. It registers a ClassFileTransformer, which the JVM invokes on the JVMTI_EVENT_CLASS_FILE_LOAD_HOOK event, allowing bytecode transformation at class load time.
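The registration step can be sketched with the standard java.lang.instrument API. The class names and the package prefix com/example/rpc/ are hypothetical, and the transformer here only counts matching classes instead of rewriting them, since real bytecode rewriting needs a library such as Byte‑Buddy or ASM.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

/** Transformer that only considers classes under a target package prefix (hypothetical). */
final class TracingTransformer implements ClassFileTransformer {
    private final String internalPrefix;              // internal form, e.g. "com/example/rpc/"
    final AtomicInteger matched = new AtomicInteger(); // number of classes that would be enhanced

    TracingTransformer(String internalPrefix) {
        this.internalPrefix = internalPrefix;
    }

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        // className uses '/' separators and can be null for some generated classes.
        if (className == null || !className.startsWith(internalPrefix)) {
            return null; // null tells the JVM to keep the original bytes
        }
        matched.incrementAndGet();
        // A real agent would return rewritten bytes here; this sketch changes nothing.
        return null;
    }
}

/**
 * Agent entry point. The agent JAR's manifest must name this class in Premain-Class.
 * In a packaged agent it would normally be a public class in its own source file.
 */
final class TracingAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new TracingTransformer("com/example/rpc/"));
    }
}
```

With the JAR built, the agent would be enabled by starting the JVM with -javaagent followed by the path to that JAR.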
Adding -javaagent to hundreds of startup scripts is labor‑intensive. The Attach API, introduced in Java 6, offers an alternative: it enables dynamic attachment to a running JVM process. The process ID can be obtained via JMX or, on Java 9 and later, java.lang.ProcessHandle, and the agent can be installed at runtime with ByteBuddyAgent.install().
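As a rough sketch of dynamic attachment, the following uses the JDK's own com.sun.tools.attach.VirtualMachine class (module jdk.attach) instead of ByteBuddyAgent.install(). The class name DynamicAttach and the parameter names are hypothetical, and the agent JAR is assumed to declare an Agent-Class with an agentmain method in its manifest.

```java
import com.sun.tools.attach.VirtualMachine;

/** Hypothetical helper around the JDK Attach API; requires a JDK, not a bare JRE. */
final class DynamicAttach {

    /** PID of the current JVM, using ProcessHandle (Java 9 and later). */
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    /** Attaches to the JVM with the given PID and loads the agent JAR into it. */
    static void attach(String pid, String agentJarPath, String agentArgs) throws Exception {
        VirtualMachine vm = VirtualMachine.attach(pid);
        try {
            // Loading the JAR makes the target JVM call the agent's agentmain method.
            vm.loadAgent(agentJarPath, agentArgs);
        } finally {
            vm.detach();
        }
    }
}
```

Note that since Java 9 a JVM refuses to attach to itself unless it is started with -Djdk.attach.allowAttachSelf=true, so this helper is best run from a separate process that targets the application's PID.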
Byte‑Buddy is a powerful bytecode enhancement framework built on ASM, offering high‑level APIs such as subclass(), redefine(), and rebase().
Transparent Upgrade
Youzan’s framework and middleware components are managed by a dedicated JAR container that starts before the application class loader. The tracing SDK is loaded by this container, allowing bytecode transformers to be installed before the application starts. This enables zero‑impact tracing and seamless, transparent upgrades of the SDK.
The following table compares the non‑transparent approach with the Java Agent approach:
| Aspect | Non‑transparent | Java Agent |
| --- | --- | --- |
| Instrumentation method | Dubbo Filter + Spring Interceptor + AOP + Javassist… | Unified interface and enhancement model |
| Integration cost | Strong dependency on Maven/Gradle and manual configuration | Transparent integration |
| Upgrade method | Upgrade each application individually | Transparent upgrade |
Asynchronous Call Tracing
Within a single process, trace information is usually stored in a ThreadLocal . For asynchronous calls, this information can be lost. While InheritableThreadLocal propagates context to child threads, it does not work well with thread‑pool reuse. A common solution is the Capture/Replay model: capture the current context when creating an async thread, transfer the snapshot to the child thread, and replay it there.
Youzan provides utility classes such as FrameworkRunnable and FrameworkCallable that are enhanced to support Capture/Replay automatically. For custom threads, wrapping them with AsyncUtil.wrap() enables the same behavior.
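The Capture/Replay model can be sketched in a few lines. TraceContext and TracedRunnable are hypothetical names for illustration; they are not the implementation of FrameworkRunnable or AsyncUtil.wrap(), and the context is reduced to a single trace ID string.

```java
/** Hypothetical per-thread trace context holding only a trace ID. */
final class TraceContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static String get() {
        return CURRENT.get();
    }

    static void set(String traceId) {
        if (traceId == null) {
            CURRENT.remove();
        } else {
            CURRENT.set(traceId);
        }
    }
}

/** Runnable wrapper that captures the context at creation and replays it when run. */
final class TracedRunnable implements Runnable {
    private final Runnable delegate;
    private final String captured;

    private TracedRunnable(Runnable delegate) {
        this.delegate = delegate;
        this.captured = TraceContext.get(); // capture on the submitting thread
    }

    static Runnable wrap(Runnable task) {
        return new TracedRunnable(task);
    }

    @Override
    public void run() {
        String previous = TraceContext.get();
        TraceContext.set(captured);         // replay on the executing thread
        try {
            delegate.run();
        } finally {
            TraceContext.set(previous);     // restore so pooled threads do not leak context
        }
    }
}
```

Capturing at wrap time, not at thread creation time, is what makes this work with thread pools: a reused worker thread receives the context of whichever task it is currently running, and the finally block clears it again before the next task.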
Encountered Issues
Package conflicts: The SDK’s dependencies (e.g., Byte‑Buddy) may clash with application dependencies. Maven’s shade plugin resolves this by relocating the SDK’s dependencies: it uses ASM to scan the classes and rewrite their package names.
API coupling: Some scenarios require an explicit API for the business system. Providing an empty implementation and enhancing it at runtime decouples the API from the SDK.
Child‑first class loading deadlock: If the JAR container does not follow the parent‑first delegation model, class loading locks can cause deadlocks. Filtering out the affected classes in the ClassFileTransformer mitigates this.
System Integration
The tracing system integrates with Youzan’s unified access system. For testing, a request can be sampled at 100% by setting a custom HTTP header (e.g., -XXDebug), which forces the generation of a trace ID that matches the sampling rule.
It also integrates with the Tianwang logging system. After tracing is enabled, the SDK stores the trace ID in the MDC (via put), and the logging SDK retrieves it to include in log entries, enabling trace‑based log queries and hyperlinking logs to trace details.
Data Processing Architecture
Trace data is processed by a Spark Streaming job in near‑real‑time, indexed into Elasticsearch, and stored in HBase. Data is reported via a local agent that forwards to a remote collector, which then pushes to a Kafka queue for the streaming job. This two‑stage reporting reduces network congestion and avoids overwhelming Kafka with a large number of direct connections.
Optimizations include:
Using a local agent to buffer and asynchronously send data, preventing SDK queue overflow.
Iterative reduction of data‑transfer hops, saving resources at scale.
Replacing a Java‑based processing job with Spark Streaming, halving CPU and memory consumption.
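The buffering idea behind the local agent can be sketched as a bounded queue with a non‑blocking producer side and a batching consumer side. SpanBuffer and its methods are hypothetical names, spans are reduced to strings, and dropping on overflow is an assumed policy; the article does not describe how Youzan's agent handles a full buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded buffer between span producers and an asynchronous sender. */
final class SpanBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    SpanBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking: never stalls the business thread; the span is dropped when the buffer is full. */
    boolean offer(String span) {
        boolean accepted = queue.offer(span);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called by the sender thread: removes up to max spans, in arrival order, for one network send. */
    List<String> drainBatch(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

A sender thread would call drainBatch in a loop and forward each batch to the remote collector, so that many small reports become a few larger requests over a small number of connections.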
Conclusion and Outlook
The full‑link tracing system comprises SDK collection, data processing services, and user‑facing products. Future evolution will focus on:
Empowering Youzan Cloud developers with container‑level tracing.
Migrating data models and APIs to the OpenTracing standard for broader language support.
Continuous product iteration to improve user experience.
Supporting more components and middleware.
Enhancing offline analysis capabilities based on trace data.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.