
How to Achieve End-to-End Cloud Native Tracing and Solve the 3 Major Challenges

This article explains why distributed tracing is essential for modern cloud‑native systems, outlines the three toughest problems—instrumentation, data collection, and context propagation—and shows how Alibaba Cloud ARMS and OpenTelemetry provide a comprehensive, multi‑language solution for end‑to‑end traceability.

Alibaba Cloud Observability

Why Distributed Tracing Matters

When a user experiences a failed order on a delivery app, a slow navigation response during a road trip, or an unresponsive AI assistant, the underlying cause often lies in hidden performance bottlenecks. SREs, AppOps, and business operators all need a way to see the exact path of each request across user devices, gateways, backend services, and dependent components.

The Three Major Challenges of End‑to‑End Tracing

Instrumentation (Link Insertion): Adding tracing code before and after critical methods to record the method name, latency, and status. The difficulty lies in deciding which methods to instrument, keeping the cost of managing that instrumentation low, and ensuring accuracy, performance, and stability.

Data Collection and Processing: Gathering the generated trace data into a backend for analysis. Challenges include capturing complete data from cloud services (e.g., gateways) and normalizing heterogeneous trace models.

Context Propagation: Transmitting trace context across services. Multiple propagation protocols exist (W3C Trace Context, B3, Jaeger, SkyWalking), and mismatched protocols cause broken traces, especially during migrations (e.g., SkyWalking → OpenTelemetry).
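To make the first challenge concrete, here is a minimal sketch of manual instrumentation in Python using only the standard library. It wraps a method so that each call records the method name, latency, and status. The names `traced`, `RECORDED_SPANS`, and `place_order` are hypothetical and exist only for illustration; a real agent would export spans to a backend instead of appending them to a list.

```python
import functools
import time

# Hypothetical in-memory sink; a real agent would export spans to a backend.
RECORDED_SPANS = []


def traced(func):
    """Wrap func so each call records its name, latency, and status."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise  # re-raise so instrumentation never swallows failures
        finally:
            RECORDED_SPANS.append({
                "name": func.__name__,
                "latency_ms": (time.perf_counter() - start) * 1000.0,
                "status": status,
            })
    return wrapper


@traced
def place_order(order_id):
    """Hypothetical business method used to exercise the decorator."""
    if order_id < 0:
        raise ValueError("invalid order id")
    return f"order-{order_id}"
```

Even this small wrapper shows where the cost comes from: every instrumented method needs the decorator, and every call pays for the timing and the record. Automatic instrumentation aims to remove that manual step.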

Alibaba Cloud ARMS End‑to‑End Tracing Solution

ARMS (including the OpenTelemetry‑compatible observability version) supports full‑stack tracing from user terminals (Web/H5/mini‑program, Android, iOS) through cloud gateways (ALB, MSE, Ingress, ASM, API Gateway) to backend applications (Java, Go, Python, etc.) and cloud components (databases, messaging, large models). The solution offers:

Automatic instrumentation for Java and Go (Java is already GA; Go is scheduled for release in July).

Compatibility with four major tracing frameworks (OpenTelemetry, SkyWalking, Zipkin, Jaeger), covering 10+ languages.

One‑click enablement for many Alibaba Cloud products, turning logs into trace data when needed.

Supported languages and recommended integration methods:

Java – automatic instrumentation via ARMS agent.

Go – automatic instrumentation (July release) or SkyWalking → ARMS.

Python – automatic instrumentation (July release) or OpenTelemetry → ARMS.

Node.js, .NET, PHP, Erlang, C++, Swift, Ruby – manual or OpenTelemetry‑based instrumentation.

Key Capabilities

ARMS provides sampling strategies (fixed‑ratio, adaptive, error‑slow, custom), span compression, log‑trace correlation, resource‑level metrics (RED, JVM, thread‑pool, host), dimension drill‑down, continuous profiling, memory diagnostics, and online debugging (Arthas). The OpenTelemetry version offers the same data with self‑managed agents.
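As a rough illustration of how fixed-ratio and error/slow sampling can be combined, the sketch below makes a keep-or-drop decision per trace. It is not the ARMS implementation. The function names and the 500 ms slow threshold are assumptions chosen for the example. The ratio decision is derived from the trace ID, so every service that sees the same trace reaches the same answer.

```python
def keep_by_ratio(trace_id_hex, ratio):
    """Deterministic fixed-ratio decision based on the trace ID.

    The low 64 bits of the trace ID are treated as a uniform value in
    [0, 2**64) and compared against ratio * 2**64.
    """
    low64 = int(trace_id_hex, 16) & (2**64 - 1)
    return low64 < int(ratio * 2**64)


def should_keep(trace_id_hex, ratio, has_error, duration_ms,
                slow_threshold_ms=500.0):
    """Keep every error or slow trace; otherwise use fixed-ratio sampling.

    slow_threshold_ms is a hypothetical default, not an ARMS setting.
    """
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    return keep_by_ratio(trace_id_hex, ratio)
```

Note that the error/slow rule can only be evaluated once a request has finished, whereas the ratio rule can be applied as soon as the trace ID is known.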

Integration Scenarios

Different entry points (user terminals, gateways, backend services, dependent components) have specific guides. For example, Web/H5 uses user‑experience monitoring linked to traces; Android and iOS report via OpenTelemetry; ALB, MSE, API Gateway, ASM, and ACK Ingress each have a dedicated trace‑enable switch. Backend services can use ARMS native probes or OpenTelemetry agents, and over 100 plugins cover RPC, messaging, databases, and job scheduling.

Context Propagation and Compatibility

ARMS agents can act as protocol translators (e.g., Jaeger → Zipkin B3) to bridge mismatched stacks, ensuring trace IDs flow end‑to‑end. Dual‑probe coexistence allows gradual migration from legacy tracing frameworks to OpenTelemetry.
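The header mapping behind a Jaeger → B3 translation can be sketched as follows. This is an illustrative stand-alone function, not the ARMS agent's code, and `jaeger_to_b3` is a hypothetical name. It assumes the Jaeger `uber-trace-id` value has the form `trace-id:span-id:parent-span-id:flags` and is not URL-encoded, and it emits the multi-header B3 format.

```python
def jaeger_to_b3(uber_trace_id):
    """Map a Jaeger uber-trace-id header value to B3 multi-headers."""
    trace_id, span_id, parent_id, flags = uber_trace_id.split(":")
    flag_bits = int(flags, 16)

    # Jaeger may omit leading zeros; B3 expects 16 or 32 lower-hex characters.
    trace_id = trace_id.lower().zfill(32 if len(trace_id) > 16 else 16)
    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id.lower().zfill(16),
    }

    # A parent ID of 0 marks a root span, so no parent header is emitted.
    if int(parent_id, 16) != 0:
        headers["X-B3-ParentSpanId"] = parent_id.lower().zfill(16)

    if flag_bits & 0x02:
        # Jaeger debug bit: B3 expresses this as X-B3-Flags: 1,
        # which implies sampling, so X-B3-Sampled is omitted.
        headers["X-B3-Flags"] = "1"
    else:
        headers["X-B3-Sampled"] = "1" if flag_bits & 0x01 else "0"
    return headers
```

For example, `jaeger_to_b3("4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:0:1")` yields a B3 header set carrying the same trace ID and span ID with `X-B3-Sampled` set to `1`, so a downstream service that only understands B3 continues the same trace.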

Behavior Guidelines and Best Practices

Beyond basic instrumentation, organizations should define sampling policies, tag traffic, and correlate traces with metrics, logs, and events. OpenTelemetry best‑practice guides (code, docs, videos) are provided on GitHub to help developers handle async context propagation, span filtering, header injection, and more.
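Async context propagation and header injection can be sketched with the Python standard library alone. The example below is a simplified stand-in for what an OpenTelemetry SDK does and is not taken from the guides mentioned above. All function names are hypothetical, and the trailing `01` (sampled) flag is hard-coded as an assumption. Each asyncio task keeps its own trace context in a `contextvars.ContextVar`, and the outgoing W3C `traceparent` header is built from it.

```python
import asyncio
import contextvars

# Hypothetical holder for the active (trace_id, span_id) pair.
_current = contextvars.ContextVar("current_trace", default=None)


def inject_traceparent(headers):
    """Add a W3C traceparent header if a trace context is active."""
    ctx = _current.get()
    if ctx is not None:
        trace_id, span_id = ctx
        # Format: version-traceid-parentid-flags; "01" (sampled) is assumed.
        headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers


async def call_downstream():
    """Stand-in for an outgoing HTTP call made after an await point."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return inject_traceparent({})


async def handle_request(trace_id, span_id):
    """Set the context for this request, then call a downstream service."""
    _current.set((trace_id, span_id))
    return await call_downstream()


async def main():
    # gather() runs each coroutine in its own task with its own context copy,
    # so concurrent requests do not overwrite each other's trace context.
    return await asyncio.gather(
        handle_request("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"),
        handle_request("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331"),
    )
```

Running `asyncio.run(main())` returns two header dictionaries, each carrying the trace ID of its own request. A plain global variable in place of the `ContextVar` would let the two concurrent requests overwrite each other's context, which is the kind of async bug that produces broken or cross-linked traces.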

Future Outlook

Alibaba Cloud plans to expand protocol support, enrich the tracing ecosystem (full‑link load testing, gray releases, architecture awareness, root‑cause analysis), and apply tracing to large‑model (LLM) workloads. The upcoming LLM Trace service (May 2024) will visualize model training and inference pipelines to diagnose hallucinations and performance issues.
