How Alibaba Cloud Function Compute Uses OpenTelemetry for Full‑Stack Tracing
This article explains how Alibaba Cloud Function Compute upgraded its tracing capabilities from Jaeger 2.0 to the OpenTelemetry W3C standard, delivering end‑to‑end observability, transparent cold‑start analysis, cross‑environment context propagation, dynamic sampling, and AI‑assisted debugging for serverless workloads.
Background
Request‑level tracing is a core capability for diagnosing performance bottlenecks in distributed systems. Alibaba Cloud Function Compute (FC) has replaced its previous Jaeger 2.0 implementation with the OpenTelemetry W3C standard. This upgrade provides full‑path observability of function execution, turning the traditional serverless “black‑box” into a transparent, traceable component.
Key Features
FC system‑level span propagation: Critical lifecycle events of internal components (scheduler, cold‑start module, etc.) are emitted as OpenTelemetry spans, covering the entire function lifecycle from dispatch and initialization through execution to release.
Automatic stitching of business spans: User‑defined spans created inside function code are automatically linked to the FC system spans, producing an end‑to‑end trace that highlights hotspots such as cold‑start latency or resource contention.
Standard W3C header support: The implementation honors the traceparent, tracestate and baggage headers, so trace context is propagated intact to downstream services (databases, message queues, etc.).
Cross‑environment interoperability: Trace context can be passed across functions, services and even cloud providers, allowing seamless integration with existing OpenTelemetry tooling.
Dynamic cost control via sampling: Sampling rates are configurable per function (e.g., 1 % for routine monitoring, 100 % for incident investigation) to balance data volume against resource overhead.
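To make the per‑function sampling rates above concrete, here is a minimal sketch of a deterministic, trace‑ID‑based sampling decision. It uses only the Python standard library and is written in the spirit of ratio‑based samplers; it is not FC's actual sampler. The function names in SAMPLING_RATIOS and the helper should_sample are hypothetical, and only the 1 % and 100 % rates come from the text.

```python
# Sketch only: a ratio-based sampling decision keyed on the trace ID.
# This is an illustration, not Function Compute's implementation.

TRACE_ID_MASK = (1 << 64) - 1  # compare on the low 64 bits of the 128-bit trace ID

# Hypothetical per-function configuration (function names are made up).
SAMPLING_RATIOS = {
    "routine-monitoring-fn": 0.01,  # 1 % for routine monitoring
    "incident-fn": 1.0,             # 100 % during incident investigation
}


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Return True if the trace identified by trace_id_hex should be recorded.

    The decision depends only on the trace ID and the ratio, so every
    service that sees the same trace ID with the same ratio agrees on it.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & TRACE_ID_MASK) < bound


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    for name, ratio in SAMPLING_RATIOS.items():
        print(name, should_sample(trace_id, ratio))
```

Because the decision is a pure function of the trace ID, raising a function's ratio from 0.01 to 1.0 during an incident only adds traces; traces that were already sampled at the lower rate stay sampled.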
Usage Scenarios
After enabling tracing, developers can view function execution details in both the FC console and any OpenTelemetry‑compatible tracing UI. A typical example is a LangChain client invoking a Gaode weather service; the trace shows cold‑start time, SSE connections, message streams and agent calls.
Diagnostic steps:
Inspect the PrepareCode span to evaluate cold‑start duration; if it exceeds expectations, reduce the deployment package size.
When using custom runtimes or container images, monitor the RuntimeInitialization span; unusually long initialization indicates the need to optimise startup scripts.
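The two checks above can be sketched as a small script over exported span data. Only the span names PrepareCode and RuntimeInitialization come from the text; the Span structure, the millisecond thresholds, and the sample values are hypothetical and should be replaced with figures that match your own latency expectations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Simplified span record (hypothetical shape, not an FC export format)."""
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


# Assumed thresholds; the article does not specify concrete limits.
THRESHOLDS_MS = {"PrepareCode": 500.0, "RuntimeInitialization": 1000.0}

# Remediation hints taken from the diagnostic steps above.
ADVICE = {
    "PrepareCode": "reduce the deployment package size",
    "RuntimeInitialization": "optimise startup scripts",
}


def diagnose(spans: List[Span]) -> List[str]:
    """Return one finding per cold-start span that exceeds its threshold."""
    findings = []
    for span in spans:
        limit = THRESHOLDS_MS.get(span.name)
        if limit is not None and span.duration_ms > limit:
            findings.append(
                f"{span.name} took {span.duration_ms:.0f} ms "
                f"(limit {limit:.0f} ms): {ADVICE[span.name]}"
            )
    return findings


if __name__ == "__main__":
    # Made-up trace for illustration only.
    trace = [
        Span("PrepareCode", 0.0, 800.0),
        Span("RuntimeInitialization", 800.0, 1200.0),
        Span("Invoke", 1200.0, 1300.0),
    ]
    for line in diagnose(trace):
        print(line)
```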
AI‑Assisted Debugging
If an abnormal request is detected, the corresponding trace can be examined for error details. The AI Operations Assistant can further analyse the trace to pinpoint root causes, accelerating incident resolution.
LLM Application Monitoring
Installing the ARMS Python probe on Large Language Model (LLM) applications enables call‑chain analysis. The tracing UI displays spans for input processing, output generation, token consumption and other metrics, providing fine‑grained visibility into LLM workloads.
Effect Comparison
Before the upgrade, tracing relied on log aggregation, making it impossible to separate system latency from business latency and causing context loss across services.
After the upgrade, spans are visualised with clear segmentation, allowing precise bottleneck identification. W3C header propagation restores complete end‑to‑end trace continuity across heterogeneous environments.
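As an illustration of how W3C header propagation keeps a trace continuous, the following standard‑library sketch parses an incoming traceparent header and builds the header for an outgoing downstream call. It handles only version 00 of the format and ignores tracestate and baggage. The helper names parse_traceparent and child_traceparent are hypothetical; in practice an OpenTelemetry SDK propagator would normally do this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

# traceparent, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str     # 32 hex chars, shared by every span in the trace
    span_id: str      # 16 hex chars, identifies the calling span
    trace_flags: int  # bit 0 set means the caller sampled this trace


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the W3C Trace Context spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, int(flags, 16))


def child_traceparent(parent: SpanContext) -> str:
    """Build the header for a downstream call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)
    while span_id == "0" * 16:  # regenerate in the unlikely all-zero case
        span_id = secrets.token_hex(8)
    return f"00-{parent.trace_id}-{span_id}-{parent.trace_flags:02x}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    if ctx is not None:
        print(child_traceparent(ctx))
```

The trace ID and flags are carried over unchanged while the span ID is regenerated at each hop, which is what lets a tracing backend join spans from different services and providers into one trace.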
Summary
The deep integration of Function Compute with OpenTelemetry delivers full‑stack trace transparency, covering both system‑level and business‑level spans. Adhering to the unified W3C protocol eliminates data silos and keeps trace context consistent across environments. Dynamic sampling keeps observability costs under control, while AI‑assisted diagnostics and the ARMS Python probe extend troubleshooting capabilities for complex workloads such as LLM applications.
Alibaba Cloud Native