How to Build Full‑Stack Observability for Dify LLM Apps Using Alibaba Cloud Monitoring
This guide explains how to achieve end‑to‑end observability for Dify low‑code LLM applications by combining Dify's built‑in monitoring, third‑party tracing services such as Langfuse, and Alibaba Cloud CloudMonitor with non‑intrusive Python and Go probes. It covers component‑level tracing, configuration steps, and trace linking for debugging and performance optimization.
Background and Challenges
Dify is a popular low‑code LLM application platform that integrates model support, prompt orchestration, RAG engines, workflow/agent frameworks, and plugin ecosystems. Production‑grade Agentic applications involve many dynamic elements—session history, memory handling, tool calls, knowledge‑base retrieval, model generation, script execution, and flow control—making their behavior highly unpredictable.
Observability is essential throughout the lifecycle of Agentic app development, debugging, operations, and iterative optimization. It must serve two perspectives: developers who need detailed workflow execution metrics and operators who monitor the Dify cluster, its components, and upstream/downstream dependencies.
Current Observability Gaps
Built‑in Application Monitoring: Collects execution details from Dify's engine and stores them in the Dify database. It offers tight integration but suffers from limited query capabilities (only by session ID or user ID) and performance degradation at scale due to heavy DB writes.
Third‑Party Tracing Services (e.g., Langfuse, LangSmith): Integrated via Dify's custom OpsTrace mechanism. They provide workflow‑level trace data but lack full‑chain visibility, have coarse data granularity, and cannot be easily linked to Dify's internal traces.
Native OpenTelemetry (OTel) Support: Dify ships with OTel instrumentation for Flask, HTTP, Redis, DB, and Celery, but it only instruments framework layers, omits internal workflow logic, and cannot be correlated with third‑party traces.
Full‑Panorama Observability Solution
To bridge these gaps, a combined solution is proposed:
Deploy non‑intrusive Python and Go probes that automatically instrument Dify's execution engine, plugin daemon, sandbox, and worker processes.
Enable Dify's official cloud‑monitoring integration (Langfuse/CloudMonitor) to capture LLM‑level workflow traces.
Use Alibaba Cloud's Trace Link feature to associate workflow traces with infrastructure traces, achieving end‑to‑end traceability.
Key Components Monitored
API service (Flask/HTTP)
Plugin daemon (Go)
Sandbox (Python/Node.js)
Worker (Celery)
Nginx gateway
Step‑by‑Step Integration
1. Prepare Alibaba Cloud Credentials
Obtain a LicenseKey and Endpoint from the CloudMonitor 2.0 console (or ARMS for older versions). Record the values for later use.
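The LicenseKey (token) is embedded directly in the OTLP endpoint URLs used in the next step. A small sketch of how those URLs are assembled (the region and token values are placeholders; use the values from your own CloudMonitor 2.0 console):

```python
# Sketch: assemble the ARMS OTLP endpoints from a region and token.
# "cn-heyuan" and "your-token" are placeholders, matching the example
# environment variables shown in step 2.

def build_otlp_endpoints(region: str, token: str) -> dict:
    base = f"https://tracing-{region}.arms.aliyuncs.com/{token}/api/otlp"
    return {
        "traces": f"{base}/traces",
        "metrics": f"{base}/metrics",
    }

endpoints = build_otlp_endpoints("cn-heyuan", "your-token")
print(endpoints["traces"])
# https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/traces
```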
2. Install and Configure the Python Probe
# Ensure pip is available
python -m ensurepip --upgrade
# Uninstall conflicting OTel packages
pip3 uninstall -y opentelemetry-instrumentation-celery \
opentelemetry-instrumentation-flask \
opentelemetry-instrumentation-redis \
opentelemetry-instrumentation-requests \
opentelemetry-instrumentation-logging \
opentelemetry-instrumentation-wsgi \
opentelemetry-instrumentation-fastapi \
opentelemetry-instrumentation-asgi \
opentelemetry-instrumentation-sqlalchemy
# Install Alibaba bootstrap (includes the probe)
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ && pip3 config set install.trusted-host mirrors.aliyun.com
pip3 install aliyun-bootstrap && aliyun-bootstrap -a install
Modify the Dify API container's entrypoint to launch the application with the probe:
exec aliyun-instrument gunicorn \
--bind "${DIFY_BIND_ADDRESS:-0.0.0.0}:${DIFY_PORT:-5001}" \
--workers ${SERVER_WORKER_AMOUNT:-1} \
--worker-class ${SERVER_WORKER_CLASS:-gevent} \
--worker-connections ${SERVER_WORKER_CONNECTIONS:-10} \
--timeout ${GUNICORN_TIMEOUT:-200} \
app:app
Set the following environment variables (example values):
ALICLOUD_OTEL_ENDPOINT=https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/traces
ALICLOUD_OTEL_METRICS_ENDPOINT=https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/metrics
ALICLOUD_OTEL_SERVICE_NAME=dify-api
ALICLOUD_OTEL_SAMPLING_RATE=1
3. Deploy the Go Probe for Plugin‑Daemon and Sandbox
Rebuild the dify-plugin-daemon and dify-sandbox images with instgo instrumentation. Example Dockerfile snippet:
FROM golang:1.23-alpine AS builder
ARG VERSION=unknown
COPY . /app
WORKDIR /app
RUN wget "http://arms-apm-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/instgo/instgo-linux-amd64" -O instgo && chmod 777 instgo
RUN INSTGO_EXTRA_RULES="dify_python" ./instgo go build -o internal/core/runner/python/python.so -buildmode=c-shared -ldflags="-s -w" cmd/lib/python/main.go && \
./instgo go build -o internal/core/runner/nodejs/nodejs.so -buildmode=c-shared -ldflags="-s -w" cmd/lib/nodejs/main.go && \
./instgo go build -o main -ldflags="-s -w" cmd/server/main.go
Set OTel environment variables for the daemon and sandbox similarly to the API service, adjusting ALICLOUD_OTEL_SERVICE_NAME (e.g., dify-plugin-daemon or dify-sandbox).
4. Monitor the Worker (Celery)
Enable OTel in the Dify worker by adding:
ENABLE_OTEL=true
OTLP_TRACE_ENDPOINT=${GRPC_ENDPOINT}
OTLP_METRIC_ENDPOINT=${GRPC_ENDPOINT}
OTEL_SAMPLING_RATE=1
APPLICATION_NAME=dify-worker
After redeploying, the worker's task execution traces appear in CloudMonitor.
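Before redeploying, it can help to sanity‑check these settings. A minimal validation sketch (the keys mirror the worker variables above; the helper itself is hypothetical, not part of Dify):

```python
# Sketch: sanity-check the worker's OTel settings before redeploying.
# The keys mirror the Dify worker variables above; this helper is
# illustrative only, not part of Dify.

def check_otel_config(env: dict) -> list:
    problems = []
    if env.get("ENABLE_OTEL", "").lower() != "true":
        problems.append("ENABLE_OTEL must be 'true'")
    for key in ("OTLP_TRACE_ENDPOINT", "OTLP_METRIC_ENDPOINT"):
        if not env.get(key):
            problems.append(f"{key} is not set")
    try:
        rate = float(env.get("OTEL_SAMPLING_RATE", "1"))
        if not 0.0 <= rate <= 1.0:
            problems.append("OTEL_SAMPLING_RATE must be between 0 and 1")
    except ValueError:
        problems.append("OTEL_SAMPLING_RATE must be a number")
    return problems

env = {
    "ENABLE_OTEL": "true",
    "OTLP_TRACE_ENDPOINT": "grpc-endpoint:8090",   # placeholder endpoint
    "OTLP_METRIC_ENDPOINT": "grpc-endpoint:8090",  # placeholder endpoint
    "OTEL_SAMPLING_RATE": "1",
}
print(check_otel_config(env))  # [] means the settings look consistent
```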
5. Add Nginx Tracing (Optional)
Use the pre‑built nginx:*-otel image, load the ngx_otel_module, and configure the exporter:
load_module modules/ngx_otel_module.so;
http {
otel_exporter {
endpoint "${GRPC_ENDPOINT}";
header Authentication "${GRPC_TOKEN}";
}
otel_trace on;
otel_service_name ${SERVICE_NAME};
otel_trace_context propagate;
...
}
Linking LLM Traces with Infrastructure Traces
When a workflow execution is displayed in Dify's console, each span contains a Links tab. Selecting a span and opening the Links view reveals the associated infrastructure trace IDs. Clicking “Query associated call chain” jumps to the full‑chain trace that includes API, plugin daemon, sandbox, and worker spans, enabling root‑cause analysis across layers.
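Under the hood, this association relies on standard W3C trace context: a span link records another span's (trace_id, span_id) pair, as carried in the traceparent header that propagates between components. A stdlib‑only parsing sketch (the header value below is the canonical W3C example, not real Dify output):

```python
# Sketch: parse a W3C `traceparent` header into the (trace_id, span_id)
# pair that a span link records. The header value is illustrative.

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,
    }

link = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(link["trace_id"])  # 0af7651916cd43dd8448eb211c80319c
```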
Practical Debugging Example
In a scenario where a knowledge‑base retrieval returns empty results without errors, the steps are:
Locate the conversation ID in CloudMonitor and open the corresponding LLM trace.
From the span’s Links tab, follow the trace link to the infrastructure trace.
Filter out sqlalchemy and redis spans to focus on plugin execution.
Identify the failing span (e.g., dify_plugin_execute) and inspect its Details and Events for exception messages.
The trace reveals a mis‑configured Weaviate vector store, allowing rapid remediation.
Monitoring Plugin Performance
Each installed plugin is automatically observed as a separate application named {daemon}_plugin_{plugin}_{version}. The plugin’s execution span includes request parameters, response payloads, and latency, making it easy to spot slow or failing plugins.
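Following the naming pattern above, the per‑plugin application name can be derived mechanically (a sketch; the daemon, plugin, and version values are placeholders):

```python
# Sketch: derive the per-plugin application name from the pattern
# {daemon}_plugin_{plugin}_{version}. All values are placeholders.

def plugin_app_name(daemon: str, plugin: str, version: str) -> str:
    return f"{daemon}_plugin_{plugin}_{version}"

print(plugin_app_name("dify-plugin-daemon", "weaviate", "0.0.1"))
# dify-plugin-daemon_plugin_weaviate_0.0.1
```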
Summary
By combining Dify's native monitoring, third‑party tracing, and Alibaba Cloud's non‑intrusive Python/Go probes, developers and operators gain a unified, full‑stack observability platform. The solution supports trace linking, detailed per‑component metrics, and rapid root‑cause analysis for both workflow‑level and infrastructure‑level issues.
Alibaba Cloud Observability
Driving continuous progress in observability technology!