
How to Build Full‑Stack Observability for Dify LLM Apps Using Alibaba Cloud Monitoring

This guide explains how to achieve end‑to‑end observability for Dify low‑code LLM applications by combining Dify's built‑in monitoring, third‑party tracing services such as Langfuse, and Alibaba Cloud CloudMonitor with non‑intrusive Python and Go probes. It covers component‑level tracing, configuration steps, and trace linking for debugging and performance optimization.


Background and Challenges

Dify is a popular low‑code LLM application platform that integrates model support, prompt orchestration, RAG engines, workflow/agent frameworks, and plugin ecosystems. Production‑grade Agentic applications involve many dynamic elements—session history, memory handling, tool calls, knowledge‑base retrieval, model generation, script execution, and flow control—making their behavior highly unpredictable.

Observability is essential throughout the lifecycle of Agentic app development, debugging, operations, and iterative optimization. It must serve two perspectives: developers who need detailed workflow execution metrics and operators who monitor the Dify cluster, its components, and upstream/downstream dependencies.

Current Observability Gaps

Built‑in Application Monitoring: Collects execution details from Dify's engine and stores them in the Dify database. It offers tight integration, but query capabilities are limited (session ID or user ID only) and heavy database writes degrade performance at scale.

Third‑Party Tracing Services (e.g., Langfuse, LangSmith): Integrated via Dify's custom OpsTrace mechanism. They provide workflow‑level trace data but lack full‑chain visibility, have coarse data granularity, and cannot be easily linked to Dify's internal traces.

Native OpenTelemetry (OTel) Support: Dify ships with OTel instrumentation for Flask, HTTP, Redis, the database, and Celery, but it only covers framework layers, omits internal workflow logic, and cannot be correlated with third‑party traces.

Full‑Panorama Observability Solution

To bridge these gaps, a combined solution is proposed:

Deploy non‑intrusive Python and Go probes that automatically instrument Dify's execution engine, plugin daemon, sandbox, and worker processes.

Enable Dify's official cloud‑monitoring integration (Langfuse/CloudMonitor) to capture LLM‑level workflow traces.

Use Alibaba Cloud's Trace Link feature to associate workflow traces with infrastructure traces, achieving end‑to‑end traceability.

Key Components Monitored

API service (Flask/HTTP)

Plugin daemon (Go)

Sandbox (Python/Node.js)

Worker (Celery)

Nginx gateway

Step‑by‑Step Integration

1. Prepare Alibaba Cloud Credentials

Obtain a LicenseKey and Endpoint from the CloudMonitor 2.0 console (or ARMS for older versions). Record the values for later use.
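One convenient way to record the values is an env file that later steps can source; the variable names below are illustrative placeholders, not names required by CloudMonitor:

```shell
# Store the console values in an env file for reuse in later steps.
# Variable names and values here are placeholders -- substitute your own.
cat > dify-observability.env <<'EOF'
ARMS_LICENSE_KEY=your-license-key
ARMS_ENDPOINT=https://tracing-cn-heyuan.arms.aliyuncs.com
EOF
```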

2. Install and Configure the Python Probe

# Ensure pip is available
python -m ensurepip --upgrade
# Uninstall conflicting OTel packages
pip3 uninstall -y opentelemetry-instrumentation-celery \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-redis \
    opentelemetry-instrumentation-requests \
    opentelemetry-instrumentation-logging \
    opentelemetry-instrumentation-wsgi \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-instrumentation-asgi \
    opentelemetry-instrumentation-sqlalchemy
# Install Alibaba bootstrap (includes the probe)
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ && pip3 config set install.trusted-host mirrors.aliyun.com
pip3 install aliyun-bootstrap && aliyun-bootstrap -a install

Modify the Dify API container's entrypoint to launch the application with the probe:

exec aliyun-instrument gunicorn \
    --bind "${DIFY_BIND_ADDRESS:-0.0.0.0}:${DIFY_PORT:-5001}" \
    --workers ${SERVER_WORKER_AMOUNT:-1} \
    --worker-class ${SERVER_WORKER_CLASS:-gevent} \
    --worker-connections ${SERVER_WORKER_CONNECTIONS:-10} \
    --timeout ${GUNICORN_TIMEOUT:-200} \
    app:app

Set the following environment variables (example values):

ALICLOUD_OTEL_ENDPOINT=https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/traces
ALICLOUD_OTEL_METRICS_ENDPOINT=https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/metrics
ALICLOUD_OTEL_SERVICE_NAME=dify-api
ALICLOUD_OTEL_SAMPLING_RATE=1
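If you run Dify with Docker Compose, the entrypoint override and the environment variables above can be combined in a compose override; the service name, image tag, and entrypoint script path below are illustrative, not part of Dify's official compose file:

```yaml
# Compose override sketch for the API service (names are illustrative).
services:
  api:
    image: langgenius/dify-api:latest
    # Script that ends with: exec aliyun-instrument gunicorn ... app:app
    entrypoint: ["/entrypoint-with-probe.sh"]
    environment:
      ALICLOUD_OTEL_ENDPOINT: https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/traces
      ALICLOUD_OTEL_METRICS_ENDPOINT: https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/metrics
      ALICLOUD_OTEL_SERVICE_NAME: dify-api
      ALICLOUD_OTEL_SAMPLING_RATE: "1"
```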

3. Deploy the Go Probe for Plugin‑Daemon and Sandbox

Rebuild the dify-plugin-daemon and dify-sandbox images with instgo instrumentation. Example Dockerfile snippet:

FROM golang:1.23-alpine AS builder
ARG VERSION=unknown
COPY . /app
WORKDIR /app
RUN wget "http://arms-apm-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/instgo/instgo-linux-amd64" -O instgo && chmod 777 instgo
RUN INSTGO_EXTRA_RULES="dify_python" ./instgo go build -o internal/core/runner/python/python.so -buildmode=c-shared -ldflags="-s -w" cmd/lib/python/main.go && \
    ./instgo go build -o internal/core/runner/nodejs/nodejs.so -buildmode=c-shared -ldflags="-s -w" cmd/lib/nodejs/main.go && \
    ./instgo go build -o main -ldflags="-s -w" cmd/server/main.go

Set OTel environment variables for the daemon and sandbox similarly to the API service, adjusting ALICLOUD_OTEL_SERVICE_NAME (e.g., dify-plugin-daemon or dify-sandbox).
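For example, the plugin daemon's variables mirror the API service's, with only the service name changed:

```
ALICLOUD_OTEL_ENDPOINT=https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/traces
ALICLOUD_OTEL_METRICS_ENDPOINT=https://tracing-cn-heyuan.arms.aliyuncs.com/your-token/api/otlp/metrics
ALICLOUD_OTEL_SERVICE_NAME=dify-plugin-daemon
ALICLOUD_OTEL_SAMPLING_RATE=1
```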

4. Monitor the Worker (Celery)

Enable OTel in the Dify worker by adding:

ENABLE_OTEL=true
OTLP_TRACE_ENDPOINT=${GRPC_ENDPOINT}
OTLP_METRIC_ENDPOINT=${GRPC_ENDPOINT}
OTEL_SAMPLING_RATE=1
APPLICATION_NAME=dify-worker

After redeploying, the worker's task execution traces appear in CloudMonitor.

5. Add Nginx Tracing (Optional)

Use the pre‑built nginx:*-otel image, load the ngx_otel_module, and configure the exporter:

load_module modules/ngx_otel_module.so;
http {
    otel_exporter {
        endpoint "${GRPC_ENDPOINT}";
        header Authentication "${GRPC_TOKEN}";
    }
    otel_trace on;
    otel_service_name ${SERVICE_NAME};
    otel_trace_context propagate;
    ...
}

Linking LLM Traces with Infrastructure Traces

When a workflow execution is displayed in Dify's console, each span contains a Links tab. Selecting a span and opening the Links view reveals the associated infrastructure trace IDs. Clicking “Query associated call chain” jumps to the full‑chain trace that includes API, plugin daemon, sandbox, and worker spans, enabling root‑cause analysis across layers.

Practical Debugging Example

In a scenario where a knowledge‑base retrieval returns empty results without errors, the steps are:

Locate the conversation ID in CloudMonitor and open the corresponding LLM trace.

From the span’s Links tab, follow the trace link to the infrastructure trace.

Filter out sqlalchemy and redis spans to focus on plugin execution.

Identify the failing span (e.g., dify_plugin_execute) and inspect its Details and Events for exception messages.

The trace reveals a mis‑configured Weaviate vector store, allowing rapid remediation.
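The filtering step above can be sketched as a small script. The span dictionaries are a simplified assumption for illustration, not CloudMonitor's actual trace schema or API:

```python
# Sketch: drop framework-layer spans (sqlalchemy/redis) from a trace so that
# plugin-execution spans stand out. Span structure is illustrative only.
FRAMEWORK_PREFIXES = ("sqlalchemy", "redis")

def plugin_spans(spans):
    """Return spans whose name does not start with a framework-layer prefix."""
    return [s for s in spans if not s["name"].startswith(FRAMEWORK_PREFIXES)]

# Hypothetical trace resembling the debugging scenario described above.
trace = [
    {"name": "sqlalchemy.query", "duration_ms": 3},
    {"name": "redis.get", "duration_ms": 1},
    {"name": "dify_plugin_execute", "duration_ms": 842,
     "events": ["WeaviateConnectionError"]},
]

for span in plugin_spans(trace):
    print(span["name"], span["duration_ms"])  # -> dify_plugin_execute 842
```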

Monitoring Plugin Performance

Each installed plugin is automatically observed as a separate application named {daemon}_plugin_{plugin}_{version}. The plugin’s execution span includes request parameters, response payloads, and latency, making it easy to spot slow or failing plugins.
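As a quick illustration of the naming scheme (the format string follows the pattern quoted above; the plugin and version values are made up):

```python
# Build the observed application name for an installed plugin, following the
# {daemon}_plugin_{plugin}_{version} pattern described above.
def plugin_app_name(daemon: str, plugin: str, version: str) -> str:
    return f"{daemon}_plugin_{plugin}_{version}"

print(plugin_app_name("dify-plugin-daemon", "weaviate", "0.1.3"))
# -> dify-plugin-daemon_plugin_weaviate_0.1.3
```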

Summary

By combining Dify's native monitoring, third‑party tracing, and Alibaba Cloud's non‑intrusive Python/Go probes, developers and operators gain a unified, full‑stack observability platform. The solution supports trace linking, detailed per‑component metrics, and rapid root‑cause analysis for both workflow‑level and infrastructure‑level issues.

Tags: Monitoring, Observability, OpenTelemetry, Dify, Alibaba Cloud, Python probe
Written by

Alibaba Cloud Observability

Driving continuous progress in observability technology!
