Enable Python Probe for LLM Observability on Alibaba Cloud ACK
This guide explains how to integrate Alibaba Cloud's Python probe into a Container Service for Kubernetes (ACK) environment to monitor large language model (LLM) applications. It covers prerequisites, installation steps, Dockerfile modifications, resource permissions, and sample Python code for both the server and client components.
Background
As large language model (LLM) technology matures, many enterprises embed LLMs into their products. However, the opaque internal mechanisms of LLMs pose risks and hinder deployment. Observability of LLMs provides essential data for explainability, bias detection, risk mitigation, and performance improvement.
Application Example
A company adds an intelligent Q&A feature to its product. The architecture consists of a client sending a request to a server, which calls a chatbot that performs Retrieval-Augmented Generation (RAG) before responding. To observe this LLM workflow, the company integrates Alibaba Cloud's Python probe.
Prerequisites
Create a Container Service for Kubernetes (ACK) cluster; dedicated, managed, and serverless clusters are all supported.
Create a namespace (e.g., arms-demo) following the namespace management guide.
Verify that your Python version and framework meet the probe compatibility requirements (a quick check is sketched below).
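A minimal sanity check, assuming you only need to confirm the interpreter version and that your framework imports (the 3.8 floor comes from the Compatibility section below):

import sys

# The probe requires Python >= 3.8 (see Compatibility).
assert sys.version_info >= (3, 8), f"Python {sys.version} is too old for the probe"

# Optional: confirm the framework you plan to instrument is importable.
import fastapi
print(f"Python {sys.version.split()[0]}, FastAPI {fastapi.__version__}")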
Step 1: Install ARMS Application Monitoring Component
Log in to the Container Service console.
Select the target cluster from the cluster list.
Navigate to Operations > Component Management and search for ack-onepilot (version ≥ 3.2.4).
Click Install on the ack-onepilot card.
Accept the default parameters and confirm the installation.
Step 2: Modify Dockerfile
Install the probe and run the instrumented command:

pip3 install aliyun-bootstrap
aliyun-bootstrap -a install
aliyun-instrument python app.py

Example Dockerfile (before):
# Use Python 3.10 base image
FROM docker.m.daocloud.io/python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ./app.py /app/app.py
EXPOSE 8000
CMD ["python","app.py"]

Example Dockerfile (after adding the probe):
# Use official Python 3.10 base image
FROM docker.m.daocloud.io/python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install aliyun python probe
RUN pip3 install aliyun-bootstrap && aliyun-bootstrap -a install
COPY ./app.py /app/app.py
EXPOSE 8000
CMD ["aliyun-instrument","python","app.py"]

Step 3: Grant ARMS Resource Access
If your cluster lacks an ARMS Addon Token (on ACK this typically appears as the addon.arms.token Secret in the kube-system namespace), grant ARMS permissions manually:
Log in to the Container Service console and open the target cluster.
Go to Cluster > Resources > Worker RAM Role and add a new authorization.
Select the AliyunARMSFullAccess permission (and AliyunSTSAssumeRoleAccess for dedicated clusters) and confirm.
After installing ack-onepilot, enter the AccessKey ID and AccessKey Secret of an account that has ARMS permissions on the component's configuration page.
Step 4: Enable ARMS Monitoring for Python Application
Add the following labels to the pod template in your Deployment YAML:
labels:
  aliyun.com/app-language: python    # Required for Python apps
  armsPilotAutoEnable: 'on'
  armsPilotCreateAppName: "<your-deployment-name>"

Deploy the updated resources and restart the pods.
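For orientation, a minimal Deployment sketch with these labels in place might look like this (the name, image, and port are placeholders, not values from this guide):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: arms-python-server            # placeholder
  namespace: arms-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: arms-python-server
  template:
    metadata:
      labels:
        app: arms-python-server
        aliyun.com/app-language: python   # Required for Python apps
        armsPilotAutoEnable: 'on'
        armsPilotCreateAppName: "arms-python-server"
    spec:
      containers:
        - name: app
          image: <your-registry>/arms-python-server:latest   # placeholder
          ports:
            - containerPort: 8000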
Monitoring Results
After the container redeploys (about 1–2 minutes), view metrics in the ARMS console under Application Monitoring > Application List. The console provides call-chain analysis, trace views for microservice and LLM scenarios, and customizable alerts.
Compatibility
The probe supports Python >= 3.8 and works with common frameworks such as FastAPI.
Sample Code
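Both samples below assume their dependencies are listed in requirements.txt; an illustrative set (derived from the imports, not from this guide) might be:

fastapi
uvicorn
requests
opentelemetry-api
langchain

Note that in newer LangChain releases FakeListLLM moved to langchain_community.llms.fake, so pin a LangChain version that matches the import used in the client sample.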
arms-python-server
import os
from concurrent import futures
from logging import getLogger

import requests
import uvicorn
from fastapi import FastAPI
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
_logger = getLogger(__name__)

app = FastAPI()

def call_requests():
    # Helper that calls CALL_URL (not wired to a route in this sample).
    call_url = os.getenv('CALL_URL') or 'https://www.aliyun.com'
    response = requests.get(call_url)
    response.raise_for_status()
    print(f"response code: {response.status_code} - {response.text}")

def call_client():
    # Downstream call to the client service; CLIENT_URL overrides the default.
    _logger.warning("calling client")
    call_url = os.getenv('CLIENT_URL') or 'https://www.aliyun.com'
    response = requests.get(call_url)
    return response.text

@app.get("/")
async def call():
    # Manual spans: a parent span wrapping a thread-pool span around the call.
    with tracer.start_as_current_span("parent") as rootSpan:
        rootSpan.set_attribute("parent.value", "parent")
        with futures.ThreadPoolExecutor(max_workers=2) as executor:
            with tracer.start_as_current_span("ThreadPoolExecutorTest") as span:
                span.set_attribute("future.value", "ThreadPoolExecutorTest")
                future = executor.submit(call_client)
                future.result()
    return {"data": "call"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
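A quick way to smoke-test the server sample locally (the port matches the uvicorn call above; running it under aliyun-instrument is assumed from Step 2):

import requests

# One request to the root endpoint exercises the parent span, the thread-pool
# span, and the outbound call to CLIENT_URL (or the www.aliyun.com fallback).
resp = requests.get("http://localhost:8000/")
print(resp.status_code, resp.json())  # expected: 200 {'data': 'call'}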
arms-python-client

import uvicorn
from fastapi import FastAPI
from langchain.chains import LLMChain
from langchain.llms.fake import FakeListLLM
from langchain.prompts import PromptTemplate

app = FastAPI()

# FakeListLLM returns canned responses, so the sample needs no real model.
llm = FakeListLLM(responses=["I'll callback later.", "You 'console' them!"])

template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"

@app.get("/")
def call_langchain():
    # The probe captures this chain invocation as an LLM trace.
    res = llm_chain.run(question)
    return {"data": res}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
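To trace the full server-to-client path, point the server's CLIENT_URL environment variable at the client. In the server Deployment this could look like the following (the Service name arms-python-client is a placeholder assumption):

env:
  - name: CLIENT_URL
    value: "http://arms-python-client:8000"

Each request to the server's / endpoint then fans out to the client's LangChain chain, and the probe records the whole path as a single trace.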
Related Links

https://help.aliyun.com/zh/arms/application-monitoring/user-guide/start-monitoring-python-applications/
https://help.aliyun.com/zh/arms/application-monitoring/developer-reference/python-probe-compatibility-requirements
https://help.aliyun.com/zh/arms/application-monitoring/user-guide/create-and-manage-alert-rules-in-application-monitoring-new/
Alibaba Cloud Observability
Driving continuous progress in observability technology!