How to Build End‑to‑End Observability for Large‑Model Applications on Alibaba Cloud

This guide explains how to design and implement a complete observability solution for large‑model AI services on Alibaba Cloud, covering architecture, core metrics, logging standards, demo code, log collection, dashboard design, alerting, monitoring tools, troubleshooting SOPs, and recovery procedures.

Alibaba Cloud Developer

Sep 12, 2025

Background

Large‑model technology has rapidly matured and is being deployed across industries. As these models are integrated into applications, establishing an end‑to‑end observability system becomes increasingly critical.

Overall Framework

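The observability solution for large‑model applications extends traditional monitoring with model‑specific characteristics such as token usage and per‑model rate limits. The overall framework follows Alibaba's 1‑5‑10 methodology, commonly summarized as detecting a fault within 1 minute, locating it within 5 minutes, and recovering within 10 minutes.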

Core Metrics

Observability metrics fall into three categories: availability, performance, and business feedback. Key indicators include the resource water level (how much of the available quota is in use, measured as queries per minute (QPM) and token usage), the analysis dimensions (application, module, model, workspace), and the user negative‑feedback rate.

Resource water level: Monitor QPM and token usage. Bailian currently does not support alerts on these, so users must log the values themselves and aggregate them.

Analysis dimensions (application, module): Log these fields so that issues can be attributed to the right application or module when several share the same cloud account.

Analysis dimensions (model, workspace): Bailian applies flow control at the model‑plus‑workspace level, so log both fields to see which limit a given call counts against.

User negative‑feedback rate: The share of responses that users rate negatively. Like comparable e‑commerce metrics, it surfaces user‑experience problems.
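Because QPM and token usage have to be aggregated by hand, a small in‑process aggregator can make the idea concrete. The following is a minimal sketch, not part of the original solution: the class name `WaterLevelAggregator` and its methods are hypothetical, and it assumes model and workspace names never contain the `|` character. It buckets calls by minute and by model‑plus‑workspace, matching the granularity at which Bailian applies flow control.

```java
import java.util.HashMap;
import java.util.Map;

/** Aggregates call counts and token usage per minute for each model + workspace pair. */
final class WaterLevelAggregator {

    /** Running totals for one (minute, model, workspace) bucket. */
    static final class Bucket {
        long calls;   // calls in this minute, i.e. the QPM for the bucket
        long tokens;  // total tokens consumed in this minute
    }

    private final Map<String, Bucket> buckets = new HashMap<>();

    // Builds the bucket key; assumes model and workspace names never contain '|'.
    private static String key(long epochMillis, String model, String workspace) {
        long minute = epochMillis / 60_000L; // truncate the timestamp to its minute
        return minute + "|" + model + "|" + workspace;
    }

    /** Records one model call together with the total tokens it consumed. */
    void record(long epochMillis, String model, String workspace, long totalTokens) {
        Bucket b = buckets.computeIfAbsent(key(epochMillis, model, workspace), k -> new Bucket());
        b.calls++;
        b.tokens += totalTokens;
    }

    /** Calls recorded in the minute containing epochMillis, or 0 if none. */
    long qpm(long epochMillis, String model, String workspace) {
        Bucket b = buckets.get(key(epochMillis, model, workspace));
        return b == null ? 0 : b.calls;
    }

    /** Tokens recorded in the minute containing epochMillis, or 0 if none. */
    long tokensPerMinute(long epochMillis, String model, String workspace) {
        Bucket b = buckets.get(key(epochMillis, model, workspace));
        return b == null ? 0 : b.tokens;
    }
}
```

In practice the same grouping would more likely be expressed as a query over the collected logs; the in‑memory version only illustrates which fields form the grouping key.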

Findings (1)

Observability for large models currently focuses on monitoring and alerting, split into business monitoring and cloud‑product monitoring.

Business Monitoring

Log custom information during model calls (prompt, model name, etc.) and use Log Service, QuickBI, DataV, or CloudMonitor to build dashboards and custom alerts.

Cloud‑Product Monitoring

Leverage Bailian's model and application observability modules together with CloudMonitor and ARMS to provide standard monitoring capabilities.

Logging Standards

Log Printing

Log Printing Specification

Key fields to log before and after model invocation include call time, prompt, model, workspace, request ID, status code, duration, input/output tokens, error code, and error message.

Example Logback pattern:

<property name="LOG_PATTERN" value="%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p [traceId:%X{traceId}] [%c{3}#%method():%L] [%thread] - %m%n" />

Example Java code:

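import java.util.Arrays;
import com.alibaba.dashscope.aigc.generation.Generation;
import com.alibaba.dashscope.aigc.generation.GenerationParam;
import com.alibaba.dashscope.aigc.generation.GenerationResult;
import com.alibaba.dashscope.common.Message;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;

// Body of a service method. app, module, model, workspace, prompt, logger,
// and the JSON helper are assumed to be provided by the enclosing class.
long t1 = System.currentTimeMillis();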
try {
    Generation gen = new Generation();
    Message systemMsg = Message.builder()
        .role(Role.SYSTEM.getValue())
        .content("You are a helpful assistant.")
        .build();
    Message userMsg = Message.builder()
        .role(Role.USER.getValue())
        .content(prompt)
        .build();
    GenerationParam param = GenerationParam.builder()
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .model(model)
        .workspace(workspace)
        .messages(Arrays.asList(systemMsg, userMsg))
        .resultFormat(GenerationParam.ResultFormat.MESSAGE)
        .build();
    logger.debug("prompt:{}", prompt);
    GenerationResult result = gen.call(param);
    long t2 = System.currentTimeMillis();
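    // Field order: app, module, model, workspace, requestId, statusCode, durationMs,
    // inputTokens, outputTokens, totalTokens, errorCode, errorMessage.
    logger.info("{},{},{},{},{},{},{},{},{},{},{},{}",
        app, module, model, workspace, result.getRequestId(), 200,
        t2 - t1, result.getUsage().getInputTokens(),
        result.getUsage().getOutputTokens(), result.getUsage().getTotalTokens(), "", "");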
    logger.debug("result:{}", JSON.toJSONString(result));
    return result.getOutput().getChoices().get(0).getMessage().getContent();
} catch (ApiException e) {
    // The service returned an error: log its status code, error code, and message.
    long t2 = System.currentTimeMillis();
    logger.error("{},{},{},{},{},{},{},{},{},{},{},{}",
        app, module, model, workspace, e.getStatus().getRequestId(),
        e.getStatus().getStatusCode(), t2 - t1, 0, 0, 0,
        e.getStatus().getCode(), e.getStatus().getMessage());
    // Pass e as the cause so the original stack trace is preserved.
    throw new RuntimeException(e.getLocalizedMessage(), e);
} catch (InputRequiredException e) {
    // No request ID or status code is available here, so both fields are logged empty.
    long t2 = System.currentTimeMillis();
    logger.error("{},{},{},{},{},{},{},{},{},{},{},{}",
        app, module, model, workspace, "", "", t2 - t1, 0, 0, 0,
        "InputRequired", e.getLocalizedMessage());
    throw new RuntimeException(e.getLocalizedMessage(), e);
} catch (NoApiKeyException e) {
    // No request ID or status code is available here either.
    long t2 = System.currentTimeMillis();
    logger.error("{},{},{},{},{},{},{},{},{},{},{},{}",
        app, module, model, workspace, "", "", t2 - t1, 0, 0, 0,
        "NoApiKey", e.getLocalizedMessage());
    throw new RuntimeException(e.getLocalizedMessage(), e);
}

Resulting log entries:

2025-07-09 14:23:24.748 INFO  [traceId:41ea65e3-0b4c-4b65-8567-325a6128eddf] [c.a.c.s.i.ChatServiceImpl#chat():67] [http-nio-8080-exec-1] - bailian-log,Chat,qwen-plus,llm-6qtsrlugmu6wanfs,6c958db4-d489-9de8-b5d4-9fc872f24ce2,200,1141,24,9,33,,

2025-07-09 08:26:33.833 ERROR  [traceId:63384a39-bbfe-489b-a290-6635975fc633] [c.a.c.s.i.ChatServiceImpl#chat():72] [http-nio-8080-exec-3] - bailian-log,Chat,qwen-plus,llm-6qtsrlugmu6wanfs,83a0cb5d-07f9-9fd2-9985-1c05b339f5d5,401,38,0,0,0,InvalidApiKey,Invalid API-key provided.
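To aggregate these entries, the comma‑separated message has to be split back into its twelve fields. The sketch below shows one way to do that in Java. It is illustrative only: the class `LlmCallRecord` is hypothetical, and it assumes the message starts after the first `] - ` in the line and that only the last field (the error message) may itself contain commas.

```java
/** One parsed model-call log entry; field order mirrors the logger calls shown above. */
final class LlmCallRecord {
    final String app, module, model, workspace, requestId;
    final int statusCode;   // -1 when the log field is empty
    final long durationMs;
    final int inputTokens, outputTokens, totalTokens;
    final String errorCode, errorMessage;

    private LlmCallRecord(String[] f) {
        app = f[0];
        module = f[1];
        model = f[2];
        workspace = f[3];
        requestId = f[4];
        statusCode = toInt(f[5]);
        durationMs = Long.parseLong(f[6]);
        inputTokens = toInt(f[7]);
        outputTokens = toInt(f[8]);
        totalTokens = toInt(f[9]);
        errorCode = f[10];
        errorMessage = f[11];
    }

    /** Parses a full log line in the pattern shown above. */
    static LlmCallRecord parse(String line) {
        int start = line.indexOf("] - ");
        if (start < 0) {
            throw new IllegalArgumentException("unexpected log format");
        }
        // A limit of 12 keeps trailing empty fields and leaves any commas
        // inside the final error message untouched.
        String[] fields = line.substring(start + 4).split(",", 12);
        if (fields.length != 12) {
            throw new IllegalArgumentException("expected 12 fields, got " + fields.length);
        }
        return new LlmCallRecord(fields);
    }

    // Empty numeric fields (for example, a missing status code) map to -1.
    private static int toInt(String s) {
        return s.isEmpty() ? -1 : Integer.parseInt(s);
    }
}
```

If the application or module names could contain commas, a delimiter‑safe format such as JSON would be a more robust choice than positional CSV.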

Log Collection & Storage

Use Alibaba Cloud Log Service (SLS) for log ingestion. Logs can be uploaded via Logtail parsing or SDK/Log4j/Logback plugins. Create a Logstore to store logs; the demo uses auto‑generated indexes.

Observability Dashboard

The dashboard addresses two main questions: (1) What is the current usage and are there any issues? (2) If problems exist, where do they originate?

Key Dashboard Indicators

Business Calls: Total calls, error distribution, latency.

Resource Water Level: QPM and token usage, with overall, core‑model, and core‑model‑workspace views.

Business Feedback: User negative‑feedback rate and trend analysis.

The original article illustrates these indicators with dashboard screenshots covering call volume, error codes, latency, and model‑level drill‑downs.
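The business‑call indicators reduce to a few simple aggregations over the logged fields. As a rough sketch, assuming error codes and durations have already been extracted from the logs, the hypothetical helper `DashboardStats` below computes an error distribution and a nearest‑rank latency percentile. A real dashboard would normally run equivalent queries in Log Service rather than in application code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Small aggregations behind the "Business Calls" dashboard indicators. */
final class DashboardStats {

    /** Counts calls per error code; successful calls (empty code) are grouped under "OK". */
    static Map<String, Long> errorDistribution(List<String> errorCodes) {
        Map<String, Long> counts = new TreeMap<>();
        for (String code : errorCodes) {
            String key = (code == null || code.isEmpty()) ? "OK" : code;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    /** Nearest-rank percentile of call durations in milliseconds, for p in (0, 100]. */
    static long percentile(List<Long> durationsMs, double p) {
        if (durationsMs.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        List<Long> sorted = new ArrayList<>(durationsMs);
        Collections.sort(sorted);
        // Nearest-rank: the smallest value with at least p% of samples at or below it.
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

Percentiles such as p95 tend to be more informative than averages for model calls, because a small number of very slow responses can hide behind a healthy mean.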

Alerting

Configure alerts directly from the Log Service dashboard. Recommended alert metrics include failure count / success rate, response time, and resource water level (QPM / token usage). Alerts can be scoped to specific models, applications, or workspaces.
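The three recommended alert conditions can be written down as plain rules. The sketch below is only an illustration of that logic, not the Log Service alert configuration itself: the class `AlertEvaluator`, the alert names, and all threshold values are hypothetical and would need tuning per model, application, or workspace.

```java
import java.util.ArrayList;
import java.util.List;

/** Evaluates the recommended alert conditions for one time window and scope. */
final class AlertEvaluator {
    // Hypothetical thresholds, chosen only for illustration.
    static final double MIN_SUCCESS_RATE = 0.99;
    static final long MAX_AVG_LATENCY_MS = 5_000L;
    static final double MAX_QUOTA_UTILIZATION = 0.80;

    /**
     * Returns the names of the alerts that fire for the given window.
     *
     * @param total        total calls in the window
     * @param failed       failed calls in the window
     * @param avgLatencyMs average response time in milliseconds
     * @param qpm          observed queries per minute
     * @param qpmQuota     QPM limit for the model + workspace (0 if unknown)
     */
    static List<String> evaluate(long total, long failed, long avgLatencyMs,
                                 long qpm, long qpmQuota) {
        List<String> alerts = new ArrayList<>();
        if (total > 0) {
            double successRate = (double) (total - failed) / total;
            if (successRate < MIN_SUCCESS_RATE) {
                alerts.add("SUCCESS_RATE");
            }
        }
        if (avgLatencyMs > MAX_AVG_LATENCY_MS) {
            alerts.add("LATENCY");
        }
        // Water-level alert: fire before the quota is actually exhausted.
        if (qpmQuota > 0 && (double) qpm / qpmQuota > MAX_QUOTA_UTILIZATION) {
            alerts.add("QPM_WATER_LEVEL");
        }
        return alerts;
    }
}
```

Alerting on a fraction of the quota rather than on the quota itself leaves time to throttle or request an expansion before calls start failing.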

Monitoring Solutions

Bailian: Provides standard monitoring and alerting for model and application observability, including call counts, failures, average latency, and model‑level statistics.

CloudMonitor: Offers model‑centric metrics such as call volume, failure count, and average latency.

ARMS: Application‑level real‑time monitoring, tracing API call chains, success rates, and detailed request information.

All of these products are evolving continuously, so check for newly released monitoring features before building a custom equivalent.

Troubleshooting SOP

Typical steps: (1) Gather incident details from alerts or user reports. (2) Identify incident type (resource water‑level, error code, latency, etc.). (3) Use logs (TraceID) and core metrics to narrow scope. (4) Drill down by application, model, or workspace. (5) Launch appropriate emergency response. (6) Verify recovery.
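Steps (3) and (4) amount to ranking each dimension by how many failures it accounts for. A minimal sketch of that drill‑down follows, assuming the log entries have already been parsed into objects. The `DrillDown` class and its `Call` type are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Ranks the values of one dimension (application, model, workspace) by failure count. */
final class DrillDown {

    /** Minimal view of one logged call. */
    static final class Call {
        final String app, model, workspace;
        final boolean failed;

        Call(String app, String model, String workspace, boolean failed) {
            this.app = app;
            this.model = model;
            this.workspace = workspace;
            this.failed = failed;
        }
    }

    /** Returns dimension values sorted by failure count, highest first, ties broken by name. */
    static List<Map.Entry<String, Long>> failuresBy(List<Call> calls,
                                                    Function<Call, String> dimension) {
        Map<String, Long> counts = new HashMap<>();
        for (Call c : calls) {
            if (c.failed) {
                counts.merge(dimension.apply(c), 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return ranked;
    }
}
```

Calling `failuresBy(calls, c -> c.model)` and then `failuresBy(calls, c -> c.workspace)` narrows the scope one dimension at a time, which mirrors how the dashboard drill‑downs are used during an incident.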

Recovery

For rate‑limit incidents, possible actions include: (1) Apply client‑side throttling and retries. (2) Increase workspace quota in Bailian. (3) Submit an expansion request to Alibaba Cloud with UID, model name, and business impact details.
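For action (1), retries are usually paired with exponential backoff so that a throttled client does not add to the load. The sketch below covers only the retry half of that action, and it is an assumption‑laden illustration: `RetryingCaller` and `RateLimitedException` are hypothetical names, and the calling code is assumed to translate the service's rate‑limit error into that exception.

```java
import java.util.function.Supplier;

/** Retries a call with exponential backoff when it is rejected for rate limiting. */
final class RetryingCaller {

    /** Hypothetical marker exception the caller throws when the service reports throttling. */
    static final class RateLimitedException extends RuntimeException {
        RateLimitedException(String message) {
            super(message);
        }
    }

    /**
     * Invokes the call up to maxAttempts times, doubling the delay after each throttled attempt.
     * Other exceptions are not retried and propagate immediately.
     */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long baseDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = baseDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RateLimitedException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the throttling error
                }
                sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }
}
```

Adding random jitter to the delay is a common refinement when many clients share one quota, since it keeps their retries from arriving at the same moment.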

Diagnostic Tools

Combine Bailian, CloudMonitor, ARMS, and custom dashboards with user logs for comprehensive diagnosis. If issues cannot be resolved by the user, Alibaba Cloud support can provide additional backend logs and tools.
