Building a Cloud‑Native Observability Stack for LLM Apps with Alibaba SLS
This article details the engineering practice of constructing a complete data infrastructure for large‑language‑model (LLM) applications using Alibaba Cloud SLS, covering the observability challenges of the Dify platform, the redesign of the architecture, and the resulting improvements in monitoring, diagnosis, and quality optimization.
1 Background: Observability Challenges in LLM Application Development
"Paper knowledge is shallow; true understanding comes from practice." – Lu You
In the rapid development of LLM applications, we often focus on model tuning and feature implementation while overlooking a critical issue: how to effectively monitor, diagnose, and optimize online LLM services.
This article shares the engineering practice of building the SLS SQL Copilot, demonstrating how to create a complete LLM‑application data infrastructure on Alibaba Cloud SLS.
1.1 Rise and Limitations of the Dify Platform
Dify is one of the most popular LLM‑application development platforms, offering visual workflow design and a rich component ecosystem, which greatly lowers the development threshold. Our team chose Dify to build the SQL Copilot, aiming to provide intelligent SQL generation and analysis services.
However, in production we discovered a serious problem: Dify's observability capabilities are severely lacking. As a team that builds observability solutions, we know that good monitoring and diagnosis are essential for service quality, yet Dify's built-in monitoring is too weak to "eat its own dog food".
1.2 Business Complexity of SQL Copilot
SQL Copilot has the following characteristics:
Multiple subsystems: requirement understanding, SQL generation, quality validation, SQL diagnosis, and more.
Complex workflows: nested Dify workflows can trigger multiple sub-processes per user request.
Dynamic content embedding: extensive use of context embedding and RAG, with prompts containing large amounts of dynamic context and knowledge-base retrieval.
High concurrency: must support a large number of real-time query requests.
These features expose the shortcomings of Dify’s existing monitoring and observability capabilities.
2 Pain-Point Analysis of the Dify Platform
2.1 Insufficient Query Capability
Dify only provides basic history queries by user ID or session ID, which falls far short of the multi-dimensional, keyword-based search needed for real-time troubleshooting.
2.2 Lack of Traceability
Dify’s execution logs are displayed as a flat list without hierarchical traceability, making it difficult to follow nested workflows or correlate upstream and downstream data.
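The hierarchy that the flat list hides can in principle be recovered once each log record carries a parent reference. The sketch below assumes hypothetical `span_id` and `parent_id` fields (with `None` at the root) emitted by the workflow engine; the field names are illustrative, not Dify's actual schema:

```python
from collections import defaultdict

def build_trace_tree(records):
    """Rebuild nested workflow structure from flat execution logs."""
    # Group records by their parent so each level can be attached in turn.
    children = defaultdict(list)
    for r in records:
        children[r["parent_id"]].append(r)

    def attach(parent_id):
        # Recursively attach each record's sub-workflows beneath it.
        return [
            {"span": r["span_id"], "name": r["name"], "children": attach(r["span_id"])}
            for r in children[parent_id]
        ]

    return attach(None)  # roots are records with no parent
```

With this structure, upstream and downstream steps of one user request can be read as a tree rather than correlated by hand across a flat list.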
2.3 Poor Content Presentation
The log UI renders large blocks of text (including prompts that can be thousands of characters) as unformatted strings, lacking syntax highlighting, pagination, and easy navigation, which hampers quick diagnosis.
3 Architectural Challenges
3.1 Architecture Scalability
Using PostgreSQL as the primary store works well for OLTP but struggles with massive log data in terms of storage capacity, compute resources, write/read performance, and online scaling.
Data volume grows continuously with user traffic, leading to storage and performance bottlenecks. Sudden traffic spikes can exhaust PG connections, causing latency or timeouts, while provisioning for peak load wastes 2-3× the cost of steady-state capacity.
3.2 Data Processing Capability
LLM applications generate massive natural‑language logs (user queries, prompts, model responses). PG’s full‑text search is limited for large‑scale, fuzzy, multi‑dimensional queries, and cannot efficiently handle JSON, Markdown, or other semi‑structured formats.
3.3 Data Diversity Requirements
Different teams (quality, product, algorithm, operations) need varied data access patterns, fine‑grained security, and flexible consumption methods, which PG cannot provide without heavy custom development.
4 Solution: Rebuilding the Data Infrastructure on SLS
4.1 Architecture Reconstruction – Dual Write
We keep the original PG write path unchanged for core business safety, and add an asynchronous write path to SLS. This decouples observability, analysis, and monitoring workloads from the online database, avoiding performance impact while gaining powerful query capabilities.
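The essence of this pattern is that the primary write stays synchronous while the observability write is buffered and best-effort. A minimal stdlib-only sketch, where `primary_write` and `sls_write` are placeholders for the real PG insert and SLS SDK call (the names are illustrative):

```python
import queue
import threading

class DualWriter:
    """Synchronous primary write plus asynchronous, best-effort log shipping."""

    def __init__(self, primary_write, sls_write, maxsize=10000):
        self.primary_write = primary_write
        self.sls_write = sls_write
        self.buffer = queue.Queue(maxsize=maxsize)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, record):
        # Core business path stays unchanged: the PG write is synchronous.
        self.primary_write(record)
        try:
            # Observability path must never block the request:
            # if the buffer is full, drop the record rather than stall.
            self.buffer.put_nowait(record)
        except queue.Full:
            pass

    def _drain(self):
        while True:
            record = self.buffer.get()
            try:
                self.sls_write(record)
            except Exception:
                # A failed log write must not affect the online path.
                pass
            finally:
                self.buffer.task_done()
```

In production the drain loop would batch records before shipping; the one-at-a-time version above just makes the decoupling explicit.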
4.2 SLS Core Capabilities
SLS offers unlimited elastic scaling, pay‑as‑you‑go resources, and native full‑text search, multi‑dimensional queries, and real‑time analytics. It supports various data formats (JSON, Markdown, Text, SQL) with built‑in rendering for readability.
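As one illustration of combining full-text search with SQL analytics in a single SLS query (search condition before the pipe, analysis statement after it), where the field names `workflow`, `sub_workflow`, and `latency_ms` are illustrative rather than our actual schema:

```sql
-- Failed SQL-generation requests, aggregated by sub-workflow
error and workflow:sql_generation |
SELECT sub_workflow,
       count(*) AS failures,
       approx_percentile(latency_ms, 0.95) AS p95_latency_ms
GROUP BY sub_workflow
ORDER BY failures DESC
LIMIT 10
```

Queries like this run directly against the log store with no pre-built indexes beyond field indexing, which is what makes ad-hoc troubleshooting practical.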
4.3 Practical Implementation – Lightweight Trace & Diagnosis System (OneDay + SLS)
Using the internal AICoding platform OneDay, we quickly generated a front‑end UI. The back‑end directly leverages SLS for storage, query, and analysis. The system provides:
End‑to‑end traceability by request ID, showing inputs, processing steps, and outputs.
Intelligent format detection and syntax highlighting for Markdown, JSON, SQL, and plain text.
Full prompt and user query display with navigation, zoom, and sectioning.
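The format-detection step can be approximated with simple heuristics. The sketch below is a minimal illustration of the idea, not the production rules; the keyword lists and ordering are assumptions:

```python
import json
import re

# Try the strictest format first (JSON), then cheap pattern checks.
SQL_KEYWORDS = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE|WITH|CREATE)\b",
                          re.IGNORECASE)
MARKDOWN_MARKERS = re.compile(r"^(#{1,6} |[-*] |\d+\. |```)", re.MULTILINE)

def detect_format(text: str) -> str:
    """Classify a log payload as 'json', 'sql', 'markdown', or 'text'."""
    try:
        json.loads(text)
        return "json"
    except (ValueError, TypeError):
        pass
    if SQL_KEYWORDS.match(text):
        return "sql"
    if MARKDOWN_MARKERS.search(text):
        return "markdown"
    return "text"
```

The detected format then selects the renderer (syntax highlighting for SQL and JSON, rich rendering for Markdown, plain display otherwise).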
5 Production Practice – Data‑Driven Quality Optimization Loop
5.1 Diagnosis & Quality Optimization Process
We built a closed‑loop quality system for SQL Copilot, integrating SLS for data collection, DingTalk AI Tables for manual labeling and analysis, and OneDay for rapid UI generation.
5.2 Key Metrics
SQL Executable Rate: proportion of generated SQL statements that are syntactically correct and can be executed.
Data Validity Rate: proportion of executed SQL statements that return meaningful results.
Response Time: total latency from user query to result.
User Satisfaction: composite score from user feedback and LLM-based reverse evaluation.
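Note that the first two metrics have different denominators: the executable rate is over all generated statements, while the validity rate is over executed ones only. A minimal sketch of the computation, where the record fields `executed_ok` and `rows_meaningful` are illustrative names for labels produced by validation and review:

```python
def quality_metrics(records):
    """Compute the two rate metrics from labeled query records."""
    total = len(records)
    executable = [r for r in records if r["executed_ok"]]
    valid = [r for r in executable if r["rows_meaningful"]]
    return {
        # Share of all generated SQL that parses and executes.
        "sql_executable_rate": len(executable) / total if total else 0.0,
        # Share of *executed* SQL whose results are meaningful.
        "data_validity_rate": len(valid) / len(executable) if executable else 0.0,
    }
```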
5.3 Effectiveness
After three months of deployment we achieved:
Problem localization time reduced from ~30 minutes to <5 minutes (≈ 83% improvement).
Root‑cause analysis accuracy increased from 60% to 90% (≈ 50% improvement).
Issue‑fix cycle shortened from 3 days to 1 day (≈ 67% improvement).
SQL executable rate rose from 75% to 85% (+10 pp).
User satisfaction improved from 3.2 to 4.3 (≈ 34% increase).
Monitoring coverage grew from 10% (manual) to 100% (full).
6 Experience & Future Outlook
6.1 AI‑Era Trends
Lightweight Architecture : Fully managed data infrastructure lets teams focus on core innovation; "idea + data + AI" becomes a fast, efficient development model.
Toolchain Integration : Combining SLS, OneDay, and DingTalk AI Tables showcases the power of multi‑tool collaboration, dramatically simplifying the path from idea to implementation.
6.2 Future Directions
Intelligent Upgrades : Leverage historical data to infer user intent, auto‑generate diagnostic reports, and provide optimization suggestions via LLMs.
Smart Root‑Cause Analysis : Reduce manual effort with AI‑driven analysis.
Ecosystem Collaboration : Open‑source the Dify dual‑write implementation, integrate Alibaba Cloud observability into more platforms, and cooperate with other LLM‑application providers.
Conclusion
Rapid LLM development reshapes work and life, but ensuring stability and quality is a major challenge. A robust, capable data infrastructure and observability layer are essential guarantees for successful LLM applications. By sharing our SLS‑based data‑infrastructure practice, we hope to inspire developers to prioritize solid foundations alongside innovative features.
"To do a good job, one must first sharpen the tools." – May every LLM application enjoy comprehensive observability and steady progress.
Author: Zhi Shao | Alibaba Cloud SLS Team
This article is based on real‑world experience from the Alibaba Cloud SLS SQL Copilot project.
Alibaba Cloud Observability
Driving continuous progress in observability technology!