Building a Cloud‑Native Observability Stack for LLM Apps with Alibaba SLS
This article details the engineering practice of constructing a complete data infrastructure for large‑language‑model (LLM) applications using Alibaba Cloud SLS, covering the observability challenges of the Dify platform, the redesign of the architecture, and the resulting improvements in monitoring, diagnosis, and quality optimization.
1 Background: Observability Challenges in LLM Application Development
"Paper knowledge is shallow; true understanding comes from practice." – Lu You
In the rapid development of LLM applications, we often focus on model tuning and feature implementation while overlooking a critical issue: how to effectively monitor, diagnose, and optimize online LLM services.
This article shares the engineering practice of building the SLS SQL Copilot, demonstrating how to create a complete LLM‑application data infrastructure on Alibaba Cloud SLS.
1.1 Rise and Limitations of the Dify Platform
Dify is one of the most popular LLM‑application development platforms, offering visual workflow design and a rich component ecosystem, which greatly lowers the development threshold. Our team chose Dify to build the SQL Copilot, aiming to provide intelligent SQL generation and analysis services.
However, in production we discovered a serious problem: Dify's observability capabilities are severely lacking. As a team that builds observability solutions, we know that good monitoring and diagnosis are essential for service quality, yet Dify's built-in monitoring is too weak to "eat its own dog food".
1.2 Business Complexity of SQL Copilot
SQL Copilot has the following characteristics:
Multiple subsystems: requirement understanding, SQL generation, quality validation, SQL diagnosis, and more.
Complex workflows: nested Dify workflows can trigger multiple sub-processes per user request.
Dynamic content embedding: extensive use of context embedding and RAG, with prompts containing large amounts of dynamic context and knowledge-base retrieval.
High concurrency: must support a large number of real-time query requests.
These features expose the shortcomings of Dify’s existing monitoring and observability capabilities.
2 Pain-Point Analysis of the Dify Platform
2.1 Insufficient Query Capability
Dify only provides basic history queries by user ID or session ID, which falls far short of the multi-dimensional, keyword-based search needed for real-time troubleshooting.
2.2 Lack of Traceability
Dify’s execution logs are displayed as a flat list without hierarchical traceability, making it difficult to follow nested workflows or correlate upstream and downstream data.
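The hierarchy that the flat list hides can in principle be recovered once each log record carries a parent reference. The sketch below assumes hypothetical `span_id` and `parent_id` fields (with `None` at the root) emitted by the workflow engine; the field names are illustrative, not Dify's actual schema:

```python
from collections import defaultdict

def build_trace_tree(records):
    """Rebuild nested workflow structure from flat execution logs."""
    # Group records by their parent so each level can be attached in turn.
    children = defaultdict(list)
    for r in records:
        children[r["parent_id"]].append(r)

    def attach(parent_id):
        # Recursively attach each record's sub-workflows beneath it.
        return [
            {"span": r["span_id"], "name": r["name"], "children": attach(r["span_id"])}
            for r in children[parent_id]
        ]

    return attach(None)  # roots are records with no parent
```

With this structure, upstream and downstream steps of one user request can be read as a tree rather than correlated by hand across a flat list.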
2.3 Poor Content Presentation
The log UI renders large blocks of text (including prompts that can be thousands of characters) as unformatted strings, lacking syntax highlighting, pagination, and easy navigation, which hampers quick diagnosis.
3 Architectural Challenges
3.1 Architecture Scalability
Using PostgreSQL as the primary store works well for OLTP but struggles with massive log data in terms of storage capacity, compute resources, write/read performance, and online scaling.
Data volume grows continuously with user traffic, leading to storage and performance bottlenecks. Sudden traffic spikes can exhaust PG connections, causing latency or timeouts, while provisioning for peak load wastes 2-3× the cost of steady-state capacity.
3.2 Data Processing Capability
LLM applications generate massive natural‑language logs (user queries, prompts, model responses). PG’s full‑text search is limited for large‑scale, fuzzy, multi‑dimensional queries, and cannot efficiently handle JSON, Markdown, or other semi‑structured formats.
3.3 Data Diversity Requirements
Different teams (quality, product, algorithm, operations) need varied data access patterns, fine‑grained security, and flexible consumption methods, which PG cannot provide without heavy custom development.
4 Solution: Rebuilding the Data Infrastructure on SLS
4.1 Architecture Reconstruction – Dual Write
We keep the original PG write path unchanged for core business safety, and add an asynchronous write path to SLS. This decouples observability, analysis, and monitoring workloads from the online database, avoiding performance impact while gaining powerful query capabilities.
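The essence of this pattern is that the primary write stays synchronous while the observability write is buffered and best-effort. A minimal stdlib-only sketch, where `primary_write` and `sls_write` are placeholders for the real PG insert and SLS SDK call (the names are illustrative):

```python
import queue
import threading

class DualWriter:
    """Synchronous primary write plus asynchronous, best-effort log shipping."""

    def __init__(self, primary_write, sls_write, maxsize=10000):
        self.primary_write = primary_write
        self.sls_write = sls_write
        self.buffer = queue.Queue(maxsize=maxsize)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, record):
        # Core business path stays unchanged: the PG write is synchronous.
        self.primary_write(record)
        try:
            # Observability path must never block the request:
            # if the buffer is full, drop the record rather than stall.
            self.buffer.put_nowait(record)
        except queue.Full:
            pass

    def _drain(self):
        while True:
            record = self.buffer.get()
            try:
                self.sls_write(record)
            except Exception:
                # A failed log write must not affect the online path.
                pass
            finally:
                self.buffer.task_done()
```

In production the drain loop would batch records before shipping; the one-at-a-time version above just makes the decoupling explicit.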
4.2 SLS Core Capabilities
SLS offers unlimited elastic scaling, pay‑as‑you‑go resources, and native full‑text search, multi‑dimensional queries, and real‑time analytics. It supports various data formats (JSON, Markdown, Text, SQL) with built‑in rendering for readability.
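As one illustration of combining full-text search with SQL analytics in a single SLS query (search condition before the pipe, analysis statement after it), where the field names `workflow`, `sub_workflow`, and `latency_ms` are illustrative rather than our actual schema:

```sql
-- Failed SQL-generation requests, aggregated by sub-workflow
error and workflow:sql_generation |
SELECT sub_workflow,
       count(*) AS failures,
       approx_percentile(latency_ms, 0.95) AS p95_latency_ms
GROUP BY sub_workflow
ORDER BY failures DESC
LIMIT 10
```

Queries like this run directly against the log store with no pre-built indexes beyond field indexing, which is what makes ad-hoc troubleshooting practical.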
4.3 Practical Implementation – Lightweight Trace & Diagnosis System (OneDay + SLS)
Using the internal AICoding platform OneDay, we quickly generated a front‑end UI. The back‑end directly leverages SLS for storage, query, and analysis. The system provides:
End‑to‑end traceability by request ID, showing inputs, processing steps, and outputs.
Intelligent format detection and syntax highlighting for Markdown, JSON, SQL, and plain text.
Full prompt and user query display with navigation, zoom, and sectioning.
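The format-detection step can be approximated with simple heuristics. The sketch below is a minimal illustration of the idea, not the production rules; the keyword lists and ordering are assumptions:

```python
import json
import re

# Try the strictest format first (JSON), then cheap pattern checks.
SQL_KEYWORDS = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE|WITH|CREATE)\b",
                          re.IGNORECASE)
MARKDOWN_MARKERS = re.compile(r"^(#{1,6} |[-*] |\d+\. |```)", re.MULTILINE)

def detect_format(text: str) -> str:
    """Classify a log payload as 'json', 'sql', 'markdown', or 'text'."""
    try:
        json.loads(text)
        return "json"
    except (ValueError, TypeError):
        pass
    if SQL_KEYWORDS.match(text):
        return "sql"
    if MARKDOWN_MARKERS.search(text):
        return "markdown"
    return "text"
```

The detected format then selects the renderer (syntax highlighting for SQL and JSON, rich rendering for Markdown, plain display otherwise).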
5 Production Practice – Data‑Driven Quality Optimization Loop
5.1 Diagnosis & Quality Optimization Process
We built a closed‑loop quality system for SQL Copilot, integrating SLS for data collection, DingTalk AI Tables for manual labeling and analysis, and OneDay for rapid UI generation.
5.2 Key Metrics
SQL Executable Rate: proportion of generated SQL statements that are syntactically correct and can be executed.
Data Validity Rate: proportion of executed SQL statements that return meaningful results.
Response Time: total latency from user query to result.
User Satisfaction: composite score from user feedback and LLM-based reverse evaluation.
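Note that the first two metrics have different denominators: the executable rate is over all generated statements, while the validity rate is over executed ones only. A minimal sketch of the computation, where the record fields `executed_ok` and `rows_meaningful` are illustrative names for labels produced by validation and review:

```python
def quality_metrics(records):
    """Compute the two rate metrics from labeled query records."""
    total = len(records)
    executable = [r for r in records if r["executed_ok"]]
    valid = [r for r in executable if r["rows_meaningful"]]
    return {
        # Share of all generated SQL that parses and executes.
        "sql_executable_rate": len(executable) / total if total else 0.0,
        # Share of *executed* SQL whose results are meaningful.
        "data_validity_rate": len(valid) / len(executable) if executable else 0.0,
    }
```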
5.3 Effectiveness
After three months of deployment we achieved:
Problem localization time reduced from ~30 minutes to <5 minutes (≈ 83% improvement).
Root‑cause analysis accuracy increased from 60% to 90% (≈ 50% improvement).
Issue‑fix cycle shortened from 3 days to 1 day (≈ 67% improvement).
SQL executable rate rose from 75% to 85% (+10 pp).
User satisfaction improved from 3.2 to 4.3 (≈ 34% increase).
Monitoring coverage grew from 10% (manual) to 100% (full).
6 Experience & Future Outlook
6.1 AI‑Era Trends
Lightweight Architecture : Fully managed data infrastructure lets teams focus on core innovation; "idea + data + AI" becomes a fast, efficient development model.
Toolchain Integration : Combining SLS, OneDay, and DingTalk AI Tables showcases the power of multi‑tool collaboration, dramatically simplifying the path from idea to implementation.
6.2 Future Directions
Intelligent Upgrades : Leverage historical data to infer user intent, auto‑generate diagnostic reports, and provide optimization suggestions via LLMs.
Smart Root‑Cause Analysis : Reduce manual effort with AI‑driven analysis.
Ecosystem Collaboration : Open‑source the Dify dual‑write implementation, integrate Alibaba Cloud observability into more platforms, and cooperate with other LLM‑application providers.
Conclusion
Rapid LLM development reshapes work and life, but ensuring stability and quality is a major challenge. A robust, capable data infrastructure and observability layer are essential guarantees for successful LLM applications. By sharing our SLS‑based data‑infrastructure practice, we hope to inspire developers to prioritize solid foundations alongside innovative features.
"To do a good job, one must first sharpen the tools." – May every LLM application enjoy comprehensive observability and steady progress.
Author: Zhi Shao | Alibaba Cloud SLS Team
This article is based on real‑world experience from the Alibaba Cloud SLS SQL Copilot project.
Alibaba Cloud Observability
Driving continuous progress in observability technology!