Designing a Multi‑Cloud Intelligent Monitoring Platform at Huolala: Architecture, Practices, and Future Directions
This article details Huolala's one‑stop monitoring platform called Monitor, covering its multi‑cloud architecture, data collection pipelines, real‑time business monitoring, unified alarm handling, and future AI‑driven enhancements, while sharing concrete metrics, incident case studies, and practical implementation steps for large‑scale observability.
One‑Stop Monitoring Platform in a Multi‑Cloud Architecture
Huolala's Monitor platform aggregates metrics, traces, and logs from applications, machines, containers, middleware (DB, Redis), cloud services, and network devices into a single UI accessible via PC and mobile. Compared with the previous year, most indicators have doubled, while daily alarm volume was reduced to about 4,000 after noise‑reduction work.
Metrics are collected through the Prometheus ecosystem, enhanced by a custom Transformation component for filtering and enrichment, and persisted in VictoriaMetrics. Traces are captured via SkyWalking, which uses bytecode instrumentation to cover a wide range of open‑source and internal middleware. Logs are gathered by Filebeat, processed by a proprietary LogProxy, and stored in Elasticsearch and ClickHouse.
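The filtering and enrichment step can be pictured with a minimal sketch. The article does not describe how the Transformation component is implemented, so the `Sample` type, the drop prefixes, and the extra labels below are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One metric sample: name, value, and a label set."""
    name: str
    value: float
    labels: dict = field(default_factory=dict)


# Hypothetical rules, not taken from the article.
DROP_PREFIXES = ("go_gc_", "process_")   # series to filter out
EXTRA_LABELS = {"env": "prod"}           # labels to attach to every sample


def transform(samples, drop_prefixes=DROP_PREFIXES, extra_labels=EXTRA_LABELS):
    """Drop unwanted series, then add labels that are not already set."""
    kept = []
    for s in samples:
        if s.name.startswith(drop_prefixes):
            continue  # filtering: skip series matching a drop prefix
        merged = {**extra_labels, **s.labels}  # enrichment: existing labels win
        kept.append(Sample(s.name, s.value, merged))
    return kept
```

Filtering before storage keeps low-value series out of the time-series database, while enrichment attaches context that downstream dashboards and alarms can group by.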
The platform also exposes a simple UI and an Open API for third‑party integration.
Multi‑Cloud Monitoring and High‑Availability Design
An incident in April 2022, in which an entire rack of machines in a data center failed, highlighted the need for cloud‑aware monitoring and zone‑level observability. Huolala introduced a "multi‑availability‑zone" design: a Prometheus proxy adds a zone label to every collected metric, so failures can be detected per zone without modifying upstream exporters.
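A rough sketch of label injection at a proxy follows. It rewrites Prometheus text exposition lines to carry a `zone` label. The function name is hypothetical, and the sketch assumes the zone value needs no escaping and that samples follow the plain `name{labels} value` form; the article does not say how the actual proxy works.

```python
def add_zone_label(exposition: str, zone: str) -> str:
    """Insert zone="<zone>" into every sample line of a text exposition."""
    label = f'zone="{zone}"'
    out = []
    for line in exposition.splitlines():
        # Comments (# HELP / # TYPE) and blank lines pass through unchanged.
        if not line.strip() or line.startswith("#"):
            out.append(line)
            continue
        brace = line.find("{")
        space = line.find(" ")
        if brace != -1 and (space == -1 or brace < space):
            # The sample already has a label set: prepend the zone label.
            head, rest = line[:brace + 1], line[brace + 1:]
            sep = "" if rest.startswith("}") else ","
            out.append(f"{head}{label}{sep}{rest}")
        else:
            # No label set yet: create one after the metric name.
            name, _, rest = line.partition(" ")
            out.append(f"{name}{{{label}}} {rest}")
    return "\n".join(out)
```

Because the label is added between the exporter and storage, exporters stay untouched, and a query grouped by `zone` can show whether a failure is confined to one availability zone.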
Over two to three months, monitoring coverage was extended to five cloud providers, integrating 23 cloud product packages (SLB, OSS, CDN, DDoS protection, etc.) and adding network‑availability probes such as SmokePing. This turned previously opaque cloud failures into observable events, so the team can be alerted before the provider resolves the issue.
Per‑Second Business Monitoring
To achieve sub‑second monitoring, Huolala upgraded its log architecture to version 2.0, storing structured logs in ClickHouse and ingesting Nginx access logs for fine‑grained anomaly detection. Business‑critical dimensions such as vehicle model, city, marketing type, and order price are collected separately and joined with monitoring dashboards, enabling SQL‑like queries that can display per‑second or per‑millisecond metrics.
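The per‑second view amounts to grouping structured log records by a truncated timestamp and a business dimension. The sketch below shows that grouping in plain Python rather than in ClickHouse SQL. The record fields (`ts`, `city`) and the function name are assumptions for illustration, not the platform's schema.

```python
from collections import Counter
from datetime import datetime


def per_second_counts(records, dimension):
    """Count records per (second, dimension value) bucket.

    Each record is assumed to be a dict with an ISO-8601 "ts" field
    plus the business dimension to group by.
    """
    counts = Counter()
    for r in records:
        # Truncate the timestamp to whole seconds to form the time bucket.
        second = datetime.fromisoformat(r["ts"]).replace(microsecond=0)
        counts[(second.isoformat(), r[dimension])] += 1
    return dict(counts)
```

In a columnar store the same result would typically come from a `GROUP BY` on a second-truncated timestamp and the dimension column; truncating to milliseconds instead gives the finer‑grained view.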
Dashboards and alarms are configured through a point‑and‑click UI that automatically links metric queries to alarms. The system also supports dynamic alarm aggregation, merging bursts of related alerts into a single notification based on AppId, host, IP, and other tags.
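Alarm aggregation of this kind can be sketched as grouping alerts by a tag tuple within a time window. The 60‑second window, the field names (`app_id`, `host`, `ts`, `msg`), and the function name are assumptions; the article does not specify the platform's merge rules.

```python
def aggregate_alerts(alerts, keys=("app_id", "host"), window_s=60):
    """Merge alerts sharing the same tag values into one group per burst.

    An alert joins the open group for its tags if it arrives within
    `window_s` seconds of that group's first alert; otherwise it opens
    a new group. Timestamps ("ts") are seconds as numbers.
    """
    open_groups = {}  # tag tuple -> most recent group for those tags
    groups = []       # all groups, in order of first alert
    for a in sorted(alerts, key=lambda a: a["ts"]):
        k = tuple(a.get(key) for key in keys)
        g = open_groups.get(k)
        if g is None or a["ts"] - g["first_ts"] > window_s:
            g = {"key": k, "first_ts": a["ts"], "count": 0, "messages": []}
            open_groups[k] = g
            groups.append(g)
        g["count"] += 1
        g["messages"].append(a["msg"])
    return groups
```

Each resulting group would map to one notification that carries a count and the merged messages, instead of one notification per raw alert.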
Unified Alarm Workflow and Intelligent Features
Alarms are delivered to developers via Feishu cards, which include direct links to the corresponding metric curve, trace view, and log details. Clicking a card opens the mobile Monitor UI, allowing on‑the‑go investigation. The platform provides automatic incident analysis: after a failure, it lists recent code changes, checks key business metrics (HTTP/SOA success rates, latency), and correlates them with external service anomalies.
Additional intelligent capabilities include:
Dynamic alarm routing based on AppId, department, or severity.
Five no‑threshold algorithms that use volatility and trend detection instead of static thresholds.
Data replay for the past seven days to validate alarm configurations.
Post‑release health checks that monitor 41 indicators and log anomaly counts, pushing alerts when thresholds are crossed.
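To make the no‑threshold idea concrete, here is a generic volatility check based on a rolling mean and standard deviation. It is a common textbook technique, swapped in for illustration; the article does not disclose what the five algorithms are, and the function name, window size, and multiplier `k` are assumptions.

```python
from statistics import mean, pstdev


def volatility_anomalies(series, window=5, k=3.0, min_sigma=1e-9):
    """Return indices whose value deviates from the mean of the preceding
    `window` points by more than `k` standard deviations of those points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        # A floor on sigma avoids a zero band when the history is flat.
        sigma = max(pstdev(history), min_sigma)
        if abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged
```

Running a detector like this over stored historical series is also one way to picture the seven‑day replay: the same configuration is evaluated against past data to see which points would have raised an alarm before it goes live.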
Future Foundations: GPT‑Powered Monitoring
Huolala is exploring the integration of ChatGPT for knowledge generation and automation. The demonstrated use cases are:
Monitoring chatbot that answers metric‑related questions by querying open‑source documentation.
Automated weekly stability reports generated from Feishu data and a knowledge base.
Release platform integration where GPT suggests root causes and remediation steps for build failures.
These prototypes illustrate how large language models can augment observability by providing instant explanations, generating narratives, and reducing manual triage effort.
Challenges and Outlook
Key challenges remain in cost‑effective data collection, scaling observability pipelines, and continuously integrating emerging AI technologies while keeping the platform product‑focused and user‑centric.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
