Designing a Multi‑Cloud Intelligent Monitoring Platform at Huolala: Architecture, Practices, and Future Directions

This article details Huolala's one‑stop monitoring platform called Monitor, covering its multi‑cloud architecture, data collection pipelines, real‑time business monitoring, unified alarm handling, and future AI‑driven enhancements, while sharing concrete metrics, incident case studies, and practical implementation steps for large‑scale observability.

dbaplus Community

Aug 22, 2023

One‑Stop Monitoring Platform in a Multi‑Cloud Architecture

Huolala's Monitor platform aggregates metrics, traces, and logs from applications, machines, containers, middleware (DB, Redis), cloud services, and network devices into a single UI accessible via PC and mobile. Compared with the previous year, most indicators have doubled, while daily alarm volume was reduced to about 4,000 after noise‑reduction work.

Metrics are collected through the Prometheus ecosystem, enhanced by a custom Transformation component for filtering and enrichment, and persisted in VictoriaMetrics. Traces are captured via SkyWalking, which uses bytecode instrumentation to cover a wide range of open‑source and internal middleware. Logs are gathered by Filebeat, processed by a proprietary LogProxy, and stored in Elasticsearch and ClickHouse.

The platform also exposes a simple UI and an Open API for third‑party integration.

Multi‑Cloud Monitoring and High‑Availability Design

A real incident in April 2022, where an entire rack of machines in a data center failed, highlighted the need for cloud‑aware monitoring and zone‑level observability. Huolala introduced a "multi‑availability‑zone" design, adding a zone label to all collected metrics via a Prometheus proxy, enabling failure detection per zone without modifying upstream exporters.
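The zone‑labeling idea can be illustrated with a minimal sketch. It assumes a proxy that rewrites Prometheus text‑format scrape output before forwarding it; the function name `add_zone_label` and the zone value `az-1` are hypothetical, and the article does not describe how Huolala's proxy is actually implemented.

```python
import re

# Matches a Prometheus text-format sample line: metric name, optional
# {labels} block, then the value (and optional timestamp).
# Assumption: label values do not contain a "}" character.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?(\s.+)$')


def add_zone_label(exposition: str, zone: str) -> str:
    """Return the exposition text with zone="<zone>" added to every sample.

    Assumption: `zone` contains no quotes or backslashes, so no escaping
    is needed.
    """
    out = []
    for line in exposition.splitlines():
        # Blank lines and comments (# HELP / # TYPE) pass through untouched.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        m = _SAMPLE.match(line)
        if m is None:
            out.append(line)
            continue
        name, labels, rest = m.group(1), m.group(2), m.group(3)
        inner = labels[1:-1].rstrip(",") if labels else ""
        pair = f'zone="{zone}"'
        merged = f"{inner},{pair}" if inner else pair
        out.append(f"{name}{{{merged}}}{rest}")
    return "\n".join(out)


scraped = "\n".join([
    "# TYPE up gauge",
    "up 1",
    'http_requests_total{code="200"} 42',
])
labeled = add_zone_label(scraped, "az-1")
```

Because the label is attached in the proxy, the exporters themselves stay unchanged, which matches the article's point about not modifying upstream exporters.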

Over two to three months, monitoring coverage was extended to five cloud providers, integrating 23 cloud product packages (SLB, OSS, CDN, DDoS protection, etc.) and implementing network‑availability probes such as SmokePing. This turned previously opaque cloud failures into observable events, allowing proactive alerts before the provider resolves the issue.

Second‑Level Business Monitoring

To achieve second‑level (per‑second) monitoring, Huolala upgraded its log architecture to version 2.0, storing structured logs in ClickHouse and ingesting Nginx access logs for fine‑grained anomaly detection. Business‑critical dimensions such as vehicle model, city, marketing type, and order price are collected separately and joined with monitoring dashboards, enabling SQL‑like queries that can display per‑second or per‑millisecond metrics.
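As a rough illustration of per‑second aggregation over a business dimension, the sketch below buckets access‑log records by second and city and computes an error rate for each bucket. The `AccessLog` fields and the function name are hypothetical; in the platform described here, the equivalent grouping would be done by queries against ClickHouse rather than in application code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessLog:
    ts_ms: int   # request timestamp in epoch milliseconds
    city: str    # business dimension attached to the log record
    status: int  # HTTP status code


def per_second_error_rate(logs):
    """Map (epoch second, city) to the fraction of 5xx responses."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for log in logs:
        key = (log.ts_ms // 1000, log.city)  # truncate to the second
        totals[key] += 1
        if log.status >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


logs = [
    AccessLog(1000, "Shenzhen", 200),
    AccessLog(1500, "Shenzhen", 502),
    AccessLog(2100, "Shenzhen", 200),
    AccessLog(1200, "Chengdu", 200),
]
rates = per_second_error_rate(logs)
```

Changing the divisor in `ts_ms // 1000` changes the bucket width, which is how the same grouping could be displayed at finer or coarser granularity.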

Dashboard configuration and alarm setup are performed through a point‑and‑click UI that automatically links metric queries to alarms. The system also supports dynamic alarm aggregation, merging bursts of related alerts into a single notification based on AppId, host, IP, and other tags.

Unified Alarm Workflow and Intelligent Features

Alarms are delivered to developers via Feishu cards, which include direct links to the corresponding metric curve, trace view, and log details. Clicking a card opens the mobile Monitor UI, allowing on‑the‑go investigation. The platform provides automatic incident analysis: after a failure, it lists recent code changes, checks key business metrics (HTTP/SOA success rates, latency), and correlates them with external service anomalies.

Additional intelligent capabilities include:

Dynamic alarm routing based on AppId, department, or severity.

Five no‑threshold algorithms that use volatility and trend detection instead of static thresholds.

Data replay for the past seven days to validate alarm configurations.

Post‑release health checks that monitor 41 indicators and log anomaly counts, pushing alerts when thresholds are crossed.
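To make the idea of threshold‑free detection concrete, here is a minimal sketch of one generic volatility‑based approach: a rolling z‑score that flags a point when it deviates from the recent mean by more than `k` standard deviations. The article does not name its five algorithms, so this is a swapped‑in textbook technique rather than Huolala's method, and the function name, `window`, and `k` are assumptions.

```python
from statistics import mean, pstdev


def volatility_anomalies(series, window=5, k=3.0):
    """Return indexes whose value deviates from the rolling mean of the
    previous `window` points by more than k standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        # Floor sigma so a perfectly flat history does not divide by zero.
        sigma = max(pstdev(history), 1e-9)
        if abs(series[i] - mu) / sigma > k:
            anomalies.append(i)
    return anomalies


qps = [100, 102, 98, 101, 99, 100, 180, 101]
flagged = volatility_anomalies(qps)
```

No fixed limit such as "QPS above 150" appears anywhere; the baseline is derived from the series itself, which is what lets the same rule be reused across metrics with very different scales.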

Future Foundations: GPT‑Powered Monitoring

Huolala is exploring ChatGPT integration for knowledge generation and automation. The demonstrated use cases are:

Monitoring chatbot that answers metric‑related questions by querying open‑source documentation.

Automated weekly stability reports generated from Feishu data and a knowledge base.

Release platform integration where GPT suggests root causes and remediation steps for build failures.

These prototypes illustrate how large‑language models can augment observability by providing instant explanations, generating narratives, and reducing manual triage effort.

Challenges and Outlook

Key challenges remain in cost‑effective data collection, scaling observability pipelines, and continuously integrating emerging AI technologies while keeping the platform product‑focused and user‑centric.
