How Huolala Built an AI‑Powered Intelligent Monitoring Platform at Scale

This article details Huolala's journey from basic monitoring to an AI‑driven intelligent observability platform, covering AIOps concepts, a comprehensive monitoring framework, practical implementations, automated alert analysis, lessons learned, and future directions for large‑scale operations.

Huolala Tech

Introduction

Ko Sheng, head of Huolala's monitoring platform, shares his experience building the company's monitoring system after joining the team a year ago. Huolala operates in 352 cities with 660,000 active drivers and 8.4 million daily users, using a diverse tech stack (Java, PHP, Golang, C++).

AIOps and Intelligent Monitoring

Monitoring, observability, and AIOps are related but distinct. Traditional monitoring collects and displays data, observability aggregates system‑wide metrics, and AIOps applies intelligent algorithms to automate analysis. "Intelligent monitoring" combines full‑stack observability with AI‑driven analysis in service of the same monitoring goals.

Huolala's Intelligent Monitoring Framework

The framework examines monitoring from four perspectives:

Business functions: providing metrics, alerts, and troubleshooting support.

Data elements: logs, traces, metrics, and events.

Personnel and organization: developers, NOC, stability teams, and leadership.

Feature requirements: accuracy, timeliness, automation, scalability, and high availability.

The diagram below illustrates the core monitoring elements.

Framework diagram

Intelligent Monitoring Practice

Huolala's one‑stop monitoring platform, Monitor, integrates metrics, tracing, logs, and alerts. Each day it ingests 7 TB of metrics, 23 TB of trace data, and 150 TB of logs, and it supports more than 7,000 custom alert rules and over 600 daily users.

Key components:

Metrics: Prometheus ecosystem with remote‑write to VictoriaMetrics, plus custom transformation and proxy services.

Trace: SkyWalking SDK with Kafka ingestion, storing indexes in Elasticsearch and raw traces in HBase.

Logs: Filebeat + Logstash pipeline feeding Elasticsearch.

Data query: Dedicated API services for each data type.

AIOps API: Builds application topology stored in a graph database.
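As a rough illustration of what an application topology makes possible, the sketch below keeps call edges in a plain dictionary and walks them breadth‑first to list an app's downstream dependencies. The app names, the `TOPOLOGY` structure, and the `downstream` helper are all hypothetical; the article does not say which graph database or query language Huolala uses, so this only mirrors the idea in memory.

```python
from collections import deque

# Hypothetical call edges: caller -> direct downstream apps.
# In the platform described above these would live in a graph database.
TOPOLOGY = {
    "order-api": ["pricing-svc", "driver-match"],
    "pricing-svc": ["geo-svc"],
    "driver-match": ["geo-svc", "push-svc"],
    "geo-svc": [],
    "push-svc": [],
}


def downstream(app, topology, max_depth=None):
    """Return apps reachable from `app` in breadth-first order, excluding `app`.

    `max_depth=1` limits the result to direct dependencies.
    """
    seen = {app}
    order = []
    queue = deque([(app, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in topology.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order
```

Calling `downstream("order-api", TOPOLOGY, max_depth=1)` returns only the direct dependencies, which is usually the first place to look when an upstream app starts timing out.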

Architecture diagram

Smart Monitoring Examples

In an incident scenario, an alert triggers NOC staff to view the affected app's dashboard, notice a spike in the soa.rt metric, click the trace link to see a downstream timeout, and jump to the related log via App+TraceId. The log reveals a misconfiguration, which is rolled back to restore service.
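The jump from a trace to its logs depends on both data sets sharing the App+TraceId key. A minimal sketch of that join, assuming log records and spans are dictionaries with hypothetical `app` and `trace_id` fields (the real schema is not given in the article):

```python
from collections import defaultdict


def index_logs(log_records):
    """Group log records by the (app, trace_id) pair."""
    index = defaultdict(list)
    for rec in log_records:
        index[(rec["app"], rec["trace_id"])].append(rec)
    return index


def logs_for_span(index, span):
    """Look up the log lines that belong to one span of a trace."""
    return index.get((span["app"], span["trace_id"]), [])


# Hypothetical sample data.
logs = [
    {"app": "pricing-svc", "trace_id": "t-1", "msg": "config value missing"},
    {"app": "pricing-svc", "trace_id": "t-2", "msg": "ok"},
    {"app": "order-api", "trace_id": "t-1", "msg": "downstream timeout"},
]
log_index = index_logs(logs)
slow_span = {"app": "pricing-svc", "trace_id": "t-1"}
matching = logs_for_span(log_index, slow_span)
```

Keying on the pair rather than on the trace ID alone keeps the lookup scoped to the app whose span looked suspicious, since one trace typically crosses several apps.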

Automation steps added after the alert include health analysis, recent change detection, downstream dependency checks, and pushing results to Feishu with suggested remediation actions.

Alert Automation and AI Layers

The alert pipeline performs:

Health analysis across error rates, RT, QPS, etc.

Change detection within the last 30 minutes.

Downstream dependency health assessment.
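The three checks above can be sketched as a small enrichment step that runs when an alert fires. Everything here is an assumption made for illustration: the `Health` record, the threshold values, the shape of the change records, and the `analyze_alert` function are hypothetical and not taken from Huolala's implementation. Only the 30‑minute change window comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CHANGE_WINDOW = timedelta(minutes=30)  # look-back window named in the text


@dataclass
class Health:
    error_rate: float  # fraction of failed requests, 0..1
    rt_ms: float       # response time in milliseconds
    qps: float         # queries per second


def health_findings(cur, base, max_error_rate=0.01, rt_ratio=2.0, qps_drop=0.5):
    """Compare current health to a baseline; thresholds are illustrative."""
    findings = []
    if cur.error_rate > max_error_rate:
        findings.append("error rate above limit")
    if cur.rt_ms > rt_ratio * base.rt_ms:
        findings.append("RT well above baseline")
    if cur.qps < qps_drop * base.qps:
        findings.append("QPS well below baseline")
    return findings


def recent_changes(changes, app, alert_time, window=CHANGE_WINDOW):
    """Changes to `app` made no more than `window` before the alert."""
    return [c for c in changes
            if c["app"] == app
            and timedelta(0) <= alert_time - c["time"] <= window]


def analyze_alert(app, alert_time, current, baseline, changes, downstream_apps):
    """Assemble the three checks into one report dictionary."""
    return {
        "app": app,
        "health": health_findings(current[app], baseline[app]),
        "changes": recent_changes(changes, app, alert_time),
        "unhealthy_downstream": [
            d for d in downstream_apps
            if health_findings(current[d], baseline[d])
        ],
    }
```

The returned dictionary is the kind of summary that could then be formatted and pushed to a chat channel such as Feishu. That delivery step is left out because it depends on the messaging API in use.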

The system is organized into four layers:

Foundation: noise reduction, silencing, audit.

Algorithm: smoothing, threshold‑free detection, anomaly interpolation.

Query: metric, metadata, topology retrieval with caching.

Rule: global templates, customizable per‑team rules, and composable alert models.
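To make the algorithm layer more concrete, here is a minimal sketch of two of its ideas: smoothing and detection without a hand‑set threshold. The article does not name the algorithms Huolala uses, so this substitutes two common techniques: an exponentially weighted moving average for smoothing, and a rolling median with median absolute deviation (MAD) for detection. The function names, window size, and `k` factor are assumptions.

```python
from statistics import median


def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; higher alpha follows the data faster."""
    smoothed = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def mad_anomalies(values, window=10, k=3.5):
    """Return indexes of points that deviate strongly from recent history.

    Each point is compared with the median of the preceding `window` points,
    scaled by their MAD, so no absolute threshold on the metric is needed.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        med = median(history)
        mad = median([abs(x - med) for x in history])
        scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
        if scale == 0:
            scale = 1e-9  # flat history: any deviation counts as large
        if abs(values[i] - med) / scale > k:
            flagged.append(i)
    return flagged


# Hypothetical response-time samples in milliseconds with one spike.
rt = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 100, 400, 102]
spikes = mad_anomalies(rt)
```

Because the median and MAD are robust to outliers, the spike does not inflate the baseline enough to mask itself or to flag the normal point that follows it.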

AI layer diagram

Experience and Reflection

Design instrumentation standards early with extensibility.

Prioritize user‑centric design; provide SQL‑like query wrappers and visual alert configuration for developers.

Maintain system transparency and self‑service troubleshooting guides.

Balance cost vs. benefit; focus on high‑value data collection.
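The SQL‑like query wrapper mentioned among these lessons could take many forms. As one hedged sketch, the hypothetical helper below turns a few SQL‑flavored arguments (metric, where, group by) into a PromQL expression string, given that the metrics stack is Prometheus‑compatible. The `to_promql` function and its parameters are invented for illustration, and the helper does not escape special characters in label values.

```python
def to_promql(metric, where=None, group_by=None, agg="sum", rate_window=None):
    """Build a PromQL aggregation from SQL-like pieces.

    metric      -> the "FROM" part (metric name)
    where       -> equality filters, rendered as label matchers
    group_by    -> labels for the `by (...)` clause
    rate_window -> if set, wrap the selector in rate(...[window])
    """
    matchers = ",".join(f'{k}="{v}"' for k, v in sorted((where or {}).items()))
    selector = f"{metric}{{{matchers}}}" if matchers else metric
    expr = f"rate({selector}[{rate_window}])" if rate_window else selector
    by = f" by ({', '.join(group_by)})" if group_by else ""
    return f"{agg}{by} ({expr})"


query = to_promql(
    "http_requests_total",
    where={"env": "prod"},
    group_by=["app"],
    rate_window="5m",
)
```

A wrapper like this lets developers describe what they want in familiar terms while the platform controls the query that actually reaches the time‑series store.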

Summary and Outlook

Huolala achieved 100% coverage of core applications and alerts, with 99.98% service availability in recent months. Most incidents were detected within 5 minutes and resolved within 25 minutes. Future plans include more AI‑driven alert models, an expanded knowledge graph, and evolving the monitoring platform into modular services that support larger‑scale operations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
