
How Huolala Built an AI‑Powered End‑to‑End Monitoring Platform

This article details Huolala's journey from a fragmented monitoring stack to a unified, AI‑enhanced observability platform, covering AIOps concepts, the design of a comprehensive monitoring framework, concrete implementation of metrics, tracing, logging, alerting, and lessons learned for large‑scale operations.

dbaplus Community

Background

Huolala operates in 352 cities, with more than 660,000 active drivers and more than 8 million daily users. Its services are written in Java, PHP, Go, C++ and other languages, so the monitoring system must scale with rapid business growth while providing unified observability across all of them.

AIOps and Intelligent Monitoring

Traditional monitoring collects and visualises metrics. Observability extends this to full‑stack data (metrics, traces, logs, events) and metadata. AIOps applies AI/ML algorithms to automate operations. "Intelligent monitoring" is the intersection of observability and AIOps: it consumes multi‑layer data and uses intelligent techniques to achieve monitoring goals such as real‑time alerting, root‑cause analysis and automated remediation.

Huolala Intelligent‑Monitoring Framework

The framework evaluates monitoring along four dimensions:

Business functions – provide developers with multi‑dimensional application data, real‑time alerts, and support both emergency handling and daily stability operations.

Data elements – ingest metrics, traces, logs, events, plus metadata (business type, importance, ownership, topology, infrastructure).

People & organization – serve developers, NOC, stability teams and executives, each with distinct needs.

Feature requirements – guarantee data accuracy, timely alerts, automated pipelines, flexible architecture, high availability, and stable data export for downstream systems.

The architecture is layered: data collection → transformation → storage → query → AI‑driven analysis.

Implementation – the "Monitor" Platform

Monitor is a one‑stop platform built on open‑source components (Prometheus, VictoriaMetrics, SkyWalking, Elasticsearch, HBase) plus custom services.

Metrics – Prometheus remote‑write feeds VictoriaMetrics; a custom transformation component trims payloads and a proxy component accelerates queries and enforces rate limits.

Tracing – the SkyWalking SDK is injected through bytecode instrumentation; trace data is sent through Kafka, indexed in Elasticsearch, and the raw traces are stored in HBase.

Logging – Filebeat → Logstash → a custom consumer that writes logs to Elasticsearch.

Data query – Dedicated API services for each data type; an AIOps API builds an application topology stored in a graph database.
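The article does not describe how the transformation and proxy components work internally. As a rough illustration only, the sketch below assumes that "trimming payloads" means dropping labels outside an allow-list and that rate limiting uses a token bucket. The names `KEEP_LABELS`, `trim_labels` and `TokenBucket` are hypothetical and are not part of Monitor, Prometheus or VictoriaMetrics.

```python
# Hypothetical allow-list; the article does not say which labels are trimmed.
KEEP_LABELS = {"__name__", "app", "instance", "env"}


def trim_labels(labels):
    """Return a copy of a sample's label set with non-allow-listed labels removed."""
    return {k: v for k, v in labels.items() if k in KEEP_LABELS}


class TokenBucket:
    """Token-bucket limiter: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate, capacity, start):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.updated = start        # timestamp of the last refill, in seconds

    def allow(self, now):
        """Consume one token if available; `now` is a monotonic timestamp in seconds."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a real proxy the caller would pass `time.monotonic()` as `now` and keep one bucket per user or per query source; taking the clock as an argument simply keeps the sketch deterministic.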

Current scale: ~7 TB of metrics, 23 TB of trace data, 150 TB of logs per day; >7 000 custom alert rules; >600 daily active users.

Intelligent Alerting Workflow

Analyse health indicators (error count, HTTP/SOA success rate, latency, QPS).

Detect configuration changes within the last 30 minutes.

Evaluate downstream application health using the topology graph.

Results are pushed to Feishu; future work will attach concrete remediation steps and run‑book links.
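The three checks above can be sketched as a single diagnosis function. This is a minimal illustration under stated assumptions, not Huolala's implementation: the `Health` record, the limit constants and the in-memory `changes` and `downstream` structures are all hypothetical stand-ins for the platform's metric queries, change log and topology graph.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Health:
    error_count: int
    success_rate: float   # HTTP/SOA success rate, 0.0 to 1.0
    latency_ms: float
    qps: float            # carried for context; judging it needs a baseline


# Hypothetical limits chosen for illustration only.
MAX_ERRORS = 10
MIN_SUCCESS_RATE = 0.99
MAX_LATENCY_MS = 500.0


def is_unhealthy(h):
    """Step 1: judge an application's health indicators against the limits."""
    return (h.error_count > MAX_ERRORS
            or h.success_rate < MIN_SUCCESS_RATE
            or h.latency_ms > MAX_LATENCY_MS)


def diagnose(app, now, health, changes, downstream):
    """Combine the three checks for one alerting application.

    health:     dict mapping app name -> Health
    changes:    list of (app name, datetime of configuration change)
    downstream: dict mapping app name -> list of apps it calls
    """
    window_start = now - timedelta(minutes=30)
    # Step 2: configuration changes in the last 30 minutes.
    recent = [t for a, t in changes if a == app and window_start <= t <= now]
    # Step 3: downstream applications that are themselves unhealthy.
    bad_deps = [d for d in downstream.get(app, [])
                if d in health and is_unhealthy(health[d])]
    return {
        "app": app,
        "unhealthy": is_unhealthy(health[app]),
        "recent_changes": recent,
        "unhealthy_downstream": bad_deps,
    }
```

The returned dictionary is the kind of summary that could be formatted into a chat notification; the message format itself is not described in the article.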

Adaptive Thresholds & Noise Reduction

Static expert thresholds are being replaced by smoothing algorithms and machine‑learning models that adapt to traffic patterns. Noise reduction combines time‑based suppression, type‑based aggregation, and per‑application aggregation.
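The article does not name the smoothing algorithm in use. One common choice is an exponentially weighted moving average (EWMA) with a band of k standard deviations around it, sketched below as an assumption. The class name `EwmaDetector` and the default values of `alpha`, `k` and `warmup` are illustrative, not taken from the source.

```python
import math


class EwmaDetector:
    """Flags points that fall outside mean ± k·std, where mean and variance
    are exponentially weighted and so follow gradual traffic changes."""

    def __init__(self, alpha=0.3, k=3.0, warmup=5):
        self.alpha = alpha      # smoothing factor: higher reacts faster
        self.k = k              # band width in standard deviations
        self.warmup = warmup    # points to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Score x against the band learned so far, then fold x into the state."""
        if self.mean is None:
            self.mean = x
            self.n = 1
            return False
        std = math.sqrt(self.var)
        anomalous = self.n >= self.warmup and abs(x - self.mean) > self.k * std
        # Incremental EWMA updates for mean and variance.
        diff = x - self.mean
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.n += 1
        return anomalous
```

Because the band is derived from recent data, a metric that normally sits near 100 and one that normally sits near 10,000 can share the same rule, which is the main advantage over a fixed expert threshold. Note that this sketch also folds anomalous points into the baseline; a production detector would likely handle that case more carefully.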

Layered Intelligent‑Alert Architecture

Base layer – de‑duplication, silencing, audit logging.

Algorithm layer – smoothing, anomaly detection, interpolation for missing data.

Query layer – unified access to metrics, metadata, topology with caching.

Rule layer – global alert templates, team‑specific custom rules, composable alert models.
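As an illustration of what the base layer's de-duplication and silencing might involve, the sketch below suppresses repeats of the same (application, alert type) pair inside a time window and then groups what remains per application. The class `AlertReducer`, the function `group_by_app` and the 300-second window are assumptions made for this example, not details from the article.

```python
from collections import defaultdict


class AlertReducer:
    """Time-based suppression keyed on (application, alert type)."""

    def __init__(self, suppress_seconds=300):
        self.suppress_seconds = suppress_seconds
        self.last_sent = {}   # (app, kind) -> timestamp of last notification

    def filter(self, alerts):
        """alerts: iterable of (timestamp_seconds, app, kind).
        Returns the alerts that survive suppression, in time order."""
        passed = []
        for ts, app, kind in sorted(alerts):
            key = (app, kind)
            last = self.last_sent.get(key)
            if last is not None and ts - last < self.suppress_seconds:
                continue   # duplicate inside the window: drop it
            self.last_sent[key] = ts
            passed.append((ts, app, kind))
        return passed


def group_by_app(alerts):
    """Aggregate surviving alerts so each application yields one notification."""
    grouped = defaultdict(list)
    for _ts, app, kind in alerts:
        grouped[app].append(kind)
    return dict(grouped)
```

Keeping suppression state in the reducer means the window is measured from the last notification actually sent, so a continuously firing alert is re-sent once per window rather than silenced indefinitely.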

Practices & Reflections

Define instrumentation standards early to avoid costly retrofits.

Design for developers: provide SQL‑like wrappers for PromQL and visual alert configuration.

Keep the system transparent; enable self‑diagnosis.

Balance cost vs. benefit – not every data point needs to be collected.
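The article does not show what the SQL-like wrapper for PromQL looks like. A minimal sketch of the idea, assuming a structured query (metric, filters, aggregation, grouping) is translated into a PromQL string, could look like this. The function `to_promql` and its parameters are hypothetical; only the generated PromQL syntax is standard.

```python
ALLOWED_AGGS = {"sum", "avg", "min", "max", "count"}


def _quote(value):
    """Escape backslashes and double quotes for a PromQL label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def to_promql(metric, where=None, agg=None, by=None, rate_window=None):
    """Translate a SELECT-like description into a PromQL expression.

    metric:      metric name                    (roughly FROM)
    where:       dict of label equality filters (WHERE)
    agg, by:     aggregation and grouping       (SELECT agg ... GROUP BY)
    rate_window: e.g. "5m"; wraps the selector in rate(...[window])
    """
    selector = metric
    if where:
        matchers = ",".join(
            '{}="{}"'.format(k, _quote(v)) for k, v in sorted(where.items())
        )
        selector = metric + "{" + matchers + "}"
    expr = "rate({}[{}])".format(selector, rate_window) if rate_window else selector
    if agg:
        if agg not in ALLOWED_AGGS:
            raise ValueError("unsupported aggregation: " + agg)
        clause = " by ({})".format(", ".join(by)) if by else ""
        expr = "{}{} ({})".format(agg, clause, expr)
    return expr
```

For example, `to_promql("http_requests_total", where={"env": "prod"}, agg="sum", by=["app"], rate_window="5m")` produces `sum by (app) (rate(http_requests_total{env="prod"}[5m]))`. The metric and label names here are placeholders.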

Monitoring coverage for core applications reached 100 %; overall service availability is 99.98 %. In the observed period, 100 % of incidents were detected within 5 minutes, 89 % were located within 20 minutes, and 78 % were resolved within 25 minutes.

Summary & Outlook

The evolution from ad‑hoc tools to a mature AI‑enhanced platform demonstrates the necessity of unified observability, scalable architecture, and continuous automation. Planned next steps include deeper AIOps model training, development of proprietary time‑series and tracing stores, and modularising monitoring into dedicated services for metrics, logs and stability as the organisation scales.
