How HuoLala Engineered a Scalable, High‑Availability Monitoring System for Multi‑Cloud
This article details the evolution of monitoring technologies, HuoLala's three‑phase monitoring architecture, the integration of Prometheus, VictoriaMetrics and SkyWalking, zero‑intrusion bytecode instrumentation, full‑link trace sampling, visual dashboards, metric‑trace‑log correlation, and future plans for root‑cause analysis and intelligent alerting.
Introduction
This article is based on Cao Wei's live-streamed talk "Stability and High-Availability Construction of Core Infrastructure in Multi-Cloud Scenarios" and summarizes the key points of HuoLala's monitoring system.
1. Monitoring Evolution
The monitoring industry started with eBay's CAL (2002), followed by Google's Dapper (2010) and commercial products like Datadog. Subsequent open‑source projects include CAT (2011), Zipkin (2012), EagleEye, Pinpoint, EMONITOR (2014), SkyWalking (2015), and Jaeger (2016).
2. HuoLala Monitoring Evolution
HuoLala's monitoring history is divided into three stages:
Monitoring 1.0: Independent Prometheus instances per team, no unified tracing, low efficiency.
Monitoring 2.0: Standardized metrics, full-link tracing, zero-code bytecode instrumentation, unified dashboards, intelligent alerts.
Monitoring 3.0: Full-link sampling with a 60% storage cost reduction, closed-loop metric-trace-log integration, advanced dashboards.
3. Overall Architecture
The architecture consists of a Prometheus cluster for metric collection, a VictoriaMetrics cluster for time-series storage, Elasticsearch (ES) and HBase for trace data, and core services for trace display and intelligent alerting. SkyWalking serves as the trace engine, with extensive custom enhancements.
4. Bytecode Enhancement
HuoLala uses bytecode enhancement to achieve zero-intrusion instrumentation. The technique uses a Java Agent to modify class bytecode before or while a class is loaded, so monitoring code can be rolled out quickly without source changes.
4.1 Java Agent
A Java Agent registers a Transformer that intercepts class loading, rewrites the bytecode with a bytecode framework, and returns the enhanced class bytes to the JVM.
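The registration step can be sketched with the standard java.lang.instrument API. This is a minimal illustration rather than HuoLala's agent: the class name MonitoringAgent, the package prefix com/example/service/, and the matched counter are all hypothetical, and the transformer only counts matching classes instead of rewriting them.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

final class MonitoringAgent {
    // Hypothetical target package; the JVM passes class names with '/' separators.
    static final String TARGET_PREFIX = "com/example/service/";
    // Counts classes that a real agent would have instrumented.
    static final AtomicInteger matched = new AtomicInteger();

    static final class MonitoringTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className == null || !className.startsWith(TARGET_PREFIX)) {
                return null; // null tells the JVM to keep the original bytes
            }
            matched.incrementAndGet();
            // A real agent would pass classfileBuffer to ASM, Javassist, or
            // ByteBuddy here and return the rewritten bytes.
            return null;
        }
    }

    // Called by the JVM before main() when started with -javaagent:<agent jar>.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new MonitoringTransformer());
    }
}
```

For the JVM to find premain, the agent jar's manifest needs a Premain-Class entry naming the agent class.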
4.2 Bytecode Frameworks
ASM – low-level, steep learning curve.
Javassist – higher-level, injects code written as strings.
ByteBuddy – highest-level, supports AOP-style coding and debugging.
ByteBuddy is recommended for its ease of use and debugging support.
5. Full‑Link Trace Construction
Three architecture versions are described:
Trace 1.0: Native SkyWalking with ES storage; limited scalability.
Trace 2.0: Separate trace and analysis services, with ES for metadata and HBase for raw data, reaching million-level TPS and 100 TB of data per day.
Trace 3.0: Differential sampling (hot vs. cold data) that reduces storage by 60% while keeping one hour of full data.
5.1 Sampling Strategies
Standard sampling based on TraceID provides uniform reduction but cannot target valuable data. HuoLala implements fine‑grained sampling on spans (e.g., SOA >500 ms, Redis >20 ms) and full‑link sampling using Kafka delayed consumption combined with Bloom Filters to retain complete trace chains.
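A rough sketch of how threshold-based span sampling and a Bloom filter could combine to keep whole traces is shown below. It is written under assumptions the talk summary does not spell out: spans above the latency threshold are treated as the valuable ones, the first pass stands in for real-time consumption, and the second pass stands in for the delayed Kafka consumer. The class names, span type labels, and hash mixing are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class Span {
    final String traceId;
    final String type;      // "SOA" or "REDIS"; labels assumed for this sketch
    final long durationMs;

    Span(String traceId, String type, long durationMs) {
        this.traceId = traceId;
        this.type = type;
        this.durationMs = durationMs;
    }
}

/** Minimal Bloom filter over trace IDs using two hash probes. */
final class TraceIdBloomFilter {
    private final BitSet bits;
    private final int size;

    TraceIdBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int index(String id, int seed) {
        int h = id.hashCode() * 31 + seed * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, size);
    }

    void add(String traceId) {
        bits.set(index(traceId, 1));
        bits.set(index(traceId, 2));
    }

    // May report a false positive, never a false negative.
    boolean mightContain(String traceId) {
        return bits.get(index(traceId, 1)) && bits.get(index(traceId, 2));
    }
}

final class FullLinkSampler {
    // Thresholds are the examples quoted in the article.
    static boolean isValuable(Span s) {
        switch (s.type) {
            case "SOA":   return s.durationMs > 500;
            case "REDIS": return s.durationMs > 20;
            default:      return false;
        }
    }

    // Pass 1 (real time): remember the trace IDs of valuable spans.
    static void mark(List<Span> spans, TraceIdBloomFilter filter) {
        for (Span s : spans) {
            if (isValuable(s)) filter.add(s.traceId);
        }
    }

    // Pass 2 (delayed): keep every span of a marked trace so the chain stays complete.
    static List<Span> retain(List<Span> spans, TraceIdBloomFilter filter) {
        List<Span> kept = new ArrayList<>();
        for (Span s : spans) {
            if (filter.mightContain(s.traceId)) kept.add(s);
        }
        return kept;
    }
}
```

Because the second pass filters by trace ID rather than by span, a fast Redis span that belongs to a slow trace is still retained. The cost of the Bloom filter is an occasional false positive, which keeps some extra traces but never drops a marked one.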
6. Monitoring Visualization
Dashboards display metrics (QPS, RT) with clickable links to trace details; trace pages show spans, exceptions, and logs; and topology graphs illustrate service call paths.
7. Metric‑Trace‑Log Closed‑Loop
Metrics provide APPID, name, tags, and timestamps, which are used to query trace metadata in ES, then retrieve full trace details from HBase. TraceID links logs to traces, and business tags (OrderID, UserID) are embedded in traces for cross‑service correlation.
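The two-step lookup described above can be illustrated with in-memory stand-ins. This sketch makes no claim about HuoLala's actual schema or client code: a list plays the role of the ES metadata index, a map plays the role of the HBase raw store, and every class, field, and method name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TraceMeta {
    final String traceId;
    final String appId;
    final String endpoint;
    final long timestampMs;

    TraceMeta(String traceId, String appId, String endpoint, long timestampMs) {
        this.traceId = traceId;
        this.appId = appId;
        this.endpoint = endpoint;
        this.timestampMs = timestampMs;
    }
}

final class ClosedLoopLookup {
    // Stand-in for the ES metadata index.
    private final List<TraceMeta> metadataIndex = new ArrayList<>();
    // Stand-in for the HBase raw store, keyed by trace ID.
    private final Map<String, List<String>> rawTraceStore = new HashMap<>();

    void store(TraceMeta meta, List<String> spans) {
        metadataIndex.add(meta);
        rawTraceStore.put(meta.traceId, spans);
    }

    // Step 1: a metric point (app, name, time window) selects candidate trace IDs.
    List<String> findTraceIds(String appId, String endpoint, long fromMs, long toMs) {
        List<String> ids = new ArrayList<>();
        for (TraceMeta m : metadataIndex) {
            if (m.appId.equals(appId) && m.endpoint.equals(endpoint)
                    && m.timestampMs >= fromMs && m.timestampMs <= toMs) {
                ids.add(m.traceId);
            }
        }
        return ids;
    }

    // Step 2: a trace ID fetches the full span list.
    List<String> loadTrace(String traceId) {
        return rawTraceStore.getOrDefault(traceId, Collections.emptyList());
    }
}
```

Splitting the lookup this way keeps the searchable index small, since only metadata is indexed, while the bulky span data is fetched by key only for the traces a user actually opens.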
8. Future Outlook
8.1 Root‑Cause Analysis
Automated root‑cause analysis builds expert knowledge into rule engines that recursively trace exceptions and SOA failures to the originating service.
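The recursive tracing idea can be sketched as a walk over failed calls. This is only an illustration of the recursion, not the planned rule engine: the input shape (a map from each service to the downstream services whose calls failed) and the name RootCauseFinder are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RootCauseFinder {
    /**
     * Follows failed downstream calls from the given service and returns the
     * services at the end of each failing chain, i.e. failing services with
     * no failing dependency of their own.
     */
    static Set<String> findOrigins(String service, Map<String, List<String>> failedCalls) {
        Set<String> origins = new LinkedHashSet<>();
        walk(service, failedCalls, new HashSet<>(), origins);
        return origins;
    }

    private static void walk(String service, Map<String, List<String>> failedCalls,
                             Set<String> visited, Set<String> origins) {
        if (!visited.add(service)) {
            return; // already explored; guards against call cycles
        }
        List<String> downstream = failedCalls.getOrDefault(service, Collections.emptyList());
        if (downstream.isEmpty()) {
            origins.add(service); // nothing below it failed, so the fault starts here
            return;
        }
        for (String callee : downstream) {
            walk(callee, failedCalls, visited, origins);
        }
    }
}
```

A production rule engine would weigh more evidence than call failures alone, such as exception types and metric anomalies, but the recursion pattern stays the same.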
8.2 Intelligent Alerts & Playbooks
Smart alerting combines metric, cloud, and business signals to trigger actions such as scaling recommendations or automated remediation, with ongoing work to expand scenario coverage and automation.
