How HuoLala Engineered a Scalable, High‑Availability Monitoring System for Multi‑Cloud

This article details the evolution of monitoring technologies, HuoLala's three‑phase monitoring architecture, the integration of Prometheus, VictoriaMetrics and SkyWalking, zero‑intrusion bytecode instrumentation, full‑link trace sampling, visual dashboards, metric‑trace‑log correlation, and future plans for root‑cause analysis and intelligent alerting.

Huolala Tech

Sep 22, 2022

Introduction

This article is based on Cao Wei's live‑streamed talk "Stability and High‑Availability Construction of Core Infrastructure in Multi‑Cloud Scenarios" and summarizes the key points about HuoLala's monitoring system.

1. Monitoring Evolution

The monitoring industry started with eBay's CAL (2002), followed by Google's Dapper (2010) and commercial products like Datadog. Subsequent open‑source projects include CAT (2011), Zipkin (2012), EagleEye, Pinpoint, EMONITOR (2014), SkyWalking (2015), and Jaeger (2016).

2. HuoLala Monitoring Evolution

HuoLala's monitoring history is divided into three stages:

Monitoring 1.0: Independent Prometheus instances per team, no unified tracing, low efficiency.

Monitoring 2.0: Standardized metrics, full‑link tracing, zero‑code bytecode instrumentation, unified dashboards, intelligent alerts.

Monitoring 3.0: Full‑link sampling that cuts storage cost by 60%, closed‑loop metric‑trace‑log integration, advanced dashboards.

3. Overall Architecture

The architecture consists of a Prometheus cluster for metric collection, a VictoriaMetrics cluster for time‑series storage, Elasticsearch (ES) and HBase for trace data, and core services for trace display and intelligent alerting. SkyWalking serves as the trace engine, with extensive custom enhancements.

4. Bytecode Enhancement

HuoLala uses bytecode enhancement to achieve zero‑intrusion instrumentation. The technique uses a Java Agent to modify class bytecode before or during class loading, so monitoring code can be rolled out quickly without changing application source.

4.1 Java Agent

A Java Agent registers a transformer that intercepts class loading, modifies the bytecode with a bytecode framework, and returns the enhanced class to the JVM.

4.2 Bytecode Frameworks

ASM – low‑level, steep learning curve.

Javassist – higher‑level, uses string‑based code injection.

ByteBuddy – highest level, supports AOP style coding and debugging.

ByteBuddy is recommended for its ease of use and debugging support.

5. Full‑Link Trace Construction

Three architecture versions are described:

Trace 1.0: Native SkyWalking with ES storage; limited scalability.

Trace 2.0: Separate trace service and analysis service, with ES for metadata and HBase for raw data, handling million‑level TPS and 100 TB of data per day.

Trace 3.0: Differential sampling (hot vs. cold data) reduces storage by 60% while keeping full, unsampled data for one hour.

5.1 Sampling Strategies

Standard sampling based on TraceID provides uniform reduction but cannot target valuable data. HuoLala implements fine‑grained sampling on spans (e.g., SOA >500 ms, Redis >20 ms) and full‑link sampling using Kafka delayed consumption combined with Bloom Filters to retain complete trace chains.
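The span‑level rules and the full‑link retention step can be illustrated with a small sketch. It is not HuoLala's implementation: an in‑memory list stands in for the Kafka topic, a second pass over that list stands in for delayed consumption, and the names Span, BloomFilter, SLOW_THRESHOLD_MS, and sample_full_link are hypothetical. Only the two thresholds (SOA > 500 ms, Redis > 20 ms) come from the text.

```python
import hashlib
from dataclasses import dataclass

# Thresholds from the examples in the text, in milliseconds.
SLOW_THRESHOLD_MS = {"SOA": 500, "Redis": 20}


@dataclass(frozen=True)
class Span:
    trace_id: str
    kind: str           # e.g. "SOA" or "Redis"
    duration_ms: float


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def is_slow(span: Span) -> bool:
    """Fine-grained rule: a span is interesting if it exceeds its kind's threshold."""
    threshold = SLOW_THRESHOLD_MS.get(span.kind)
    return threshold is not None and span.duration_ms > threshold


def sample_full_link(spans: list) -> list:
    """Pass 1 marks trace IDs that contain a slow span.
    Pass 2 (the stand-in for delayed consumption) keeps every span of a marked trace,
    so the retained chains stay complete."""
    marked = BloomFilter()
    for span in spans:
        if is_slow(span):
            marked.add(span.trace_id)
    return [span for span in spans if span.trace_id in marked]


spans = [
    Span("t1", "SOA", 620),    # slow SOA call, marks trace t1
    Span("t1", "Redis", 3),    # fast, but kept because t1 is marked
    Span("t2", "SOA", 120),
    Span("t2", "Redis", 5),
]
kept = sample_full_link(spans)
```

The Bloom filter keeps the set of marked trace IDs compact. The trade‑off is that an occasional false positive retains a trace that no rule selected, which costs some storage but never breaks a chain that should be kept.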

6. Monitoring Visualization

Dashboards display metrics (QPS, RT) with clickable links to trace details; trace pages show spans, exceptions, and logs; and topology graphs illustrate service call paths.

7. Metric‑Trace‑Log Closed‑Loop

Metrics provide APPID, name, tags, and timestamps, which are used to query trace metadata in ES, then retrieve full trace details from HBase. TraceID links logs to traces, and business tags (OrderID, UserID) are embedded in traces for cross‑service correlation.
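The lookup path described above (metric context, then trace metadata, then full trace and logs) can be sketched as follows. This is a minimal illustration under stated assumptions: plain Python lists and dicts stand in for ES, HBase, and the log store, and every name and sample record here (TraceMeta, TRACE_INDEX, TRACE_STORE, LOG_LINES, find_trace_ids, load_trace) is hypothetical rather than taken from HuoLala's system.

```python
from dataclasses import dataclass, field


@dataclass
class TraceMeta:
    trace_id: str
    app_id: str
    name: str
    timestamp: int
    tags: dict = field(default_factory=dict)


# Stand-in for the ES index of trace metadata.
TRACE_INDEX = [
    TraceMeta("t-1", "order-svc", "POST /order", 1000, {"OrderID": "o-42"}),
    TraceMeta("t-2", "order-svc", "POST /order", 5000, {"OrderID": "o-43"}),
]

# Stand-in for raw span storage in HBase, keyed by trace ID.
TRACE_STORE = {
    "t-1": [("order-svc", "POST /order", 640), ("pay-svc", "charge", 580)],
    "t-2": [("order-svc", "POST /order", 35)],
}

# Stand-in for the log store; each line carries the trace ID it belongs to.
LOG_LINES = [
    {"trace_id": "t-1", "msg": "charge timed out"},
    {"trace_id": "t-2", "msg": "order created"},
]


def find_trace_ids(app_id, name, start, end, tags=None):
    """Step 1: use the metric's APPID, name, time window, and tags to find trace IDs."""
    tags = tags or {}
    return [
        meta.trace_id
        for meta in TRACE_INDEX
        if meta.app_id == app_id
        and meta.name == name
        and start <= meta.timestamp <= end
        and all(meta.tags.get(key) == value for key, value in tags.items())
    ]


def load_trace(trace_id):
    """Step 2: fetch the full spans by trace ID and join the logs on the same ID."""
    return {
        "spans": TRACE_STORE.get(trace_id, []),
        "logs": [line["msg"] for line in LOG_LINES if line["trace_id"] == trace_id],
    }


# A spike on the "POST /order" metric around t=1000 leads to trace t-1 and its logs.
ids = find_trace_ids("order-svc", "POST /order", 900, 1100)
detail = load_trace(ids[0])
```

Keeping only searchable metadata in the index and the bulky span payload in a key‑value store mirrors the ES/HBase split described in the text: the first query narrows down to trace IDs, and the second is a cheap lookup by key.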

8. Future Outlook

8.1 Root‑Cause Analysis

Automated root‑cause analysis builds expert knowledge into rule engines that recursively trace exceptions and SOA failures to the originating service.
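One way to picture the recursive step is a walk over failed calls until it reaches services that fail without any failing dependency of their own. The sketch below assumes the failures have already been extracted into a simple mapping; the function name find_root_causes, the failed_calls structure, and the service names are all hypothetical, and a real rule engine would apply far richer rules than this.

```python
def find_root_causes(service, failed_calls, visited=None):
    """Follow failed downstream calls recursively.

    failed_calls maps a service to the downstream services whose calls failed.
    A service with no failed downstream calls is treated as an originating service.
    """
    if visited is None:
        visited = set()
    if service in visited:
        # Already explored (diamond or cycle in the call graph); nothing new to add.
        return set()
    visited.add(service)

    downstream = failed_calls.get(service, [])
    if not downstream:
        return {service}

    roots = set()
    for callee in downstream:
        roots |= find_root_causes(callee, failed_calls, visited)
    return roots


# Hypothetical incident: order-api fails because two of its dependencies fail.
failed_calls = {
    "order-api": ["pricing", "dispatch"],
    "pricing": ["pricing-db"],
}
roots = find_root_causes("order-api", failed_calls)
```

Here the walk reports pricing-db and dispatch as originating services, since neither has a failing dependency beneath it.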

8.2 Intelligent Alerts & Playbooks

Smart alerting combines metric, cloud, and business signals to trigger actions such as scaling recommendations or automated remediation, with ongoing work to expand scenario coverage and automation.
