Huawei’s Triple‑Play Model: Advancing AIOps for Massive K8s and Serverless
Summary: At the 9th Global Operations Conference, Huawei Cloud’s chief architect Cai Xiaogang presented Huawei’s AIOps work, which rests on a three‑way collaboration between industry, academia, and research. The talk covered large‑scale Kubernetes management, causal tracing in Serverless environments, multi‑source root‑cause analysis (RCA), and clustering‑based black‑box network packet analysis.
At the 9th Global Operations Conference (GOPS), Huawei Cloud’s chief architect Cai Xiaogang delivered a talk titled “Huawei’s Three‑in‑One Exploration of AIOps Key Technologies”. He described how a triple model of industry, academia, and research drives the exploration of cloud‑management platform technologies, covering large‑scale Kubernetes container cluster control, causal sequence tracing in Serverless environments, multi‑source root‑cause analysis, and clustering‑based black‑box network packet analysis.
Large‑scale K8s Container Cluster Management
Huawei has successfully validated control of millions of containers in a test environment. Heterogeneous compute resources, network virtualization, diverse cluster types, and rapid scaling, combined with varied customer application stacks (micro‑services, Serverless, foundational services), create significant complexity. To meet both platform‑level and customer‑application operational demands, Huawei designed an Inventory modeling approach that provides CMDB (configuration management database) and OSLC (Open Services for Lifecycle Collaboration) capabilities. The model maps infrastructure to applications and enables correlation across resources and across layers.
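The infrastructure‑to‑application mapping described above can be pictured as a small dependency graph. The following is only a minimal sketch under stated assumptions: the `Inventory` class, its method names, and the resource names are hypothetical and are not taken from Huawei’s Inventory model.

```python
from collections import deque


class Inventory:
    """Hypothetical inventory: resources plus 'runs on' relations between layers."""

    def __init__(self):
        self.kind = {}        # resource name -> kind (host, container, application, ...)
        self.hosted = {}      # lower-layer resource -> set of resources running on it

    def add(self, name, kind):
        self.kind[name] = kind

    def runs_on(self, upper, lower):
        self.hosted.setdefault(lower, set()).add(upper)

    def impacted(self, resource, kind):
        """Walk upward from a resource and collect everything of the given kind."""
        seen, queue, hits = {resource}, deque([resource]), set()
        while queue:
            node = queue.popleft()
            for upper in self.hosted.get(node, ()):
                if upper not in seen:
                    seen.add(upper)
                    queue.append(upper)
                    if self.kind[upper] == kind:
                        hits.add(upper)
        return sorted(hits)


# Illustrative data only (all names are made up).
inv = Inventory()
inv.add("node-1", "host")
inv.add("pod-a", "container")
inv.add("pod-b", "container")
inv.add("checkout", "application")
inv.runs_on("pod-a", "node-1")
inv.runs_on("pod-b", "node-1")
inv.runs_on("checkout", "pod-a")

affected_apps = inv.impacted("node-1", "application")
print(affected_apps)
```

A query like this is what cross‑layer correlation amounts to in practice: given a failing host, find the applications that sit above it.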
Auto‑Scaling with Machine Learning
Beyond predefined scaling rules, Huawei Cloud’s operation services implement machine‑learning‑driven Auto‑Scaling, which allocates resources more intelligently for large‑scale applications and reduces customer costs.
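To illustrate the difference between rule‑based and prediction‑driven scaling, here is a minimal sketch that forecasts the next interval’s load and sizes the replica count ahead of time. It assumes a plain least‑squares linear trend as a stand‑in for the learned model; the article does not say which model Huawei uses, and the function names and the per‑replica capacity figure are hypothetical.

```python
import math


def forecast_next(history):
    """Extrapolate the next value from a least-squares line through the history."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx if sxx else 0.0
    return mean_y + slope * (n - mean_x)


def desired_replicas(history, per_replica_rps, min_replicas=1, max_replicas=100):
    """Size the deployment for the predicted load, clamped to configured bounds."""
    predicted = max(0.0, forecast_next(history))
    needed = math.ceil(predicted / per_replica_rps)
    return max(min_replicas, min(max_replicas, needed))


# Requests per second over the last four intervals (illustrative numbers).
recent_rps = [100, 200, 300, 400]
predicted_rps = forecast_next(recent_rps)
replicas = desired_replicas(recent_rps, per_replica_rps=120)
print(predicted_rps, replicas)
```

A reactive rule would scale only after the 400 rps reading crossed a threshold; the predictive version provisions for the expected next reading instead.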
Application Operation Management (AOM) and Application Performance Management (APM)
Huawei Cloud offers two major operation services, AOM and APM, that deliver end‑to‑end performance insight for complex cloud applications. Innovations include intelligent Auto‑Scaling, Serverless call tracing, AI‑based anomaly detection, RCA, and clustering‑based black‑box analysis, all of which strengthen AIOps capabilities. Integrated with the Cloud Performance Testing Service (CPTS) and big‑data intelligent analysis, these services provide out‑of‑the‑box data collection, online perception, anomaly alerts, topology mapping, and call‑chain analysis, addressing performance degradation in massive cloud deployments.
Causal Sequence Tracing in Serverless Environments
Serverless abstracts away infrastructure, so it requires new performance‑tracking mechanisms. Huawei collaborated with a professor from the University of California to extend the logging system Chariots into a Go‑based version named GoChariots, which places log entries in causal order before recording them. This non‑intrusive approach supports cross‑cloud causal tracing, operates in replication mode to reduce communication overhead, and uses an HTTP‑POST SDK that is independent of the function’s programming language. Additionally, Huawei developed GammaRay for AWS Lambda, which extends AWS X‑Ray with causal‑order tracking based on the open‑source AWS Instrument SDK for Python.
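The idea of ordering log entries causally before recording them can be sketched as follows. This is not the GoChariots API: the `CausalLog` class, its `submit` method, and the event identifiers are hypothetical, and the sketch assumes each event arrives with the identifiers of the events it causally depends on.

```python
class CausalLog:
    """Buffers events until every event they depend on has been recorded."""

    def __init__(self):
        self.records = []      # (event_id, payload) in a causally consistent order
        self.recorded = set()  # ids already written to the log
        self.pending = []      # events still waiting for a dependency

    def submit(self, event_id, deps, payload):
        self.pending.append((event_id, frozenset(deps), payload))
        self._drain()

    def _drain(self):
        # Keep sweeping the buffer: recording one event may unblock others.
        progressed = True
        while progressed:
            progressed = False
            for item in list(self.pending):
                event_id, deps, payload = item
                if deps <= self.recorded:
                    self.records.append((event_id, payload))
                    self.recorded.add(event_id)
                    self.pending.remove(item)
                    progressed = True


# Events arrive out of order, as they might from functions in different clouds.
log = CausalLog()
log.submit("resize-image", deps={"upload"}, payload="function B invoked")
log.submit("notify-user", deps={"resize-image"}, payload="function C invoked")
log.submit("upload", deps=set(), payload="function A invoked")

ordered_ids = [event_id for event_id, _ in log.records]
print(ordered_ids)
```

Because the dependency information travels with the event, a client only needs to send it in a request body, which is consistent with a language‑independent HTTP‑POST interface.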
Multi‑Source Data Root Cause Analysis (RCA)
Root‑cause analysis remains a critical challenge in complex systems. Huawei Cloud employs dynamic thresholds derived from time‑series analysis (e.g., ARIMA) for anomaly detection, and leverages APM topology and transaction analysis to pinpoint performance bottlenecks. In collaboration with European and U.S. universities, Huawei applies machine‑learning models such as Hidden Markov Models to call‑chain data, integrating Inventory, topology, and call‑chain information to map event dependencies and reveal fault propagation chains. Ongoing research explores unsupervised ML for real‑time stream correlation and alerting.
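A dynamic threshold adapts to recent behaviour instead of using a fixed limit. The sketch below is a deliberately simpler stand‑in for the ARIMA‑based approach mentioned above: it derives the band from a rolling mean and standard deviation rather than from a fitted time‑series model. The function name, window size, and multiplier `k` are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def detect_anomalies(series, window=5, k=3.0):
    """Flag indices whose value leaves the band mean +/- k * stdev of the prior window."""
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu, sigma = mean(past), stdev(past)
        lower, upper = mu - k * sigma, mu + k * sigma
        if not lower <= series[i] <= upper:
            anomalies.append(i)
    return anomalies


# Response times in milliseconds (illustrative numbers) with one spike.
latency_ms = [10, 11, 9, 10, 12, 10, 11, 50, 10, 11]
anomalous_indices = detect_anomalies(latency_ms)
print(anomalous_indices)
```

Note that the spike widens the band for the next few points, which is one reason a model‑based forecast such as ARIMA can be preferable to a raw rolling window.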
Clustering‑Based Blackbox Network Packet Analysis
Beyond intrusive tracing, Huawei developed a non‑intrusive data‑collection tool called vProbe, which passively listens to major application protocols and captures performance data without exposing business or privacy information (with anonymization when needed). Using hierarchical clustering, vProbe infers causal paths between services. It achieves 90‑95% accuracy, comparable to white‑box methods, which is sufficient for overall performance awareness, bottleneck identification, and timely alerting.
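The article does not describe which features vProbe clusters, so the following is only a generic sketch of agglomerative (single‑linkage) hierarchical clustering, applied to a hypothetical feature: the delay between an inbound request at a service and each outbound request observed shortly afterwards. The function name, the delay values, and the `max_gap` cut‑off are all assumptions.

```python
def single_linkage(points, max_gap):
    """Agglomerative clustering of 1-D values: repeatedly merge the two closest
    clusters until the smallest remaining gap exceeds max_gap."""
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > 1:
        # In sorted 1-D data the closest pair of clusters is always adjacent.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Hypothetical inbound-to-outbound delays in milliseconds seen on the wire.
delays_ms = [2.1, 2.3, 1.9, 40.0, 41.5, 2.0, 39.2]
groups = single_linkage(delays_ms, max_gap=5.0)
print(groups)
```

Under this assumption, a tight group of short delays would suggest that the outbound call is consistently triggered by the inbound request, while the second group points to a different, slower path. Both are inferences from timing alone, which is why black‑box accuracy falls somewhat short of instrumented tracing.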
Huawei believes that traditional manual operations are no longer viable; DevOps, AIOps, and NoOps represent the inevitable evolution of cloud computing and its operational practices.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.