How Intelligent Operations and Observability Transform Cloud‑Native Environments
In this talk, Wu Yakun from Guance Cloud explains the shortcomings of traditional operations, introduces intelligent, data‑driven approaches for the cloud‑native era, and outlines how unified data collection, observability, and SLO‑based monitoring can dramatically improve fault detection and system reliability.
Background
On August 19‑20, 2022, the GOPS Global Operations Conference was held in Shenzhen, where Guance Cloud’s chief evangelist Wu Yakun presented the topic “Intelligent Operations and Observability in the Cloud Era.”
Traditional Operations Challenges
Traditional operations relied on simple three‑tier architectures and manual “three‑hammer” fixes: restart, reinstall, and replace hardware. As workloads moved to virtualization, cloud, and micro‑services, these methods became insufficient because service call chains grew longer and root causes became harder to identify.
Intelligent Operations in the Cloud Era
In the cloud‑native era, massive data volumes require new strategies to monitor every layer and quickly locate issues. The primary goal of monitoring is to ensure business stability by enabling rapid fault detection and recovery.
Unified Data Collection and Analysis
Effective analysis of massive data demands unified storage, correlation, and analysis. Isolated monitoring systems fragment the data, so faults take longer to diagnose and the overall fault rate stays high. By consolidating data, teams can trace a problem through its entire call chain, identify the responsible component, and resolve issues efficiently.
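As a minimal sketch of what correlation across unified data can look like, the Python below joins trace spans and log lines on a shared trace ID and reports which service in a call chain failed. The record layout and field names (`trace_id`, `service`, `error`, `level`) are assumptions made for illustration, not the schema of any particular platform.

```python
from collections import defaultdict

# Hypothetical sample records; every field name here is an illustrative assumption.
spans = [
    {"trace_id": "t1", "service": "gateway", "duration_ms": 950, "error": False},
    {"trace_id": "t1", "service": "orders", "duration_ms": 900, "error": False},
    {"trace_id": "t1", "service": "inventory", "duration_ms": 870, "error": True},
    {"trace_id": "t2", "service": "gateway", "duration_ms": 40, "error": False},
]
logs = [
    {"trace_id": "t1", "service": "inventory", "level": "ERROR",
     "message": "db connection timeout"},
    {"trace_id": "t2", "service": "gateway", "level": "INFO", "message": "ok"},
]


def correlate(spans, logs):
    """Group spans and logs that share a trace_id into one view per request."""
    by_trace = defaultdict(lambda: {"spans": [], "logs": []})
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    for log in logs:
        by_trace[log["trace_id"]]["logs"].append(log)
    return dict(by_trace)


def failing_components(view):
    """Map each service with an error span to its ERROR log messages."""
    result = {}
    for span in view["spans"]:
        if span["error"]:
            result[span["service"]] = [
                log["message"]
                for log in view["logs"]
                if log["service"] == span["service"] and log["level"] == "ERROR"
            ]
    return result


unified = correlate(spans, logs)
print(failing_components(unified["t1"]))  # {'inventory': ['db connection timeout']}
```

With the spans and logs kept in separate systems, the same answer would require a manual lookup in each one; a shared key is what makes the join a single step.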
Observability Directions
Observability encompasses monitoring, AIOps, and deeper analysis. It involves collecting low‑density metrics (CPU, memory) alongside high‑density signals (traces, logs) and using them to drive actions. Unified data enables SLO‑driven dashboards that automatically flag violations and support KPI assessment.
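To illustrate how an SLO‑driven dashboard might flag violations, the sketch below computes a service level indicator (SLI) as the ratio of good events to total events and compares it with a target. The `Slo` class, the service names, and the event counts are hypothetical; a real platform would derive them from collected metrics.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """One SLO row; all values are hypothetical sample data."""
    service: str
    target: float        # required fraction of good events, e.g. 0.999
    good_events: int
    total_events: int

    @property
    def sli(self) -> float:
        # With no traffic there is nothing to violate, so report 1.0.
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events


def dashboard_rows(slos):
    """Return (service, sli, target, status) tuples, flagging violations."""
    rows = []
    for slo in slos:
        status = "OK" if slo.sli >= slo.target else "VIOLATED"
        rows.append((slo.service, round(slo.sli, 5), slo.target, status))
    return rows


slos = [
    Slo("checkout", target=0.999, good_events=99_950, total_events=100_000),
    Slo("search", target=0.999, good_events=99_800, total_events=100_000),
]
for row in dashboard_rows(slos):
    print(row)
```

Here `checkout` meets its target and `search` is flagged, which is the kind of signal a team could feed into alerting or a KPI review.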
Open‑Source Tools: DataKit and DataFlux Func
Guance Cloud open‑sourced the DataKit collector, which gathers data from diverse sources and forwards it to a backend platform. DataFlux Func provides a programmable open‑source sandbox for integrating cloud billing, sensor data, and custom scripts, facilitating extensive data enrichment.
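The enrichment idea can be sketched in plain Python: metric points are joined with billing data so that each point carries the cost of the host it came from. This is not the DataKit or DataFlux Func API; the function name, field names, and prices are assumptions chosen only to show the shape of such a script.

```python
def enrich_with_cost(metric_points, hourly_cost_by_host):
    """Attach an hourly cost to each metric point, or None if the host is unknown.

    Both inputs are hypothetical: metric_points is a list of dicts with a
    "host" key, and hourly_cost_by_host maps host names to a price.
    """
    enriched = []
    for point in metric_points:
        cost = hourly_cost_by_host.get(point["host"])
        # Copy the point so the original record is left unchanged.
        enriched.append({**point, "hourly_cost_usd": cost})
    return enriched


metric_points = [
    {"host": "web-1", "cpu_percent": 72.0},
    {"host": "web-2", "cpu_percent": 4.5},
    {"host": "batch-1", "cpu_percent": 38.0},
]
hourly_cost_by_host = {"web-1": 0.19, "web-2": 0.19}

for point in enrich_with_cost(metric_points, hourly_cost_by_host):
    print(point)
```

Once cost sits next to utilization in the same record, a question such as "which hosts are expensive but idle" becomes a simple filter over one dataset.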
SLO‑Based Monitoring
By packaging metrics into Service Level Objectives (SLOs) such as 99.9% availability, the platform can automatically deduct from the SLO's error budget whenever a violation occurs, offering clear visibility into system health and supporting performance‑based assessments.
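The arithmetic behind an error budget is simple enough to sketch. Assuming a 30‑day window (the window length is an assumption, not something stated in the talk), a 99.9% availability target allows about 43.2 minutes of downtime, and each incident is deducted from that allowance. The function names below are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime, in minutes, for the given target and window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def remaining_budget_minutes(slo_target, window_days, incident_minutes):
    """Deduct each incident's downtime from the budget; negative means exceeded."""
    return error_budget_minutes(slo_target, window_days) - sum(incident_minutes)


budget = error_budget_minutes(0.999, 30)
print(f"budget: {budget:.1f} min")          # budget: 43.2 min

left = remaining_budget_minutes(0.999, 30, [10.0, 20.5])
print(f"remaining: {left:.1f} min")         # remaining: 12.7 min

over = remaining_budget_minutes(0.999, 30, [30.0, 20.0])
print(f"exceeded: {over < 0}")              # exceeded: True
```

Tracking the remaining budget rather than raw uptime gives teams a single number to reason about when deciding whether a release or a risky change is affordable.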
Conclusion
The talk emphasizes that unified data collection, correlation, and analysis are essential for intelligent operations and observability, enabling faster fault isolation, cost optimization, and better alignment between development and operations.