How Huawei Cloud SRE Scaled Monitoring with openGemini: A Real‑World Performance Case Study
Facing hundreds of terabytes of monitoring data per day, Huawei Cloud SRE replaced HBase with the open‑source time‑series database openGemini. Extensive write and query tests showed linear write scaling and faster queries than the alternatives evaluated, and the production migration brought significant reductions in storage, CPU, and memory usage.
Background
IT operations date back to the early days of information technology, when systems were built on centralized, siloed architectures. Operations were largely manual and focused on individual devices or applications, so the volume of monitoring data stayed small.
Challenge in the Cloud Era
With cloud computing, monitoring data exploded to hundreds of terabytes per day, making data processing a major difficulty for Huawei Cloud SRE.
Huawei Cloud SRE Monitoring System
The platform monitors resources across global regions, including network, storage, compute, security, and various cloud services.
Limitations of HBase
High‑level aggregation queries are slow over large time ranges, to the point that charts fail to render.
No compression scheme designed for time‑series data. HBase offers only general‑purpose block compression codecs, so storing the massive daily volume is expensive.
Deployment depends on the third‑party components HDFS and ZooKeeper, which increases operational overhead.
Time‑Series Database Evaluation
We screened open‑source time‑series databases including InfluxDB, OpenTSDB, Prometheus, and Druid. The open‑source edition of InfluxDB runs on a single node only, OpenTSDB still relies on HBase, and Prometheus was not suited to data at this scale. Druid met most requirements but showed higher latency for spatio‑temporal queries. openGemini stood out for its compression efficiency and read/write performance.
Performance Testing
Write Performance
As node size increased from 4U to 32U, write throughput grew linearly from 1.55 M metrics/s to 5.6 M metrics/s.
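As a quick check on these figures, the growth in throughput can be compared with the growth in resources. The sketch below assumes that "4U" and "32U" denote node sizes of 4 and 32 CPU cores, which the text does not state explicitly. Under that assumption the two endpoints correspond to about 3.6 times the throughput for 8 times the cores, which fits a linear trend with a fixed offset, not strict proportionality.

```python
# Figures are taken from the text; reading "U" as CPU cores is an assumption.
small_cores, large_cores = 4, 32
small_tput, large_tput = 1.55e6, 5.6e6  # metrics per second

resource_ratio = large_cores / small_cores               # 8.0
throughput_ratio = large_tput / small_tput               # ~3.61
scaling_efficiency = throughput_ratio / resource_ratio   # ~0.45

# Average throughput gained per added core between the two endpoints.
marginal_per_core = (large_tput - small_tput) / (large_cores - small_cores)

print(f"resources x{resource_ratio:.1f}, throughput x{throughput_ratio:.2f}")
print(f"efficiency {scaling_efficiency:.2f}, "
      f"+{marginal_per_core:,.0f} metrics/s per added core")
```

With only the two endpoints available, the per-core gain is an average; the intermediate test points would be needed to confirm that the trend is a straight line.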
Query Performance
Using JMeter, we tested three common query types (exact, time‑aggregation, spatio‑temporal) with varying concurrency, time ranges, and aggregation operators.
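To make the three query categories concrete, the sketch below builds one illustrative query of each kind in InfluxQL, which openGemini accepts, and then enumerates a small test matrix. The measurement `cpu_usage`, the tags `host` and `region`, the field `value`, the concurrency levels, and the reading of "spatio‑temporal" as aggregation across both a tag dimension and time buckets are all assumptions for illustration. The queries actually used in the tests are not given in the text.

```python
from itertools import product

def exact_query(host: str, hours: int) -> str:
    """Raw points of a single series over a time range."""
    return (f"SELECT value FROM cpu_usage "
            f"WHERE host = '{host}' AND time > now() - {hours}h")

def time_aggregation_query(host: str, hours: int, op: str, bucket: str) -> str:
    """One series, aggregated into fixed time buckets."""
    return (f"SELECT {op}(value) FROM cpu_usage "
            f"WHERE host = '{host}' AND time > now() - {hours}h "
            f"GROUP BY time({bucket})")

def spatio_temporal_query(region: str, hours: int, op: str, bucket: str) -> str:
    """Many series, aggregated per time bucket and per host tag."""
    return (f"SELECT {op}(value) FROM cpu_usage "
            f"WHERE region = '{region}' AND time > now() - {hours}h "
            f"GROUP BY time({bucket}), host")

# Hypothetical test dimensions: concurrency, time range, aggregation operator.
CONCURRENCY = [1, 10, 50]
HOURS = [1, 24]
OPERATORS = ["mean", "max"]

test_matrix = [
    (threads, hours, op, time_aggregation_query("host-001", hours, op, "1m"))
    for threads, hours, op in product(CONCURRENCY, HOURS, OPERATORS)
]

for threads, hours, op, query in test_matrix[:2]:
    print(threads, hours, op, query)
```

Each tuple in `test_matrix` describes one load-test case: how many concurrent clients to run and which query they send.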
Figure: Exact query performance
Figure: Time‑aggregation query performance
Figure: Spatio‑temporal query performance
Test Conclusions
Overall, openGemini outperforms Druid in all three query scenarios, meets the write throughput of comparable HBase clusters with far fewer nodes, requires no third‑party components, and provides rich monitoring metrics for faster issue diagnosis.
Migration Strategy
Dual Write
During the transition, every write goes to both HBase and openGemini. This allows a rapid fallback to HBase if problems arise, and it avoids the high cost of building a dedicated migration tool.
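The dual‑write arrangement can be sketched as a thin wrapper around two writers. This is a minimal illustration, not Huawei Cloud's implementation: the `DualWriter` class, the writer callables, and the choice to treat HBase as the authoritative store while only logging openGemini failures are assumptions.

```python
import logging
from typing import Callable

# A writer takes one batch of encoded points (hypothetical interface).
Writer = Callable[[list[str]], None]

class DualWriter:
    """Send every batch to the existing store, then mirror it to the new one."""

    def __init__(self, primary: Writer, secondary: Writer) -> None:
        self.primary = primary
        self.secondary = secondary
        self.secondary_failures = 0

    def write(self, batch: list[str]) -> None:
        # A failure here propagates: the primary stays the source of truth.
        self.primary(batch)
        try:
            self.secondary(batch)
        except Exception:
            # A failing secondary must not interrupt production ingestion.
            self.secondary_failures += 1
            logging.exception("secondary write failed; batch is in primary only")

# Usage with in-memory stand-ins for the two stores.
hbase_batches: list[list[str]] = []
gemini_batches: list[list[str]] = []
writer = DualWriter(hbase_batches.append, gemini_batches.append)
writer.write(["cpu_usage,host=host-001 value=0.42 1700000000000000000"])
```

Counting secondary failures gives a simple signal for how far the two stores have diverged before queries are switched over.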
Query Switching
openGemini and HBase are exposed under separate DNS entries, so query traffic can be switched from one backend to the other without affecting production reliability.
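One way to picture the switch: if query clients pick their backend from a single setting that names a DNS entry, moving traffic is a one‑value change and rolling back is the same change reversed. The hostnames and the `ACTIVE_BACKEND` variable below are hypothetical, and the text does not say how the switch was actually configured.

```python
import os
from typing import Mapping

# Hypothetical DNS names, one per backend.
QUERY_ENDPOINTS = {
    "hbase": "hbase-query.example.internal",
    "opengemini": "opengemini-query.example.internal",
}

def active_query_host(env: Mapping[str, str] = os.environ) -> str:
    """Return the DNS name that queries should be sent to."""
    backend = env.get("ACTIVE_BACKEND", "hbase")  # default to the old store
    try:
        return QUERY_ENDPOINTS[backend]
    except KeyError:
        raise ValueError(f"unknown query backend: {backend!r}") from None

print(active_query_host({"ACTIVE_BACKEND": "opengemini"}))
```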
Actual Impact
Since the migration, all global traffic has been routed to openGemini, which has run stably for over six months. Compared with HBase:
Cluster size is down by more than 60% (from hundreds of nodes to dozens).
Write throughput reaches 181 M points/s, while storage use is down by more than 90%, CPU by 68%, and memory by 50%.
Query performance dramatically improved.
Conclusion
Designed for time‑series data, openGemini effectively handles massive monitoring workloads, lowering operational costs and improving efficiency in cloud operations.
Huawei Cloud Developer Alliance
The Huawei Cloud Developer Alliance creates a tech sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.