How HBase Boosted Tencent Monitoring Platform Performance 3‑5×
Facing the challenge of storing over 120 billion daily monitoring points from hundreds of thousands of servers, Tencent’s monitoring platform migrated from a custom solution and OpenTSDB to a finely tuned HBase architecture, achieving 3‑5× higher throughput, improved reliability, and significant storage savings.
Introduction
The company operates hundreds of thousands of servers, and the Tencent Monitoring Platform (TMP) collects more than 1.2 trillion monitoring data points per day. This article examines the problems with the existing storage architecture and describes how HBase is used to store TMP monitoring data.
Background
Open‑source big‑data processing systems have matured and now offer standard solutions for many scenarios, much as MySQL does for relational data. TMP gathers minute‑level data from a massive server fleet; the team first tried OpenTSDB and then designed a custom storage scheme on HBase.
Analysis of TMP Current Storage Architecture
The current architecture routes data from agents through collectors, stores it in memory caches, and periodically dumps it to the file system. The design is simple, horizontally scalable, and fully self‑developed, but it is vulnerable to cache failures and to disk or machine failures, uses a fixed data format with no compression, and depends on external metadata services.
Advantages of HBase Storage Engine
HBase, a distributed column‑store based on the Bigtable model and LSM‑Tree engine, is widely used for massive time‑series data. Its advantages include high reliability and availability, high write performance, natural horizontal scalability, and column compression that eliminates empty columns.
OpenTSDB Attempt and Bottleneck Analysis
When testing OpenTSDB on HBase, the cluster became overloaded at about 700 k writes per second. Bottlenecks included heavy UID translation, inefficient append and compaction mechanisms, and a single table design that hindered time‑based maintenance and region management.
TMP Monitoring Storage Design Practice
Region Pre‑Splitting
To avoid hotspot regions, tables are pre‑split into 100 regions using split keys 0x01‑0x63, distributing data evenly across RegionServers.
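The split keys described above can be sketched in a few lines. This is a minimal illustration, not TMP's actual code: the helper names build_split_keys and region_index are hypothetical, and the sketch only shows why 99 one‑byte keys (0x01 through 0x63) yield 100 regions.

```python
import bisect
from typing import List


def build_split_keys(num_regions: int = 100) -> List[bytes]:
    """Return the one-byte split keys that cut the key space into num_regions.

    For 100 regions this is 0x01..0x63 (99 keys), as in the article.
    """
    if not 2 <= num_regions <= 256:
        raise ValueError("num_regions must fit in a one-byte salt")
    return [bytes([i]) for i in range(1, num_regions)]


def region_index(rowkey: bytes, split_keys: List[bytes]) -> int:
    """Index of the region whose key range contains rowkey.

    Region i covers [split_keys[i-1], split_keys[i]); region 0 starts at
    the empty key and the last region is unbounded above.
    """
    return bisect.bisect_right(split_keys, rowkey)


if __name__ == "__main__":
    keys = build_split_keys()
    print(len(keys), keys[0].hex(), keys[-1].hex())
```

In a real deployment these keys would be passed to the table‑creation call of whichever HBase client is in use; that step is omitted here because the source does not say which client TMP uses.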
Rowkey and Column Design
Rowkey consists of a 1‑byte salt, 4‑byte server ID, 4‑byte timestamp, and 4‑byte metric ID, ensuring uniform distribution and efficient queries. A single column family is used to reduce Memstore overhead, and qualifiers store time offsets.
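A possible encoding of this 13‑byte rowkey is sketched below. The field widths and order come from the text; everything else is an assumption made for illustration: the salt is derived from a CRC32 of the server ID modulo 100, the time base is one hour, and the qualifier is a 2‑byte second offset within that hour.

```python
import struct
import zlib

NUM_SALTS = 100           # assumption: one salt value per pre-split region
TIME_BASE_SECONDS = 3600  # assumption: one row per server and metric per hour


def make_rowkey(server_id: int, metric_id: int, ts: int) -> bytes:
    """Pack salt (1 B) + server ID (4 B) + time base (4 B) + metric ID (4 B)."""
    # Assumed salting scheme: a stable hash of the server ID spreads servers
    # across regions while keeping one server's rows in the same region.
    salt = zlib.crc32(struct.pack(">I", server_id)) % NUM_SALTS
    time_base = ts - ts % TIME_BASE_SECONDS
    return struct.pack(">BIII", salt, server_id, time_base, metric_id)


def make_qualifier(ts: int) -> bytes:
    """Column qualifier: offset in seconds from the row's time base (assumed 2 B)."""
    return struct.pack(">H", ts % TIME_BASE_SECONDS)


if __name__ == "__main__":
    key = make_rowkey(server_id=42, metric_id=7, ts=1_700_000_123)
    print(key.hex(), make_qualifier(1_700_000_123).hex())
```

Big‑endian packing keeps the byte order of rowkeys consistent with numeric order, so a scan over one server and a time range maps to a contiguous key range within a single salt bucket.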
Column‑Based Compaction
HBase stores every cell as its own key‑value entry, so each column repeats the rowkey and column family information, which bloats storage. Inspired by OpenTSDB, the design merges all columns that share the same time base into a single column, reducing storage by about 90%.
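The merge step can be illustrated as follows. This is a sketch under stated assumptions rather than TMP's implementation: qualifiers are assumed to be fixed 2‑byte offsets and values fixed 4‑byte numbers, and the merged cell simply concatenates them in offset order, similar in spirit to OpenTSDB's row compaction. The function names are hypothetical.

```python
import struct
from typing import Dict, Tuple

QUALIFIER_WIDTH = 2  # assumption: 2-byte time offset
VALUE_WIDTH = 4      # assumption: 4-byte value


def merge_columns(cells: Dict[bytes, bytes]) -> Tuple[bytes, bytes]:
    """Merge all cells of one row into a single (qualifier, value) cell."""
    for q, v in cells.items():
        if len(q) != QUALIFIER_WIDTH or len(v) != VALUE_WIDTH:
            raise ValueError("unexpected qualifier or value width")
    items = sorted(cells.items())  # byte order equals offset order (big-endian)
    merged_qualifier = b"".join(q for q, _ in items)
    merged_value = b"".join(v for _, v in items)
    return merged_qualifier, merged_value


def split_columns(qualifier: bytes, value: bytes) -> Dict[bytes, bytes]:
    """Inverse of merge_columns: recover the individual cells."""
    count = len(qualifier) // QUALIFIER_WIDTH
    return {
        qualifier[i * QUALIFIER_WIDTH:(i + 1) * QUALIFIER_WIDTH]:
            value[i * VALUE_WIDTH:(i + 1) * VALUE_WIDTH]
        for i in range(count)
    }


if __name__ == "__main__":
    row = {struct.pack(">H", s): struct.pack(">f", 0.5) for s in range(0, 3600, 60)}
    q, v = merge_columns(row)
    print(len(row), "cells ->", len(q) + len(v), "payload bytes in one cell")
```

After merging, the rowkey and column family are stored once per row instead of once per data point, which is where the saving comes from.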
HBase Performance Tuning
Key tuning points include increasing the RegionServer heap and Memstore sizes, enabling Snappy compression, raising the compaction thread counts (for example, setting both hbase.regionserver.thread.compaction.small and hbase.regionserver.thread.compaction.large to 5), and configuring GC carefully to avoid long stop‑the‑world pauses.
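As a small illustration, the two compaction thread settings named above can be rendered as hbase-site.xml property entries. The helper render_hbase_site is hypothetical, and only the values stated in the text are included; heap size, Memstore limits, Snappy compression, and GC flags are left out because the source gives no concrete values for them.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# The only concrete settings given in the article.
COMPACTION_THREADS: Dict[str, int] = {
    "hbase.regionserver.thread.compaction.small": 5,
    "hbase.regionserver.thread.compaction.large": 5,
}


def render_hbase_site(props: Dict[str, int]) -> str:
    """Render properties in the <configuration>/<property> layout of hbase-site.xml."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_hbase_site(COMPACTION_THREADS))
```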
Conclusion
After the redesign, TMP’s monitoring storage achieves 3‑5× higher performance, with peak write rates of 4 million rows per second on eight RegionServers, far surpassing OpenTSDB’s 700 k limit. The system is now in production, and future work includes adding a buffering layer for pre‑compaction to further boost performance.
21CTO
