How Ctrip Replaced HBase with VictoriaMetrics & ClickHouse for Scalable Metrics Monitoring
Ctrip’s internal Dashboard monitoring platform, originally built on HBase, was redesigned by migrating its core writer and storage components to a hybrid VictoriaMetrics‑ClickHouse solution, delivering faster queries, higher write stability, and full Prometheus compatibility while keeping the user experience unchanged.
Dashboard is a long‑standing, internally developed Ctrip monitoring product targeting enterprise‑level metrics use cases. It originally stored time‑series data in HBase, implementing a custom TSDB layer on top of HBase to handle high‑volume writes (up to 600 million rows per minute) and provide real‑time analysis panels.
Background
When the system was first built, modern TSDB products such as Prometheus and InfluxDB were not yet mature, so HBase was chosen for its scalability. Over time, the All‑in‑One monitoring initiative raised new requirements for unified metrics storage and query APIs, exposing several shortcomings of the HBase‑based design.
Existing Architecture
The Dashboard system consists of six components:
dashboard‑engine – query engine
dashboard‑gateway – user‑facing query interface
dashboard‑writer – writes data to HBase
dashboard‑collector – Netty‑based metrics collector
dashboard‑agent – client SDK supporting sum, avg, max, min aggregations
dashboard‑HBase – HBase‑backed metrics store
Key features include minute‑level time‑series storage, multi‑tag support per metric, and flexible view rendering.
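The minute-level, multi-tag model and the agent-side sum/avg/max/min aggregations can be pictured with a small sketch. This is not the dashboard‑agent SDK; the class name MinuteAggregator, its methods, and the key layout are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: aggregates raw values per (metric, tags, minute).
final class MinuteAggregator {
    /** Running statistics for one series within one minute. */
    static final class Stats {
        long count;
        double sum;
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;

        void add(double v) {
            count++;
            sum += v;
            max = Math.max(max, v);
            min = Math.min(min, v);
        }

        double avg() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    private final Map<String, Stats> byKey = new HashMap<>();

    /** Truncates an epoch-millisecond timestamp to the start of its minute. */
    static long minuteOf(long epochMillis) {
        return epochMillis - Math.floorMod(epochMillis, 60_000L);
    }

    /** Builds a stable key from the metric name, the sorted tags, and the minute. */
    static String seriesKey(String metric, Map<String, String> tags, long epochMillis) {
        return metric + new TreeMap<>(tags) + "@" + minuteOf(epochMillis);
    }

    void record(String metric, Map<String, String> tags, long epochMillis, double value) {
        byKey.computeIfAbsent(seriesKey(metric, tags, epochMillis), k -> new Stats()).add(value);
    }

    /** Returns the statistics for the minute containing epochMillis, or null if none. */
    Stats get(String metric, Map<String, String> tags, long epochMillis) {
        return byKey.get(seriesKey(metric, tags, epochMillis));
    }
}
```

Sorting the tags before building the key means that the same tag set always maps to the same series, regardless of the order in which the caller supplied the tags.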
Problems with the HBase Solution
Slow TSDB‑specific queries compared with purpose‑built TSDBs.
HBase hotspot issues affecting write throughput.
Heavy operational overhead for the HBase stack.
Proprietary protocol incompatible with standard Prometheus, hindering integration with other monitoring tools.
Migration Challenges
Handling 600 M rows/minute write volume.
Governance of billions of high‑cardinality metrics and log‑type data.
Ensuring a transparent migration for existing users and programs.
Replacement Strategy
The migration focuses on two core components: dashboard‑writer and dashboard‑HBase. The other components remain unchanged, except that dashboard‑engine is updated to call the new query API, which keeps the change invisible to users.
1. Replace HBase with a VictoriaMetrics + ClickHouse hybrid store
VictoriaMetrics provides Prometheus‑compatible TSDB capabilities with superior query performance.
ClickHouse stores metadata (measurement lists, tag‑key/value mappings) and a small amount of log‑type data that does not fit well in VictoriaMetrics.
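To make the Prometheus compatibility of the VictoriaMetrics side concrete, the sketch below renders one tagged sample as a line of the Prometheus text exposition format, which VictoriaMetrics can ingest (for example through its /api/v1/import/prometheus endpoint). The article does not say which ingestion protocol Ctrip uses, so the choice of format here, the class name PromLine, and the name-sanitizing rule are all assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: formats one sample in the Prometheus text exposition format.
final class PromLine {
    /** Escapes a label value: backslash, double quote, and newline. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Replaces characters that are not allowed in a metric name with '_'. */
    static String sanitizeName(String name) {
        String s = name.replaceAll("[^a-zA-Z0-9_:]", "_");
        return (s.isEmpty() || Character.isDigit(s.charAt(0))) ? "_" + s : s;
    }

    /**
     * Renders: name{k1="v1",k2="v2"} value timestampMillis
     * Assumes the tag keys are already valid Prometheus label names.
     */
    static String format(String metric, Map<String, String> tags, double value, long tsMillis) {
        StringBuilder sb = new StringBuilder(sanitizeName(metric));
        if (!tags.isEmpty()) {
            sb.append('{');
            boolean first = true;
            // Sorted iteration keeps the output deterministic.
            for (Map.Entry<String, String> e : new TreeMap<>(tags).entrySet()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append(e.getKey()).append("=\"").append(escape(e.getValue())).append('"');
                first = false;
            }
            sb.append('}');
        }
        return sb.append(' ').append(value).append(' ').append(tsMillis).toString();
    }
}
```

A legacy measurement name containing dots, such as http.requests, would be rewritten to http_requests under this rule; a real migration would need an agreed, reversible naming convention instead.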
Local ClickHouse table schema:
CREATE TABLE hickwall.downsample_mtv (
`timestamp` DateTime,
`metricName` String,
`tagKey` String,
`tagValue` String,
`datasourceId` UInt8 DEFAULT 40
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/hickwall_cluster-{shard}/downsample_mtv', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, metricName, tagKey)
TTL timestamp + toIntervalDay(7)
SETTINGS index_granularity = 8192
Distributed table definition:
CREATE TABLE hickwall.downsample_mtv__dt (
`timestamp` DateTime,
`metricName` String,
`tagKey` String,
`tagValue` String,
`datasourceId` UInt8 DEFAULT 40
) ENGINE = Distributed(hickwall_cluster, hickwall, downsample_mtv, rand())
2. Upgrade dashboard‑writer to dashboard‑vmwriter
Dashboard‑collector streams raw data to Kafka; dashboard‑vmwriter consumes Kafka, performs processing, and writes to the new stores. Core functions added:
Metadata extraction (measurement, tagKey, tagValue) stored in ClickHouse and cached in Redis.
Pre‑aggregation of metrics based on configuration pushed from an internal config center, reducing query scan cost.
Data governance: automatic detection and dropping of ultra‑high‑cardinality tags using HyperLogLog and Redis set cardinality checks.
High‑performance write pipeline with one Kafka consumer thread, multiple processing threads, and multiple writer threads, using hash‑based bucket assignment per measurement.
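The data governance item relies on HyperLogLog to estimate how many distinct values a tag has without storing them all. The sketch below is a minimal, self-contained HyperLogLog written for illustration; it is not Ctrip's code. The class name TagCardinalitySketch, the register count, the hash function, and the exceeds() threshold check are assumptions. In production the same idea is available ready-made, for example through the Redis PFADD and PFCOUNT commands.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of HyperLogLog for per-tag cardinality estimation.
final class TagCardinalitySketch {
    private static final int P = 12;        // 2^12 = 4096 registers, about 1.6% standard error
    private static final int M = 1 << P;
    private final byte[] registers = new byte[M];

    /** 64-bit FNV-1a followed by a mixing finalizer to spread the bits. */
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Records one observed tag value. */
    void add(String tagValue) {
        long h = hash64(tagValue);
        int idx = (int) (h >>> (64 - P));   // top P bits select the register
        long rest = h << P;                 // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - P) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Estimated number of distinct values seen so far. */
    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double raw = alpha * M * M / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * M && zeros > 0) {
            return M * Math.log((double) M / zeros);
        }
        return raw;
    }

    /** True when the tag looks too high-cardinality to keep (limit is caller-defined). */
    boolean exceeds(long limit) {
        return estimate() > limit;
    }
}
```

One sketch of this size uses 4 KB per tracked tag, which is why an estimator is attractive when billions of series are involved; an exact set, such as a Redis set checked with SCARD, remains practical only for tags that are known to be small.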
Example hash function used for bucket selection:
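// Requires: import java.util.Arrays;
private int computeMetricNameHash(byte[] metricName) {
    int hash = Arrays.hashCode(metricName);
    // Math.abs(Integer.MIN_VALUE) is still negative, so map that one value to 0.
    hash = (hash == Integer.MIN_VALUE ? 0 : hash);
    return hash;
}

// At the call site: route each event to a bucket chosen by its measurement name.
byte[] metricName = metricEvent.getName();
int hash = computeMetricNameHash(metricName);
bBuckets[Math.abs(hash) % bucketCount].add(metricEvent);
Typical write latency stays under 1 second, often in the hundreds of milliseconds.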
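The single-consumer, multi-worker layout described above can be sketched as follows. This is a simplified model under stated assumptions, not the dashboard‑vmwriter code: the class BucketedPipeline is hypothetical, the processing and writer stages are collapsed into one single-threaded worker per bucket, and the "write" just appends to an in-memory list. It also uses Math.floorMod in place of the Math.abs guard, which is a substitution, not the original method.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: one dispatcher thread, one worker thread per bucket.
final class BucketedPipeline {
    private final ExecutorService[] workers;
    private final List<List<String>> written = new ArrayList<>();

    BucketedPipeline(int bucketCount) {
        workers = new ExecutorService[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
            written.add(new ArrayList<>());
        }
    }

    /** Maps a measurement name to a bucket; floorMod never returns a negative index. */
    static int bucketFor(String metricName, int bucketCount) {
        int hash = Arrays.hashCode(metricName.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, bucketCount);
    }

    /** Called from the single consumer thread for every event read from the queue. */
    void dispatch(String metricName) {
        int b = bucketFor(metricName, workers.length);
        // Only worker b ever touches list b, so no extra locking is needed.
        workers[b].execute(() -> written.get(b).add(metricName));
    }

    /** Stops the workers, waits for queued work to drain, and returns what each bucket wrote. */
    List<List<String>> shutdownAndGet() {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        try {
            for (ExecutorService w : workers) {
                w.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }
}
```

Because every event of a given measurement lands in the same bucket, per-measurement ordering is preserved and any per-measurement state (such as pre-aggregation buffers) can be kept thread-local. The trade-off is that a single very hot measurement cannot be spread across workers.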
3. Unified Metrics Query Layer
The new query service is compatible with the original Dashboard protocol and also supports standard Prometheus queries. It aggregates data from VictoriaMetrics (raw series) and ClickHouse (metadata, pre‑aggregated series). Four main APIs are exposed:
Data – returns time‑series data for a given measurement and tag set (source: VictoriaMetrics).
Measurement – lists available measurements (source: ClickHouse).
Measurement‑tagKey – lists tag keys for a measurement (source: ClickHouse).
Measurement‑tagKey‑tagValue – lists tag values for a measurement‑key pair (source: ClickHouse).
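The routing implied by the four APIs can be sketched as a small lookup. This is an illustrative assumption, not the actual query service: the class QueryRouter, its enums, and the SQL text are hypothetical, and the article does not confirm that the metadata APIs read the hickwall.downsample_mtv__dt table shown earlier. The column names are taken from that schema.

```java
// Hypothetical illustration: picks a backend per API and builds metadata SQL.
final class QueryRouter {
    enum Api { DATA, MEASUREMENT, MEASUREMENT_TAG_KEY, MEASUREMENT_TAG_KEY_TAG_VALUE }

    enum Backend { VICTORIA_METRICS, CLICKHOUSE }

    private static final String TABLE = "hickwall.downsample_mtv__dt"; // assumed source table

    /** Time-series data comes from VictoriaMetrics; everything else is metadata. */
    static Backend backendFor(Api api) {
        return api == Api.DATA ? Backend.VICTORIA_METRICS : Backend.CLICKHOUSE;
    }

    /**
     * Returns a parameterized query for a metadata API.
     * The '?' placeholders are meant for a JDBC PreparedStatement.
     */
    static String metadataSql(Api api) {
        switch (api) {
            case MEASUREMENT:
                return "SELECT DISTINCT metricName FROM " + TABLE;
            case MEASUREMENT_TAG_KEY:
                return "SELECT DISTINCT tagKey FROM " + TABLE + " WHERE metricName = ?";
            case MEASUREMENT_TAG_KEY_TAG_VALUE:
                return "SELECT DISTINCT tagValue FROM " + TABLE
                        + " WHERE metricName = ? AND tagKey = ?";
            default:
                throw new IllegalArgumentException(api + " is not served from ClickHouse");
        }
    }
}
```

Keeping the routing decision in one place is what lets the service expose both the original Dashboard protocol and Prometheus queries on top of the same two stores.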
Results
Query latency improved roughly fourfold, with most queries completing in 10‑50 ms, eliminating frequent HBase timeouts.
Write stability dramatically increased, fully resolving HBase hotspot‑induced back‑pressure.
New features such as PromQL‑based calculations, period‑over‑period comparisons, and fuzzy matching became available.
Future Plans
Extend the unified query layer to other internal metrics platforms (e.g., HickWall, Cat).
Develop a unified write layer to consolidate the many existing Kafka‑to‑store pipelines.
Apply the same storage‑unification principles to log storage systems.
The migration demonstrates a practical path for modernizing legacy TSDB infrastructures by combining a high‑performance Prometheus‑compatible engine with a columnar store for metadata and auxiliary data.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
