How Ctrip Replaced HBase with VictoriaMetrics & ClickHouse for Scalable Metrics Monitoring

Ctrip’s internal Dashboard monitoring platform, originally built on HBase, was redesigned by migrating its core writer and storage components to a hybrid VictoriaMetrics‑ClickHouse solution, delivering faster queries, higher write stability, and full Prometheus compatibility while keeping the user experience unchanged.

dbaplus Community

Dashboard is a long‑standing, internally developed Ctrip monitoring product targeting enterprise‑level metrics use cases. It originally stored time‑series data in HBase, implementing a custom TSDB layer on top of HBase to handle high‑volume writes (up to 600 million rows per minute) and provide real‑time analysis panels.

Background

When the system was first built, modern TSDB products such as Prometheus and InfluxDB were not yet mature, so HBase was chosen for its scalability. Over time, the All‑in‑One monitoring initiative raised new requirements for unified metrics storage and query APIs, exposing several shortcomings of the HBase‑based design.

Existing Architecture

The Dashboard system consists of six components:

dashboard‑engine – query engine

dashboard‑gateway – user‑facing query interface

dashboard‑writer – writes data to HBase

dashboard‑collector – Netty‑based metrics collector

dashboard‑agent – client SDK supporting sum, avg, max, min aggregations

dashboard‑HBase – HBase‑backed metrics store

Key features include minute‑level time‑series storage, multi‑tag support per metric, and flexible view rendering.
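To make the minute‑level model and the four client‑side aggregations concrete, here is a minimal sketch. The class and method names (MinuteAggregate, alignToMinute, aggregate) are hypothetical and not taken from the Dashboard SDK; the code only illustrates how samples that fall within the same minute could be folded into sum, avg, max, and min.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the actual dashboard-agent SDK.
class MinuteAggregate {
    long count;
    double sum;
    double max = Double.NEGATIVE_INFINITY;
    double min = Double.POSITIVE_INFINITY;

    // Fold one raw sample into the running aggregate.
    void add(double value) {
        count++;
        sum += value;
        max = Math.max(max, value);
        min = Math.min(min, value);
    }

    double avg() {
        return count == 0 ? 0.0 : sum / count;
    }

    // Truncate an epoch-millisecond timestamp to the start of its minute.
    static long alignToMinute(long epochMillis) {
        return Math.floorDiv(epochMillis, 60_000L) * 60_000L;
    }

    // Group the (timestamp, value) samples of one series into per-minute aggregates.
    static Map<Long, MinuteAggregate> aggregate(long[] timestamps, double[] values) {
        Map<Long, MinuteAggregate> byMinute = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            byMinute.computeIfAbsent(alignToMinute(timestamps[i]), k -> new MinuteAggregate())
                    .add(values[i]);
        }
        return byMinute;
    }

    public static void main(String[] args) {
        long[] ts = {60_000L, 61_000L, 119_999L, 120_000L};
        double[] vs = {1.0, 3.0, 5.0, 7.0};
        aggregate(ts, vs).forEach((minute, agg) ->
                System.out.println(minute + " sum=" + agg.sum + " avg=" + agg.avg()
                        + " max=" + agg.max + " min=" + agg.min));
    }
}
```

Keeping count and sum rather than a stored average lets partial aggregates from different clients be merged later without losing accuracy.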

Problems with the HBase Solution

Slow TSDB‑specific queries compared with purpose‑built TSDBs.

HBase hotspot issues affecting write throughput.

Heavy operational overhead for the HBase stack.

Proprietary protocol incompatible with standard Prometheus, hindering integration with other monitoring tools.

Migration Challenges

Handling a write volume of 600 million rows per minute.

Governance of billions of high‑cardinality metrics and log‑type data.

Ensuring a transparent migration for existing users and programs.

Replacement Strategy

The migration focuses on two core components: dashboard‑writer and dashboard‑HBase. Other components remain unchanged, and the dashboard‑engine is only updated to call the new query API, preserving a seamless user experience.

1. Replace HBase with a VictoriaMetrics + ClickHouse hybrid store

VictoriaMetrics provides Prometheus‑compatible TSDB capabilities with superior query performance.

ClickHouse stores metadata (measurement lists, tag‑key/value mappings) and a small amount of log‑type data that does not fit well in VictoriaMetrics.

Local ClickHouse table schema:

CREATE TABLE hickwall.downsample_mtv (
  `timestamp` DateTime,
  `metricName` String,
  `tagKey` String,
  `tagValue` String,
  `datasourceId` UInt8 DEFAULT 40
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/hickwall_cluster-{shard}/downsample_mtv', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, metricName, tagKey)
TTL timestamp + toIntervalDay(7)
SETTINGS index_granularity = 8192

Distributed table definition:

CREATE TABLE hickwall.downsample_mtv__dt (
  `timestamp` DateTime,
  `metricName` String,
  `tagKey` String,
  `tagValue` String,
  `datasourceId` UInt8 DEFAULT 40
) ENGINE = Distributed(hickwall_cluster, hickwall, downsample_mtv, rand())
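
As an illustration of how these tables could serve metadata lookups, the sketch below builds three queries against the distributed table. The query text, the seven‑day window (chosen to match the table TTL), and the class name MetadataQueries are assumptions; the article does not show the actual queries.

```java
// Hypothetical query builder; the real SQL used by the query layer is not published.
class MetadataQueries {
    static final String TABLE = "hickwall.downsample_mtv__dt";
    static final String WINDOW = "timestamp >= now() - INTERVAL 7 DAY";

    // List the measurements seen within the retention window.
    static String measurements() {
        return "SELECT DISTINCT metricName FROM " + TABLE + " WHERE " + WINDOW;
    }

    // List the tag keys of one measurement; '?' is a JDBC-style bind parameter.
    static String tagKeys() {
        return "SELECT DISTINCT tagKey FROM " + TABLE
                + " WHERE metricName = ? AND " + WINDOW;
    }

    // List the tag values of one measurement and tag key.
    static String tagValues() {
        return "SELECT DISTINCT tagValue FROM " + TABLE
                + " WHERE metricName = ? AND tagKey = ? AND " + WINDOW;
    }

    public static void main(String[] args) {
        System.out.println(measurements());
        System.out.println(tagKeys());
        System.out.println(tagValues());
    }
}
```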

2. Upgrade dashboard‑writer to dashboard‑vmwriter

dashboard‑collector streams raw data to Kafka; dashboard‑vmwriter consumes it from Kafka, processes it, and writes the results to the new stores. The core functions added are:

Metadata extraction (measurement, tagKey, tagValue) stored in ClickHouse and cached in Redis.

Pre‑aggregation of metrics based on configuration pushed from an internal config center, reducing query scan cost.

Data governance: automatic detection and dropping of ultra‑high‑cardinality tags using HyperLogLog and Redis set cardinality checks.

High‑performance write pipeline with one Kafka consumer thread, multiple processing threads, and multiple writer threads, using hash‑based bucket assignment per measurement.
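
The governance step can be pictured with the following sketch. It is a deliberate simplification: an exact in‑memory HashSet stands in for the HyperLogLog estimate and Redis set checks mentioned above, and the class name, method names, and threshold are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified stand-in: exact sets instead of HyperLogLog and Redis.
class TagCardinalityGuard {
    private final int maxDistinctValues;
    private final Map<String, Set<String>> seen = new HashMap<>();
    private final Set<String> blocked = new HashSet<>();

    TagCardinalityGuard(int maxDistinctValues) {
        this.maxDistinctValues = maxDistinctValues;
    }

    // Record one tag value. Returns false once this tag key of this
    // measurement has exceeded the limit, and for every call after that.
    boolean observe(String measurement, String tagKey, String tagValue) {
        String key = measurement + "\n" + tagKey;
        if (blocked.contains(key)) {
            return false;
        }
        Set<String> values = seen.computeIfAbsent(key, k -> new HashSet<>());
        values.add(tagValue);
        if (values.size() > maxDistinctValues) {
            blocked.add(key);
            seen.remove(key); // the exact values are no longer needed
            return false;
        }
        return true;
    }

    // Return a copy of the tags without the keys flagged as ultra-high-cardinality.
    Map<String, String> filter(String measurement, Map<String, String> tags) {
        Map<String, String> kept = new HashMap<>();
        for (Map.Entry<String, String> e : tags.entrySet()) {
            if (observe(measurement, e.getKey(), e.getValue())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

An exact set grows with every distinct value, which is why a probabilistic counter such as HyperLogLog is attractive at the volumes described here: it estimates cardinality in a small, fixed amount of memory.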

Example hash function used for bucket selection:

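// Requires: import java.util.Arrays;
private int computeMetricNameHash(byte[] metricName) {
    int hash = Arrays.hashCode(metricName);
    // Math.abs(Integer.MIN_VALUE) is still negative, so map that one value to 0.
    return hash == Integer.MIN_VALUE ? 0 : hash;
}

// At the call site: events of the same metric always land in the same bucket.
byte[] metricName = metricEvent.getName();
int hash = computeMetricNameHash(metricName);
bBuckets[Math.abs(hash) % bucketCount].add(metricEvent);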
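
To show why hashing by measurement name matters, the sketch below wires the same idea into a small dispatcher. Everything beyond the hash itself is an assumption for illustration: the class name BucketDispatcher, the list‑backed buckets, and the use of String names encoded as UTF‑8. Because a given name always maps to the same bucket, a writer thread that owns one bucket sees all points of a measurement in arrival order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical dispatcher built around the hash-and-modulo step shown above.
class BucketDispatcher {
    private final List<List<String>> buckets = new ArrayList<>();

    BucketDispatcher(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Map a metric name to a bucket index in the range [0, bucketCount).
    static int bucketIndex(byte[] metricName, int bucketCount) {
        int hash = Arrays.hashCode(metricName);
        if (hash == Integer.MIN_VALUE) {
            hash = 0; // Math.abs would return a negative number for this value
        }
        return Math.abs(hash) % bucketCount;
    }

    // Append the event to its bucket and return the chosen index.
    int dispatch(String metricName) {
        int index = bucketIndex(metricName.getBytes(StandardCharsets.UTF_8), buckets.size());
        buckets.get(index).add(metricName);
        return index;
    }

    List<String> bucket(int index) {
        return buckets.get(index);
    }
}
```

Math.floorMod(hash, bucketCount) would be an alternative that yields a non‑negative index without the special case for Integer.MIN_VALUE.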

Typical write latency stays under 1 second, often in the hundreds of milliseconds.

3. Unified Metrics Query Layer

The new query service is compatible with the original Dashboard protocol and also supports standard Prometheus queries. It aggregates data from VictoriaMetrics (raw series) and ClickHouse (metadata, pre‑aggregated series). Four main APIs are exposed:

Data – returns time‑series data for a given measurement and tag set (source: VictoriaMetrics).

Measurement – lists available measurements (source: ClickHouse).

Measurement‑tagKey – lists tag keys for a measurement (source: ClickHouse).

Measurement‑tagKey‑tagValue – lists tag values for a measurement‑key pair (source: ClickHouse).
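
One way to picture the Data API is as a translation from a measurement plus tag filters into a PromQL series selector that VictoriaMetrics can evaluate. The sketch below is an assumption about how such a translation might look, not the actual query service: the class name PromQlSelector is hypothetical, and it presumes that measurement and tag names are already valid Prometheus metric and label names.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical translator from a Dashboard-style request to a PromQL selector.
class PromQlSelector {
    // Escape characters that are special inside a double-quoted PromQL string.
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n");
    }

    // Build metric{key="value",...}; a TreeMap keeps the label order deterministic.
    static String build(String measurement, Map<String, String> tags) {
        if (tags.isEmpty()) {
            return measurement;
        }
        StringJoiner matchers = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : new TreeMap<>(tags).entrySet()) {
            matchers.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
        }
        return measurement + matchers;
    }

    public static void main(String[] args) {
        System.out.println(build("http_requests", Map.of("app", "order", "idc", "sh")));
    }
}
```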

Results

Query latency improved roughly fourfold, with most queries completing in 10‑50 ms, eliminating frequent HBase timeouts.

Write stability increased dramatically, and the back‑pressure caused by HBase hotspots was fully resolved.

New features such as PromQL‑based calculations, period‑over‑period comparisons, and fuzzy matching became available.

Future Plans

Extend the unified query layer to other internal metrics platforms (e.g., HickWall, Cat).

Develop a unified write layer to consolidate the many existing Kafka‑to‑store pipelines.

Apply the same storage‑unification principles to log storage systems.

The migration demonstrates a practical path for modernizing legacy TSDB infrastructures by combining a high‑performance Prometheus‑compatible engine with a columnar store for metadata and auxiliary data.
