
How Ctrip Replaced HBase with VictoriaMetrics & ClickHouse for Scalable Metrics Monitoring

This article describes Ctrip's internal Dashboard monitoring platform, explains why its HBase‑based TSDB became a bottleneck, and walks through the step‑by‑step migration to a hybrid VictoriaMetrics‑ClickHouse solution, covering the upgraded writer, the unified query API, the resulting performance gains, and the future roadmap.

ITPUB

Aug 29, 2022


Background Overview

Ctrip’s internally developed Dashboard is a long‑standing enterprise‑grade metrics monitoring product that stores custom metrics, provides real‑time analysis, and visualizes time‑series data. Originally it relied on HBase to implement a TSDB because mature products like Prometheus or InfluxDB were unavailable at the time.

As the All‑in‑One monitoring suite emerged, the limitations of the HBase‑based storage became evident, prompting a redesign of the storage layer and the introduction of a unified query API.

Overall Architecture

Dashboard consists of six components: dashboard-engine (query engine), dashboard-gateway (user query interface), dashboard-writer (writes data to HBase), dashboard-collector (Netty‑based metrics collector), dashboard-agent (client SDK supporting sum, avg, max, min), and dashboard-HBase (HBase‑backed storage). The system ingests up to 600 million rows per minute.

Current Problems

Slow query performance compared with dedicated TSDBs.

HBase hotspot issues affecting write throughput.

Heavy operational overhead of the HBase/HDFS stack.

Proprietary protocol that does not support standard Prometheus APIs, hindering integration with the All‑in‑One platform.

Replacement Challenges

Massive write volume (600 M rows/min).

Large amount of high‑dimensional and log‑type metrics (billions of series) that are unfriendly to TSDBs.

Need for a transparent migration because many internal services depend on Dashboard.

Upgrade Plan

1.1 dashboard‑HBase → dashboard‑vm

The storage layer is switched from HBase to a hybrid of VictoriaMetrics (Prometheus‑compatible TSDB) and ClickHouse (metadata store).

VictoriaMetrics handles the majority of time‑series data with efficient PromQL queries.

ClickHouse stores metadata (measurement list, tag keys, tag values) and a small amount of log‑type data.

Local table definition:

CREATE TABLE hickwall.downsample_mtv
(
    `timestamp` DateTime,
    `metricName` String,
    `tagKey` String,
    `tagValue` String,
    `datasourceId` UInt8 DEFAULT 40
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/hickwall_cluster-{shard}/downsample_mtv', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, metricName, tagKey)
TTL timestamp + toIntervalDay(7)
SETTINGS index_granularity = 8192

Distributed table definition:

CREATE TABLE hickwall.downsample_mtv__dt
(
    `timestamp` DateTime,
    `metricName` String,
    `tagKey` String,
    `tagValue` String,
    `datasourceId` UInt8 DEFAULT 40
)
ENGINE = Distributed(hickwall_cluster, hickwall, downsample_mtv, rand())
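As a rough illustration of how rows for this table might be produced, the sketch below turns one metric event into (metricName, tagKey, tagValue) triples and skips triples it has already emitted. The class name MetadataExtractor, the in-memory dedup set, and the sample tag names are assumptions made for the example; the article does not show the actual extraction code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not Ctrip's code: maps metric events to rows shaped like
// the metricName / tagKey / tagValue columns of downsample_mtv.
class MetadataExtractor {
    // In-memory stand-in for whatever dedup cache a real writer would use.
    private final Set<List<String>> seen = new HashSet<>();

    /** Returns only the (metricName, tagKey, tagValue) rows not emitted before. */
    List<String[]> extract(String metricName, Map<String, String> tags) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            List<String> triple = Arrays.asList(metricName, tag.getKey(), tag.getValue());
            if (seen.add(triple)) { // add() is false when the triple was already seen
                rows.add(new String[] {metricName, tag.getKey(), tag.getValue()});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        MetadataExtractor extractor = new MetadataExtractor();
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("host", "web-01"); // sample values, invented for the demo
        tags.put("idc", "idc-a");
        System.out.println(extractor.extract("order.count", tags).size());
        System.out.println(extractor.extract("order.count", tags).size());
    }
}
```

Deduplicating before the insert keeps the metadata table small relative to the raw data stream, since most events repeat tag combinations that are already known.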

1.2 dashboard‑writer → dashboard‑vmwriter

Data flow changes to Kafka → vmwriter → storage. vmwriter adds:

Metadata extraction (measurements, tag keys/values) stored in ClickHouse, with real‑time updates via Redis.

Pre‑aggregation based on configuration pushed from the internal config center, generating metrics like hi_agg.{measurement}_{tag1}_{tag2}_{aggField}.

Data governance: HyperLogLog‑based cardinality checks and Redis‑based tag‑value throttling to drop dimensions whose cardinality grows too large.

High‑performance ingestion using a multi‑threaded pipeline (one Kafka consumer thread, several processing threads, several writer threads) with bucketed hashing per measurement.

Sample hashing code:

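import java.util.Arrays;

// Hash the raw metric name; remap Integer.MIN_VALUE because Math.abs() cannot negate it.
private int computeMetricNameHash(byte[] metricName) {
    int hash = Arrays.hashCode(metricName);
    return hash == Integer.MIN_VALUE ? 0 : hash;
}

// Events with the same metric name always land in the same bucket.
byte[] metricName = metricEvent.getName();
int hash = computeMetricNameHash(metricName);
buckets[Math.abs(hash) % bucketCount].add(metricEvent);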
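The surrounding pipeline can be sketched in the same spirit: one dispatching thread routes each event to a per‑bucket queue, and one worker thread per bucket drains it. This is a minimal sketch under assumptions, not the vmwriter implementation. The class BucketedPipeline, its unbounded queues, and the per‑bucket counters are invented for the example, and Math.floorMod is swapped in for the Math.abs approach above as an equivalent way to obtain a non‑negative index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a single dispatcher feeding one worker thread per bucket.
class BucketedPipeline {
    private static final byte[] STOP = new byte[0]; // sentinel object that ends a worker

    private final List<BlockingQueue<byte[]>> queues = new ArrayList<>();
    private final List<Thread> workers = new ArrayList<>();
    private final List<AtomicInteger> processed = new ArrayList<>();

    BucketedPipeline(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
            final AtomicInteger counter = new AtomicInteger();
            Thread worker = new Thread(() -> {
                try {
                    // Identity comparison: only the STOP object itself ends the loop.
                    while (queue.take() != STOP) {
                        counter.incrementAndGet(); // a real worker would encode and write here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            queues.add(queue);
            processed.add(counter);
            workers.add(worker);
            worker.start();
        }
    }

    /** Same name, same bucket; floorMod never yields a negative index. */
    static int bucketFor(byte[] metricName, int bucketCount) {
        return Math.floorMod(Arrays.hashCode(metricName), bucketCount);
    }

    /** Called from the single consumer thread. The queues are unbounded, so add() always succeeds. */
    void dispatch(byte[] metricName) {
        queues.get(bucketFor(metricName, queues.size())).add(metricName);
    }

    /** Lets the workers drain their queues, stops them, and returns per-bucket counts. */
    int[] shutdown() {
        for (BlockingQueue<byte[]> queue : queues) {
            queue.add(STOP);
        }
        int[] counts = new int[processed.size()];
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while stopping workers", e);
        }
        for (int i = 0; i < counts.length; i++) {
            counts[i] = processed.get(i).get();
        }
        return counts;
    }
}
```

Routing by measurement keeps all points of one metric on the same worker, which allows batching per measurement without cross‑thread coordination. A production pipeline would more likely use bounded queues so that slow writers apply back‑pressure to the consumer.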

Typical end‑to‑end write latency stays under one second, often in the hundreds of milliseconds.
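The pre‑aggregation step listed in this section can be illustrated with a small sketch that builds the hi_agg.{measurement}_{tag1}_{tag2}_{aggField} name and rolls raw points up by the configured tags. The class PreAggregator, the choice of sum as the aggregation, and the sample measurement and tag names are assumptions; the article only gives the naming pattern and says that rules come from the config center.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of one pre-aggregation rule; not Ctrip's implementation.
class PreAggregator {
    private final String measurement;
    private final List<String> groupByTags;
    private final String aggField;
    // Values of the group-by tags -> running sum of the aggregated field.
    private final Map<List<String>, Double> sums = new LinkedHashMap<>();

    PreAggregator(String measurement, List<String> groupByTags, String aggField) {
        this.measurement = measurement;
        this.groupByTags = groupByTags;
        this.aggField = aggField;
    }

    /** Builds hi_agg.{measurement}_{tag1}_{tag2}_{aggField}. */
    String metricName() {
        StringJoiner name = new StringJoiner("_", "hi_agg.", "");
        name.add(measurement);
        for (String tag : groupByTags) {
            name.add(tag);
        }
        name.add(aggField);
        return name.toString();
    }

    /** Adds one raw point; tags outside groupByTags are ignored, which reduces cardinality. */
    void add(Map<String, String> tags, double value) {
        List<String> key = new ArrayList<>();
        for (String tag : groupByTags) {
            key.add(tags.getOrDefault(tag, "")); // a missing tag is grouped under ""
        }
        sums.merge(key, value, Double::sum);
    }

    Map<List<String>, Double> sums() {
        return sums;
    }
}
```

For example, a rule on measurement "order" grouped by "idc" and "app" over field "count" would emit a series named hi_agg.order_idc_app_count, with per‑host points collapsed into one value per (idc, app) pair.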

1.3 Unified Metrics Query Layer

The new query service is compatible with the original Dashboard protocol and also supports standard Prometheus queries. It wraps VictoriaMetrics and ClickHouse, offering four core APIs:

Data API – time‑series data from VictoriaMetrics.

Measurement API – list of measurements from ClickHouse.

Measurement‑tagKey API – tag keys for a measurement.

Measurement‑tagKey‑tagValue API – tag values for a given measurement and tag key.
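One plausible way the three metadata APIs could map onto the distributed ClickHouse table defined earlier is sketched below. The class MetadataQueries, the lookback‑days parameter, and the use of "?" placeholders are assumptions for illustration; the article does not show the actual queries or say which ClickHouse client is used.

```java
// Hypothetical sketch: SQL text for the Measurement, tagKey, and tagValue APIs.
class MetadataQueries {
    static final String TABLE = "hickwall.downsample_mtv__dt";

    /** Time filter on the leading sort-key column; days is an int, so no quoting is needed. */
    private static String recent(int lookbackDays) {
        return "timestamp >= now() - INTERVAL " + lookbackDays + " DAY";
    }

    /** Measurement API: all known measurement names. */
    static String measurements(int lookbackDays) {
        return "SELECT DISTINCT metricName FROM " + TABLE
                + " WHERE " + recent(lookbackDays)
                + " ORDER BY metricName";
    }

    /** Measurement-tagKey API: bind the measurement name to the single placeholder. */
    static String tagKeys(int lookbackDays) {
        return "SELECT DISTINCT tagKey FROM " + TABLE
                + " WHERE " + recent(lookbackDays)
                + " AND metricName = ?"
                + " ORDER BY tagKey";
    }

    /** Measurement-tagKey-tagValue API: bind measurement name, then tag key. */
    static String tagValues(int lookbackDays) {
        return "SELECT DISTINCT tagValue FROM " + TABLE
                + " WHERE " + recent(lookbackDays)
                + " AND metricName = ? AND tagKey = ?"
                + " ORDER BY tagValue";
    }

    public static void main(String[] args) {
        System.out.println(tagValues(1));
    }
}
```

Because the local table is ordered by (timestamp, metricName, tagKey) and expires rows after seven days, bounding the query by time first is likely to let ClickHouse skip most of the data; that reasoning is an inference from the table definition rather than something the article states.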

The architecture remains modular: each data center runs its own storage cluster, while the unified query layer aggregates results across sites, so a single‑site failure affects only that site’s data.
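The failure‑isolation property can be sketched as a parallel fan‑out in which a failing site simply contributes an empty result. The class MultiSiteQuery, the modeling of each site as a Function from query string to result lines, and the choice to return an empty list on failure are all assumptions made for this example rather than details from the article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch of a cross-site fan-out; each site client is just a function here.
class MultiSiteQuery {
    /** Queries every site in parallel; a site that throws yields an empty list. */
    static Map<String, List<String>> queryAll(
            Map<String, Function<String, List<String>>> sites, String query) {
        Map<String, CompletableFuture<List<String>>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, List<String>>> site : sites.entrySet()) {
            Function<String, List<String>> client = site.getValue();
            CompletableFuture<List<String>> future =
                    CompletableFuture.supplyAsync(() -> client.apply(query));
            // Isolate failures: one broken site must not fail the whole request.
            pending.put(site.getKey(),
                    future.exceptionally(error -> new ArrayList<String>()));
        }
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, CompletableFuture<List<String>>> entry : pending.entrySet()) {
            results.put(entry.getKey(), entry.getValue().join());
        }
        return results;
    }
}
```

A real query layer would also need per‑site timeouts and some way to tell the caller which sites were skipped, so that partial results are not mistaken for complete ones.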

Before‑After Performance Comparison

Query latency improved roughly fourfold, with most queries now completing in 10‑50 ms versus frequent timeouts on HBase.

Write stability improved: the hotspot‑induced back‑pressure seen with HBase no longer occurs.

The query layer now supports advanced PromQL features such as arithmetic operators, rate(), and fuzzy (regex) matching.
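To make the PromQL point concrete, the sketch below builds a range‑query URL for the Prometheus‑compatible /api/v1/query_range endpoint that VictoriaMetrics exposes. The class PromQueryUrl, the host name, and the sample metric and label names are hypothetical, and reading "fuzzy matching" as the regex matcher =~ is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch: composing a Prometheus-style range query URL.
class PromQueryUrl {
    /** baseUrl carries scheme, host, port, and any deployment-specific path prefix. */
    static String queryRange(String baseUrl, String promql,
                             long startSeconds, long endSeconds, String step) {
        return baseUrl + "/api/v1/query_range"
                + "?query=" + encode(promql)
                + "&start=" + startSeconds
                + "&end=" + endSeconds
                + "&step=" + encode(step);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // rate() over a regex-matched label, scaled to per-minute with arithmetic.
        String promql = "rate(requests_total{app=~\"order.*\"}[1m]) * 60";
        System.out.println(queryRange("http://vm.example.internal:8428",
                promql, 1661731200L, 1661734800L, "1m"));
    }
}
```

The single expression in main combines all three features named above: rate() over a window, a regex label matcher, and arithmetic on the result.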

Future Plans

Extend the unified query layer to cover all internal metrics sources (e.g., HickWall, Cat) currently accessed via OpenResty + VictoriaMetrics.

Develop a unified ingestion layer to standardize write logic across the billions of metrics per second.

Apply the same unified storage principles to log data, creating a consistent architecture across metrics and logs.
