How HiTSDB’s New Streaming Aggregation Engine Boosts Query Speed 10×
This article examines the architectural redesign of Alibaba's High‑Performance Time Series Database (HiTSDB), covering storage model changes, inverted‑index enhancements, a pipelined streaming aggregation engine, data‑migration strategies, and performance benchmarks that together deliver over tenfold query speed improvements.
Introduction
HiTSDB (High‑Performance Time Series Database) is a low‑cost, highly reliable online time‑series service used extensively within Alibaba for monitoring and IoT workloads, supporting massive traffic such as the 2016 and 2017 Double‑11 events.
Background
While HiTSDB’s original engine was heavily optimized for internal Alibaba services, many of these optimizations proved difficult to apply in a public‑cloud environment. Aggregation queries often caused stack overflows, OOM errors, or complete query stalls due to architectural flaws in the original aggregation engine.
Upgrade Overview
The development team decided to revamp the engine around five key areas: storage model redesign, index upgrades, a brand‑new streaming aggregation engine, data migration, and performance evaluation. The focus of this article is the streaming aggregation component.
1. Time‑Series Storage Model
Typical time‑series data consists of a timestamp dimension and a “timeline” dimension (metric + tags). Two common storage strategies exist:
Partition by time‑window, storing consecutive points of the same natural window adjacently (used by InfluxDB, Prometheus, etc.). This approach suffers from out‑of‑order handling and can cause write performance penalties for late data.
Partition by timeline, placing all points of the same metric‑tag combination together. HiTSDB adopts this method, storing rows in HBase with a row‑key composed of metric, tags, and a natural window. This yields very efficient low‑dimensional queries, while higher‑dimensional queries use a streaming scan.
Hot‑spot problems arise when high‑frequency metrics (e.g., cpu.usage collected every second) concentrate writes on a few storage shards, causing severe data skew. Bucket‑based sharding (splitting by metric, business name, and host) mitigates this issue.
2. Inverted Index
Multi‑dimensional timelines require an inverted index to support fast queries. Posting lists can become extremely long for high‑cardinality tags, leading to memory bloat. Compression (delta encoding) and string dictionary techniques are employed to reduce memory usage.
3. Streaming Aggregation Engine
The original engine used a materialization execution mode that read all relevant time‑series into memory before aggregation, causing heap explosions for large time ranges. It also tightly coupled query, filter, interpolation, and aggregation logic, making extensions difficult.
3.1 New Execution Model
A pipeline (Volcano/Iterator) execution model is introduced. Queries are broken into a DAG of operators, each implementing Open, Next, and Close. Operators include:
ScanOp – asynchronous HBase reads.
DsAggOp – down‑sampling and interpolation.
AggOp – group aggregation (PipeAggOp, MTAggOp).
RateOp – compute rate of change.
This design enables clear interfaces, independent optimization of operators, and easy addition of new operators (e.g., field‑value filters).
3.2 Operator Strategies
Two main aggregation operators are used:
PipeAggOp : for queries that do not require interpolation (down‑sampled with non‑null fill) and where aggregation functions support incremental computation (sum, count, avg, min, max). It processes batches of time‑series in a streaming fashion, keeping only per‑group aggregates in memory.
MTAggOp / GroupedAggOp : for queries that need interpolation. MT‑Agg falls back to materialization, while GroupedAggOp leverages pre‑sorted tag groups to limit materialized data to a single group at a time, reducing memory pressure.
3.3 Query Optimizer and Executor
The optimizer selects the appropriate operator chain based on query semantics, aiming to minimize memory consumption and maximize throughput.
4. Data Migration
The new engine stores data in a format incompatible with older versions. A migration tool converts existing points during a hot upgrade. The architecture uses concurrent “salt” partitions, each with a producer (HBase scanner) and consumer (writer). Flow control limits queue size to avoid memory pressure. A ZooKeeper‑based DataLock ensures only one node performs migration at a time. Migration is idempotent: a flag data_conversion_completed is written after successful conversion, and periodic runs exit early if the flag exists.
5. Performance Evaluation
Migration throughput reaches up to 200 k TPS (≈10 GB/hour). Query latency tests show the new engine reduces average response time from ~190 ms to ~13 ms for a 1 k‑line, 1‑hour window workload, and from ~1.8 s to ~0.18 s for a single‑line, 10‑hour window workload—an improvement of more than tenfold.
Conclusion
The redesign of HiTSDB’s aggregation engine, storage model, and indexing dramatically improves stability, write/read performance, and extensibility, positioning HiTSDB for broader commercial use.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
