How SLS Boosted Prometheus Query Performance Over 10× with Cloud‑Native Innovations

This article describes recent upgrades to the Prometheus storage engine in Alibaba Cloud's SLS. The engine stays compatible with PromQL while speeding up queries by more than 10× and lowering cost. The changes covered are smarter aggregation writes, built-in downsampling, global caching, parallel computation, and push-down processing, followed by benchmark comparisons with open-source solutions.

Alibaba Cloud Developer

Nov 8, 2023

Technical Challenges

Observability has become a hot topic, with many startups and established monitoring, APM, and logging vendors offering unified log, metric, and trace capabilities. The core of data innovation is a stable, powerful, and cost‑effective storage‑compute engine.

Unlike common industry setups that run a separate stack per data type (e.g., Elasticsearch for logs, ClickHouse for traces, Prometheus for metrics), SLS uses a unified architecture at the data-engine level and supports all observability data in a single process. Starting in 2018, SLS added PromQL support on top of its log model, and it later redesigned the storage engine to store logs, traces, and metrics together.

SLS Time‑Series Engine Technical Upgrades

Smarter Aggregation Writes

Writing all data of one time series to a single shard improves both storage and read efficiency. Initially, users had to aggregate on the client side with a Producer and specify a shard hash key, which consumed significant client resources. With the new gateway-side aggregation, clients can write to any gateway node, and the gateway automatically aggregates the data into the appropriate shard, reducing client overhead.
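The routing idea can be sketched as follows. This is a minimal illustration, not the SLS gateway's code: the function names `series_key` and `shard_for_series`, the use of MD5, and the modulo placement are all assumptions chosen for brevity.

```python
import hashlib


def series_key(labels: dict[str, str]) -> str:
    """Canonical identity of a time series: its labels in sorted order."""
    return ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


def shard_for_series(labels: dict[str, str], shard_count: int) -> int:
    """Map a series to a shard so every sample of that series lands together.

    Hypothetical scheme: hash the canonical label string and take it modulo
    the shard count. A real gateway may assign hash ranges to shards instead.
    """
    digest = hashlib.md5(series_key(labels).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Two samples of the same series, written with labels in a different order,
# are routed to the same shard.
a = shard_for_series({"__name__": "cpu_usage", "host": "web-1"}, 16)
b = shard_for_series({"host": "web-1", "__name__": "cpu_usage"}, 16)
```

Because the shard is derived from the series identity alone, the client no longer needs to compute or send a hash key; it can hand samples to whichever gateway node it reaches.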

Global Cache for Dashboards

Dashboard queries, especially during load testing or incidents, generate many simultaneous requests. By aligning PromQL query ranges to the step size, cached results can be reused across requests, dramatically increasing cache hit rates.

When caching is enabled, a request arriving at any compute node has its time range aligned to the step.

The aligned range is used to query the SLS Cache Server.

Any part of the range that misses the cache is fetched from the backend and computed.

The result is returned to the client, and the cache is updated with the newly computed portion.

Only the range alignment differs from standard PromQL behavior, with negligible impact on result trends.
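The effect of step alignment on cache reuse can be seen in a small sketch. Everything here is illustrative: `align_range`, `StepCache`, and the in-memory dictionary are hypothetical stand-ins, and the real SLS Cache Server protocol is not described in this article.

```python
def align_range(start: int, end: int, step: int) -> tuple[int, int]:
    """Snap a query range (in seconds) down to multiples of the step."""
    if step <= 0:
        raise ValueError("step must be positive")
    return start - start % step, end - end % step


class StepCache:
    """Toy in-memory stand-in for a shared cache, keyed by (query, step)."""

    def __init__(self):
        self._points = {}  # (query, step) -> {timestamp: value}

    def query_range(self, query, start, end, step, compute):
        start, end = align_range(start, end, step)
        cached = self._points.setdefault((query, step), {})
        wanted = range(start, end + 1, step)
        missing = [t for t in wanted if t not in cached]
        if missing:
            # Only timestamps not yet cached are computed by the backend.
            cached.update(compute(query, missing))
        return [(t, cached[t]) for t in wanted], len(missing)


def fake_backend(query, timestamps):
    """Placeholder for the real evaluation: returns one value per timestamp."""
    return {t: float(t) for t in timestamps}


cache = StepCache()
# Two dashboard refreshes with slightly different, unaligned ranges.
_, first_missing = cache.query_range("up", 1003, 1303, 60, fake_backend)
_, second_missing = cache.query_range("up", 1010, 1370, 60, fake_backend)
```

Without alignment, the two requests would evaluate at different timestamps (1003, 1063, ... versus 1010, 1070, ...) and share nothing. With alignment, the second request only has to compute the one new point at the end of its range.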

Distributed Parallel Computation of PromQL

Standard Prometheus evaluates each query on a single thread, which becomes a bottleneck at large scale. SLS introduces a parallel architecture: a master node splits the query, workers execute the sub-queries, and the master aggregates the final result. This decouples query concurrency from the shard count.

Not all queries benefit from parallelism, but over 90% of production queries see acceleration.
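The split, execute, and merge pattern can be sketched for a `sum`-style aggregation. This is an assumption-laden toy, not the SLS scheduler: `partial_sum` and `parallel_sum` are hypothetical names, the series are split round-robin, and Python threads stand in for worker nodes purely to show the data flow.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partial_sum(series_chunk):
    """Worker: sum the values of its series at each timestamp."""
    out = defaultdict(float)
    for samples in series_chunk:
        for ts, value in samples:
            out[ts] += value
    return out


def parallel_sum(all_series, workers=4):
    """Master: split the series set, fan out, then merge the partial sums."""
    chunks = [all_series[i::workers] for i in range(workers)]
    merged = defaultdict(float)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(partial_sum, chunks):
            for ts, value in partial.items():
                merged[ts] += value
    return dict(sorted(merged.items()))


# 100 series, each reporting 1.0 at three timestamps.
series = [[(0, 1.0), (15, 1.0), (30, 1.0)] for _ in range(100)]
result = parallel_sum(series, workers=8)
```

This works because a sum can be rebuilt from partial sums. Some operations do not decompose this way (an exact quantile, for example, cannot be recovered from per-chunk quantiles), which is one plausible reason why not every query can be accelerated.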

Push‑Down Computation for Extreme Scale

To reduce serialization and network overhead, SLS pushes part of the PromQL computation to the storage shards. Two approaches were evaluated: using the standard Go engine (still incurs serialization) and implementing common operators in C++ to avoid Go GC and serialization costs. The C++ engine was chosen due to its low implementation effort for the most common queries, delivering over tenfold performance gains at massive scales.
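To show why push-down cuts the data that must be serialized and shipped, the sketch below computes an average from per-shard partial results. It is written in Python for readability, whereas the article states that SLS implemented its operators in C++. The names `Partial`, `shard_partial_avg`, and `merge_avg` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """What a shard returns per timestamp instead of raw samples."""
    total: float = 0.0
    count: int = 0


def shard_partial_avg(shard_samples):
    """Runs next to the data: reduce raw samples to {timestamp: Partial}."""
    out = {}
    for ts, value in shard_samples:
        p = out.setdefault(ts, Partial())
        p.total += value
        p.count += 1
    return out


def merge_avg(partials):
    """Runs on the compute node: combine shard partials into the final avg."""
    merged = {}
    for partial in partials:
        for ts, p in partial.items():
            m = merged.setdefault(ts, Partial())
            m.total += p.total
            m.count += p.count
    return {ts: m.total / m.count for ts, m in sorted(merged.items())}


shard_a = [(0, 2.0), (0, 4.0), (60, 6.0)]
shard_b = [(0, 6.0), (60, 2.0)]
partials = [shard_partial_avg(shard_a), shard_partial_avg(shard_b)]
averages = merge_avg(partials)
```

Each shard ships one `(total, count)` pair per timestamp regardless of how many series it holds, so the volume crossing the network grows with the number of output points rather than with the number of raw samples.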

Built‑In Downsampling

Long-term storage of high-precision metrics is costly. Previously, downsampling required manual ScheduledSQL jobs and query rewrites. SLS now offers native downsampling. Users configure the downsampling interval and retention, and SLS automatically stores the latest point in each interval. At query time, SLS selects the appropriate metric store based on the step and time range, so queries need no manual changes.
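The two halves of this feature, keeping the latest point per interval and choosing a store at query time, can be sketched as below. The selection rule is a guess made for illustration: the article says the choice depends on step and time range, and only the step is modelled here. `downsample_latest`, `pick_store`, and the store names are hypothetical.

```python
def downsample_latest(points, interval):
    """Keep the latest (timestamp, value) in each interval-wide bucket."""
    latest = {}
    for ts, value in points:
        bucket = ts - ts % interval
        if bucket not in latest or ts >= latest[bucket][0]:
            latest[bucket] = (ts, value)
    return [latest[b] for b in sorted(latest)]


def pick_store(step, stores):
    """Choose the coarsest store whose interval does not exceed the step.

    `stores` maps a store name to its sampling interval in seconds.
    Falls back to the finest store when the step is smaller than all of them.
    """
    usable = {name: iv for name, iv in stores.items() if iv <= step}
    if not usable:
        return min(stores, key=stores.get)
    return max(usable, key=usable.get)


raw = [(0, 1.0), (15, 2.0), (45, 3.0), (60, 4.0), (110, 5.0)]
coarse = downsample_latest(raw, 60)

stores = {"raw": 15, "1m": 60, "1h": 3600}
dashboard_store = pick_store(300, stores)
```

A query with a 5-minute step gains nothing from 15-second data, so it can be served from the 1-minute store; a query with a step finer than every downsampled interval still reads the raw store.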

UnionMetricStore for Cross‑Project Queries

To enable high‑performance queries across multiple projects or regions, SLS introduces UnionMetricStore, which supports full PromQL and can aggregate data from several MetricStores within the same account.

Performance Benchmarks

SLS was benchmarked against open-source Prometheus, Thanos, and VictoriaMetrics at three scales (4k, 360k, and 1.28M time series). Three SLS variants were tested: sls-normal (single-threaded), sls-parallel-32 (32-way parallel), and sls-pushdown (C++ engine). Results show sls-pushdown achieving more than 10× lower latency at the largest scale, while VictoriaMetrics performs well on the smaller datasets. Detailed reports will follow.

Cost Reductions

Two major cost‑saving mechanisms are highlighted:

Aggregated writes reduce storage and compute resources per unit of data.

Built‑in downsampling lowers long‑term storage volume, decreasing per‑GB pricing.

These improvements translate into lower SLS time‑series pricing for customers.

Conclusion

The technical upgrades described here (smarter aggregation, global caching, parallel and push-down computation, native downsampling, and UnionMetricStore) significantly boost performance, reduce cost, and enhance stability for cloud-native observability workloads. Further documentation, best-practice guides, and detailed test reports will be released gradually.
