How SLS Boosted Prometheus Query Performance Over 10× with Cloud‑Native Innovations
This article details recent upgrades to the Prometheus storage engine in Alibaba Cloud's SLS. It describes how the engine stays compatible with PromQL while speeding up queries more than tenfold and lowering costs, through smarter aggregation writes, built‑in downsampling, global caching, parallel computation, and push‑down processing. It also presents benchmark comparisons with open‑source solutions.
Technical Challenges
Observability has become a hot topic, with many startups and established monitoring, APM, and logging vendors offering unified log, metric, and trace capabilities. The core of data innovation is a stable, powerful, and cost‑effective storage‑compute engine.
Unlike industry solutions that use separate stacks (e.g., Elasticsearch for logs, ClickHouse for traces, Prometheus for metrics), SLS uses a unified architecture at the data‑engine level and supports all observability data in a single process. Starting in 2018, SLS added PromQL support on top of its log model, and it later redesigned the storage engine to store logs, traces, and metrics together.
SLS Time‑Series Engine Technical Upgrades
Smarter Aggregation Writes
Placing all data that belongs to the same time series in a single shard improves both storage and read efficiency. Initially, users had to aggregate on the client side with a Producer and specify a shard hash key, which consumed significant client resources. With the new gateway‑side aggregation, clients write data to any gateway node, and the gateway automatically aggregates it into the appropriate shard, which reduces client overhead.
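As a rough illustration of the routing idea, the sketch below hashes a canonical form of each series' labels to pick a shard, so that every sample of one series lands in the same batch. The function names, the use of MD5, and the modulo mapping are assumptions made for this example; the source does not describe how the SLS gateway actually computes the shard.

```python
import hashlib
from collections import defaultdict


def series_key(labels: dict) -> str:
    # Sort labels so the same series always yields the same key,
    # regardless of the order in which the client sent them.
    return ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


def shard_for(labels: dict, shard_count: int) -> int:
    # Hypothetical mapping: stable hash of the series key, modulo shard count.
    digest = hashlib.md5(series_key(labels).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def route(samples, shard_count: int) -> dict:
    # Group incoming (labels, timestamp, value) samples into per-shard batches.
    batches = defaultdict(list)
    for labels, timestamp, value in samples:
        batches[shard_for(labels, shard_count)].append((labels, timestamp, value))
    return dict(batches)


samples = [
    ({"__name__": "cpu", "host": "a"}, 1, 0.5),
    ({"host": "a", "__name__": "cpu"}, 2, 0.6),  # same series, different label order
    ({"__name__": "cpu", "host": "b"}, 1, 0.1),
]
batches = route(samples, shard_count=4)
```

Because the grouping happens in `route`, a client in this model needs no knowledge of shards at all, which is the overhead reduction the section describes.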
Global Cache for Dashboards
Dashboard queries, especially during load testing or incidents, generate many simultaneous requests. When PromQL query ranges are aligned to the step size, requests that arrive at slightly different times resolve to the same range, so cached results can be reused across requests and the cache hit rate rises sharply. With caching enabled, a request is handled as follows:
1. A request arriving at any compute node has its range adjusted to the step.
2. The adjusted range is used to query the SLS Cache Server.
3. Ranges that miss the cache are fetched from the backend and computed.
4. Results are returned to the client, and the cache is updated with the newly computed portion.
Only the range alignment differs from standard PromQL behavior, and its impact on result trends is negligible.
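The alignment step can be sketched as follows. This is a minimal example that assumes both ends of the range are rounded down to a multiple of the step; the source does not say which rounding direction SLS uses, and `align_range` and `cache_key` are hypothetical names.

```python
def align_range(start: int, end: int, step: int):
    # Round both ends down to a multiple of the step (assumed direction).
    if step <= 0:
        raise ValueError("step must be positive")
    return start - start % step, end - end % step


def cache_key(query: str, start: int, end: int, step: int):
    # Requests whose ranges fall in the same step window share one key.
    aligned_start, aligned_end = align_range(start, end, step)
    return (query, aligned_start, aligned_end, step)


# Two dashboard refreshes 45 seconds apart, both with a 60-second step.
first = cache_key("up", 125, 725, 60)
second = cache_key("up", 170, 770, 60)
same_entry = first == second
```

Without alignment, the two requests above would carry different start and end times and could never share a cache entry, even though the evaluated points are nearly identical.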
Distributed Parallel Computation of PromQL
Standard Prometheus evaluates each query on a single thread, which becomes a bottleneck at large scale. SLS introduces a parallel architecture in which a master node splits the query, workers execute the sub‑queries, and the master aggregates the final result. This decouples query concurrency from the shard count.
Not all queries benefit from parallelism, but over 90% of production queries see acceleration.
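The split, execute, and merge pattern can be illustrated with a decomposable aggregation such as a sum across series. The sketch below is only a model of the idea: the function names are hypothetical, each series is assumed to be a list of (timestamp, value) pairs, and a thread pool stands in for the worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(series_chunk):
    # Worker: sum values per timestamp over its own subset of series.
    totals = {}
    for samples in series_chunk:
        for ts, value in samples:
            totals[ts] = totals.get(ts, 0.0) + value
    return totals


def parallel_sum(all_series, workers: int = 4):
    # Master: split the series into chunks, independent of any shard layout.
    chunks = [all_series[i::workers] for i in range(workers)]
    chunks = [chunk for chunk in chunks if chunk]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Master: merge the partial results into the final answer.
    merged = {}
    for partial in partials:
        for ts, value in partial.items():
            merged[ts] = merged.get(ts, 0.0) + value
    return dict(sorted(merged.items()))


series = [
    [(0, 1.0), (60, 2.0)],
    [(0, 3.0), (60, 4.0)],
    [(0, 5.0)],
]
result = parallel_sum(series, workers=2)
```

A sum splits cleanly because partial sums can be added together. Queries whose result cannot be rebuilt from independent partial results are the kind that would not benefit from this scheme. In CPython, threads show the structure rather than a real speedup for CPU-bound work.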
Push‑Down Computation for Extreme Scale
To reduce serialization and network overhead, SLS pushes part of the PromQL computation down to the storage shards. Two approaches were evaluated: running the standard Go engine on the shards, which still incurs serialization, and implementing the common operators in C++, which avoids both Go GC and serialization costs. The C++ approach was chosen because the most common queries take little effort to implement, and it delivers over tenfold performance gains at massive scale.
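To show why push-down cuts the data that crosses the network, the sketch below models an average computed from shard-local partial results. It is written in Python purely for illustration and is not the C++ engine; the function names and the (sum, count) representation are assumptions for this example.

```python
def shard_partial_avg(shard_samples):
    # Runs "inside" a shard: reduce raw (timestamp, value) samples
    # to one (sum, count) pair per timestamp.
    partial = {}
    for ts, value in shard_samples:
        total, count = partial.get(ts, (0.0, 0))
        partial[ts] = (total + value, count + 1)
    return partial


def merge_avg(partials):
    # Runs on the compute node: combine the small per-shard pairs
    # and finish the average.
    merged = {}
    for partial in partials:
        for ts, (total, count) in partial.items():
            m_total, m_count = merged.get(ts, (0.0, 0))
            merged[ts] = (m_total + total, m_count + count)
    return {ts: total / count for ts, (total, count) in sorted(merged.items())}


shard_a = [(0, 1.0), (0, 3.0), (60, 2.0)]
shard_b = [(0, 8.0), (60, 4.0)]
averages = merge_avg([shard_partial_avg(shard_a), shard_partial_avg(shard_b)])
```

Each shard sends one pair per timestamp instead of every raw sample, so the volume that must be serialized and transferred shrinks as the number of series per shard grows.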
Built‑In Downsampling
Long‑term storage of high‑precision metrics is costly. Previously, downsampling required manual ScheduledSQL jobs and query rewrites. SLS now offers native downsampling: users configure the downsampling interval and retention, SLS automatically stores the latest point per interval, and queries automatically select the appropriate metric store based on step and time range, without manual query changes.
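The two halves of this feature, keeping the latest point per interval and picking a store at query time, can be sketched as below. The function names are hypothetical, and the selection rule is a simplification: it considers only the query step, whereas the text says SLS also takes the time range into account.

```python
def downsample_latest(points, interval: int):
    # Keep only the latest (timestamp, value) point in each interval bucket.
    latest = {}
    for ts, value in sorted(points):
        latest[ts - ts % interval] = (ts, value)
    return [latest[bucket] for bucket in sorted(latest)]


def choose_store(step: int, intervals):
    # Assumed rule: use the coarsest downsampled store whose interval
    # does not exceed the query step; None means the raw store.
    candidates = [i for i in intervals if i <= step]
    return max(candidates) if candidates else None


raw = [(5, 1.0), (50, 2.0), (65, 3.0), (119, 4.0), (130, 5.0)]
one_minute = downsample_latest(raw, interval=60)
store_for_dashboard = choose_store(step=300, intervals=[60, 600])
```

Because the choice is made from the query parameters, the PromQL text itself stays unchanged, which matches the claim that no manual query rewrite is needed.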
UnionMetricStore for Cross‑Project Queries
To enable high‑performance queries across multiple projects or regions, SLS introduces UnionMetricStore, which supports full PromQL and can aggregate data from several MetricStores within the same account.
Performance Benchmarks
SLS was benchmarked against open‑source Prometheus, Thanos, and VictoriaMetrics across three scales (4 k, 360 k, 1.28 M time series). Variants tested: sls‑normal (single‑threaded), sls‑parallel‑32 (32‑way parallel), and sls‑pushdown (C++ engine). Results show sls‑pushdown achieving >10× lower latency at the largest scale, while VictoriaMetrics performs well on smaller datasets. Detailed reports will follow.
Cost Reductions
Two major cost‑saving mechanisms are highlighted:
Aggregated writes reduce storage and compute resources per unit of data.
Built‑in downsampling lowers long‑term storage volume, decreasing per‑GB pricing.
These improvements translate into lower SLS time‑series pricing for customers.
Conclusion
The technical upgrades described here (smarter aggregation, global caching, parallel and push‑down computation, native downsampling, and UnionMetricStore) significantly boost performance, reduce cost, and enhance stability for cloud‑native observability workloads. Further documentation, best‑practice guides, and detailed test reports will be released over time.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
