Taming High Cardinality in AI Model & Autonomous Driving Monitoring with Prometheus
This article explores how high cardinality in Prometheus metrics impacts AI large‑model and autonomous‑driving observability, explains the underlying concepts, outlines the performance and cost challenges, and presents practical design, collection, and query‑side solutions—including metric modeling, pre‑aggregation, and remote‑read pushdown—to keep monitoring efficient and scalable.
Why Observability Matters for AI
AI technologies such as large language models (LLMs) and autonomous driving are rapidly advancing, becoming key drivers of innovation and efficiency. As these AI systems proliferate, observability techniques are increasingly critical for ensuring performance and reliability.
Prometheus: The De‑Facto Cloud‑Native Monitoring Standard
Prometheus, originally created by SoundCloud in 2012 and donated to the CNCF in 2016, is now the standard for cloud‑native monitoring. It is widely used for observability tasks in AI large‑model and autonomous‑driving workloads.
What Is Cardinality?
Cardinality describes the number of distinct label combinations that a metric can have. For example, a Prometheus metric http_request_total with labels cluster, service, endpoint, method, and code produces one time series per unique combination of label values. With 2 clusters, 5 services, 200 endpoints, 2 methods, and 5 response codes, the cardinality is 2 × 5 × 200 × 2 × 5 = 20,000 series.
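To make the arithmetic concrete, the sketch below shows one such series; the label values are illustrative, not taken from a real deployment:

```promql
# One concrete label combination = one time series:
http_request_total{cluster="prod-1", service="auth", endpoint="/login", method="GET", code="200"}

# Total series for the metric = product of distinct values per label:
#   2 clusters x 5 services x 200 endpoints x 2 methods x 5 codes = 20,000 series
```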
Impact and Causes of High Cardinality
Increased monitoring system cost: More series mean larger indexes and caches, and higher CPU, memory, and storage consumption.
Slower read/write latency: Index creation and larger result sets extend both ingestion and query times.
Quota exhaustion: Multi-tenant environments hit per-tenant metric quotas more quickly.
High cardinality often stems from a large number of Prometheus targets, services that expose many time series, or labels with high churn (e.g., user IDs, raw URLs). In AI workloads, the constant churn of pod names across short-lived training jobs is a typical source.
Common Solutions to High Cardinality
Metric Design
Reasonable modeling: Use labels that reflect meaningful dimensions; push overly granular identifiers to logs or traces where they belong (see the sketch after this list).
Metric decomposition: Split a metric into separate, focused series instead of packing every label onto one metric.
Metric lifecycle management: Align a metric's lifespan with the underlying resource and clean up stale series.
Metric switches: Enable or disable optional metric groups (e.g., node-exporter collector flags).
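As a sketch of the modeling point above, compare a high-churn label with a bounded one; the metric and label names here are hypothetical:

```promql
# Unbounded: one series per user and per raw URL; cardinality grows without limit
api_request_duration_seconds{user_id="u-48231", path="/v1/users/u-48231"}

# Bounded: templated route and method only; per-user detail belongs in logs or traces
api_request_duration_seconds{route="/v1/users/:id", method="GET"}
```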
On‑Demand Choices
Selective enabling: Turn off unneeded metrics at the exporter level.
Write-time discard: Use relabel configurations to drop unwanted series before ingestion (see the Prometheus relabel docs); a minimal sketch follows this list.
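The configuration below drops per-pod series from short-lived training pods before they are ingested, so their high-churn pod names never enter the TSDB index. The job name, pod label, and regex are illustrative assumptions; metric_relabel_configs itself is standard Prometheus configuration:

```yaml
scrape_configs:
  - job_name: training-jobs        # hypothetical scrape job
    metric_relabel_configs:
      # Drop series from short-lived training pods at write time
      - source_labels: [pod]
        regex: "train-job-.*"
        action: drop
```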
Metric Analysis
Tools such as Grafana and the built-in Prometheus UI (e.g., its TSDB status page) help identify high-cardinality series. VMP also offers expert analysis services.
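For example, the standard PromQL below, run against a Prometheus instance itself, surfaces the heaviest metrics; the second query assumes http_request_total is a suspect metric:

```promql
# Top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))

# For one suspect metric, see which label drives the cardinality
count by (pod) (http_request_total)
```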
Solutions Specific to Large‑Model & Autonomous‑Driving Workloads
These domains share high‑cardinality challenges, often due to pod‑name churn and massive numbers of model serving endpoints. The following techniques are effective:
Write‑Side Pre‑Aggregation
VMP's collection component can aggregate metrics before storage, for example combining pod-level CPU usage into node-pool-level or task-level aggregates, reducing the stored series count.
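VMP's collection-side aggregation is a managed feature; the closest open-source analogue is a Prometheus recording rule, sketched below. It assumes pod-level CPU series carry a node_pool label, which in practice may need to be attached at relabel time:

```yaml
groups:
  - name: pod-to-nodepool-preaggregation
    rules:
      # Collapse per-pod CPU series into one series per node pool; only the
      # aggregate needs long-term retention, shrinking the stored series count.
      - record: node_pool:container_cpu_usage_seconds:rate5m
        expr: sum by (node_pool) (rate(container_cpu_usage_seconds_total[5m]))
```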
Aggregation‑Pushdown Queries
When a PromQL query contains aggregation operators (sum, count, avg), VMP rewrites the query AST to push the aggregation down to each workspace and merges the partial results centrally. If no pushable operators exist, the system falls back to RemoteRead and fetches the raw series.
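A sketch of the rewrite, using a hypothetical inference metric; the mechanics are paraphrased from the description above, not VMP's actual implementation:

```promql
# User-facing query, logically spanning every workspace:
sum by (model) (rate(inference_requests_total[5m]))

# After pushdown, each workspace evaluates the same aggregate locally and
# returns only its partial result; the coordinator merges the partials with
# an outer sum by (model). This is correct because sum is distributive;
# avg must first be rewritten as sum/count so each part pushes down cleanly.
```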
RemoteRead and Thanos distributed query execution are leveraged for cross-instance queries.
Cross‑Cloud & Cross‑Region Aggregation
Aggregated queries also help in multi‑cloud or multi‑region scenarios, providing a unified view while minimizing data transfer.
Conclusion
The Volcano Engine Observability team shares these practices to address high cardinality in AI large‑model and autonomous‑driving monitoring. By applying thoughtful metric design, pre‑aggregation, and query‑side pushdown, organizations can maintain performant, cost‑effective observability at scale.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's ToD community, connects the platform with developers: it offers cutting-edge technical content and diverse events, nurtures a vibrant developer culture, and co-builds an open-source ecosystem.