How to Build an Effective Monitoring and Alerting System for StarRocks Clusters
This guide explains how to design a comprehensive monitoring and alerting framework for StarRocks clusters, covering resource usage, service availability, and business continuity, with practical PromQL queries and troubleshooting steps.
Key Metrics and PromQL Queries
CPU Overload (BE):
(1 - (sum(rate(starrocks_be_cpu{mode="idle",job="$cluster_name",instance=~".*"}[5m])) by (job,instance)) / (sum(rate(starrocks_be_cpu{job="$cluster_name",host=~".*"}[5m])) by (job,instance))) * 100
Trigger when > 90%.
Memory Usage (BE):
starrocks_be_process_mem_bytes > 0.9 * node_memory_MemTotal_bytes
Disk I/O Utilization:
rate(node_disk_io_time_seconds_total{instance=~".*"}[5m]) * 100 > 90
Disk Capacity:
(sum(starrocks_be_disks_total_capacity{job="$job"}) by (host,path) - sum(starrocks_be_disks_avail_capacity{job="$job"}) by (host,path)) / sum(starrocks_be_disks_total_capacity{job="$job"}) by (host,path) * 100 > 90
Root Filesystem Free Space:
node_filesystem_free_bytes{mountpoint="/"} / 1024 / 1024 / 1024 < 5
FE Metadata Disk Free Space:
node_filesystem_free_bytes{mountpoint="${meta_path}"} / 1024 / 1024 / 1024 < 10
FE Service Availability:
count(up{group="fe",job="$job_name"}) < 3
BE Service Availability:
node_info{type="be_node_num",job="$job_name",state="dead"} > 1
FE JVM Heap Usage:
sum(jvm_heap_size_bytes{job="$job_name",type="used"}) * 100 / sum(jvm_heap_size_bytes{job="$job_name",type="max"}) > 90
Compaction Failures:
increase(starrocks_be_engine_requests_total{job="$job_name",status="failed",type="cumulative_compaction"}[1m]) > 3
(and similarly with type="base_compaction" for base compaction).
Compaction Pressure:
starrocks_fe_max_tablet_compaction_score{job="$job_name",instance="$fe_leader"} > 100
Tablet Version Count:
starrocks_be_max_tablet_rowset_num{job="$job_name"} > 700
Query Failure Rate:
sum by (job,instance)(starrocks_fe_query_err_rate{job="$job_name"}) * 100 > 10
QPS / Connection Spike:
abs((sum by (exported_job)(rate(starrocks_fe_query_total{process="FE",job="$job_name"}[3m])) - sum by (exported_job)(rate(starrocks_fe_query_total{process="FE",job="$job_name"}[3m] offset 1m))) / sum by (exported_job)(rate(starrocks_fe_query_total{process="FE",job="$job_name"}[3m]))) * 100 > 100
User Connection Limit:
sum(starrocks_fe_connection_total{job="$job_name"}) by (user) > 90
Query Latency P95:
starrocks_fe_query_latency_ms{job="$job_name",quantile="0.95"} > 5000
Routine Load Lag:
sum by (job_name)(starrocks_fe_routine_load_max_lag_of_partition{job="$job_name",instance="$fe_master"}) > 300000
Write Failure Rate:
rate(starrocks_fe_txn_failed{job="$job_name",instance="$fe_master"}[5m]) * 100 > 5
Running Transactions per DB:
sum(starrocks_fe_txn_running{job="$job_name"}) by (db) > 900
Materialized View Refresh Failures:
increase(starrocks_fe_mv_refresh_total_failed_jobs[5m]) > 0
Schema Change Failures:
increase(starrocks_be_engine_requests_total{job="$job",type="schema_change",status="failed"}[1m]) > 1
Alert Handling Procedures
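Before any manual triage, each expression should be wired into an alerting rule so the on-call is paged automatically. A minimal Prometheus rule-file sketch for the BE memory threshold (group and alert names are illustrative, and label matching between StarRocks and node-exporter metrics may need an on(instance) clause in your setup):

```yaml
groups:
  - name: starrocks-be          # illustrative group name
    rules:
      - alert: StarRocksBEMemoryHigh
        expr: starrocks_be_process_mem_bytes > 0.9 * node_memory_MemTotal_bytes
        for: 5m                 # require 5 minutes of sustained breach before firing
        labels:
          severity: critical
        annotations:
          summary: "BE {{ $labels.instance }} is using more than 90% of host memory"
```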
Identify the affected metric using Grafana panels or curl to query the /metrics endpoint of the relevant BE/FE node.
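For example, a single BE metric can be pulled from the BE HTTP port (8040 by default) and converted to GiB. The snippet below runs against a captured sample instead of a live endpoint so it is reproducible:

```shell
# A captured sample of `curl -s http://<be_host>:8040/metrics` (live output has many more lines)
cat > /tmp/be_metrics.txt <<'EOF'
# TYPE starrocks_be_process_mem_bytes gauge
starrocks_be_process_mem_bytes 17179869184
EOF

# Extract the gauge and print it in GiB
awk '/^starrocks_be_process_mem_bytes/ { printf "%.0f GiB\n", $2 / 1024 / 1024 / 1024 }' /tmp/be_metrics.txt
# prints "16 GiB"
```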
For CPU or memory overload, check show proc '/current_queries' on the FE to locate heavy queries, then use KILL query_id or adjust pipeline_dop and thread pool sizes (thrift_server_max_worker_threads, number_tablet_writer_threads, flush_thread_num_per_store).
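In SQL, that triage might look like the following sketch (the query ID is a placeholder, and exact KILL syntax can vary by StarRocks version):

```sql
-- List currently running queries with their IDs and resource usage
SHOW PROC '/current_queries';

-- Terminate the offending query (placeholder ID)
KILL QUERY '<query_id>';

-- Lower per-query parallelism cluster-wide; 0 lets the engine choose automatically
SET GLOBAL pipeline_dop = 8;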
When disk I/O is high, use iotop, iostat, or du -h to find offending processes or files. Reduce concurrent imports or pause large batch jobs.
For BE service hangs, examine BE logs for messages such as "load tablets encounter failure" or "there is failure when scan rockdb tablet metas". If necessary, set ignore_load_tablet_failure=true in be.conf and restart the BE.
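If the failure messages point at corrupted tablets blocking startup, the switch mentioned above goes into be.conf. Use it with care: the failed tablets are skipped on this BE and must be repaired from healthy replicas afterward.

```
# be.conf
# Allow the BE to finish starting even when some tablets fail to load
ignore_load_tablet_failure = true
```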
FE service issues (e.g., clock drift, BDB metadata shortage) require synchronizing system time, ensuring at least 5 GB free on the metadata disk, or adding metadata_journal_skip_bad_journal_ids to skip corrupt journals.
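The journal-skip option mentioned above is an fe.conf setting; the IDs to skip come from the FE failure logs (values below are placeholders):

```
# fe.conf
# Comma-separated BDB JE journal IDs to skip on startup (placeholders)
metadata_journal_skip_bad_journal_ids = <bad_journal_id1>,<bad_journal_id2>
```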
Compaction failures: locate failing tablets via BE logs (grep -E 'compaction' be.INFO), mark the replica as BAD with ADMIN SET REPLICA STATUS PROPERTIES("tablet_id"="$tablet_id","backend_id"="$backend_id","status"="bad"), or increase tablet_max_versions if the version count is high.
High query latency or error rate: identify problematic queries via show proc '/current_queries' or the audit tables (SELECT stmt FROM starrocks_audit_db__.starrocks_audit_tbl__ WHERE state='ERR'). Consider adding the query to the SQL blacklist or increasing the FE JVM heap.
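The blacklist is managed with ADD SQLBLACKLIST, which takes a regular expression matching the statements to reject (the pattern below is an example):

```sql
-- Reject full-table count scans matching this regex
ADD SQLBLACKLIST "select count\\(\\*\\) from .+";

-- Inspect current entries and remove one by its index
SHOW SQLBLACKLIST;
DELETE SQLBLACKLIST 1;
```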
Routine Load lag: verify job state with show routine load from $db. If jobs are PAUSED, inspect ReasonOfStateChanged and TrackingSQL. Increase desired_concurrent_number or Kafka partitions to match workload.
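After diagnosing, a paused job can be tuned and resumed in place (database and job names below are placeholders; ALTER requires the job to be in the PAUSED state):

```sql
-- Inspect job state, lag, and the reason for any state change
SHOW ROUTINE LOAD FOR example_db.example_job;

-- Raise consumer parallelism (bounded by BE count and Kafka partition count)
ALTER ROUTINE LOAD FOR example_db.example_job
PROPERTIES ("desired_concurrent_number" = "5");

-- Resume the paused job
RESUME ROUTINE LOAD FOR example_db.example_job;
```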
Materialized view refresh failures: query information_schema.materialized_views for inactive views, manually refresh with REFRESH MATERIALIZED VIEW $mv_name, or set the view ACTIVE.
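A sketch of that check-and-repair flow (the view name is a placeholder; verify the information_schema column names against your StarRocks version):

```sql
-- Find views that are no longer active, and why
SELECT TABLE_NAME, IS_ACTIVE, INACTIVE_REASON
FROM information_schema.materialized_views
WHERE IS_ACTIVE = 'false';

-- Reactivate and refresh (placeholder name)
ALTER MATERIALIZED VIEW example_mv ACTIVE;
REFRESH MATERIALIZED VIEW example_mv;
```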
Schema change memory limits: adjust memory_limitation_per_thread_for_schema_change in BE configuration (default 2 GB) and restart the BE.
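That limit also lives in be.conf; the unit is gigabytes (the value below is an example):

```
# be.conf
# Per-thread memory ceiling for schema-change jobs, in GB (default 2)
memory_limitation_per_thread_for_schema_change = 8
```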
Emergency Actions
Immediately reduce incoming traffic and restart overloaded FE or BE nodes.
For persistent resource saturation, scale out the cluster by adding BE nodes or increasing disk capacity.
When alerts indicate critical failures (e.g., FE node count below quorum, BE dead nodes > 1), isolate the affected node from load balancers before restarting.
References
StarRocks Monitoring Documentation: https://docs.starrocks.io/zh/docs/administration/management/monitoring/Monitor_and_Alert/
Audit Loader Plugin: https://docs.starrocks.io/zh/docs/administration/management/audit_loader/
FlameGraph Tool for profiling: https://github.com/brendangregg/FlameGraph
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.