How to Build an Effective Monitoring and Alerting System for StarRocks Clusters
This guide explains how to design a comprehensive monitoring and alerting framework for StarRocks clusters, covering resource usage, service availability, and business continuity, with practical PromQL queries and troubleshooting steps.
Key Metrics and PromQL Queries
CPU Overload (BE):
(1 - (sum(rate(starrocks_be_cpu{mode="idle",job="$cluster_name",instance=~".*"}[5m])) by (job,instance)) / (sum(rate(starrocks_be_cpu{job="$cluster_name",host=~".*"}[5m])) by (job,instance))) * 100
Trigger when > 90%.
Memory Usage (BE):
starrocks_be_process_mem_bytes > 0.9 * node_memory_MemTotal_bytes
Disk I/O Utilization:
rate(node_disk_io_time_seconds_total{instance=~".*"}[5m]) * 100 > 90
Disk Capacity:
(sum(starrocks_be_disks_total_capacity{job="$job"}) by (host,path) - sum(starrocks_be_disks_avail_capacity{job="$job"}) by (host,path)) / sum(starrocks_be_disks_total_capacity{job="$job"}) by (host,path) * 100 > 90
Root Filesystem Free Space:
node_filesystem_free_bytes{mountpoint="/"} / 1024 / 1024 / 1024 < 5
FE Metadata Disk Free Space:
node_filesystem_free_bytes{mountpoint="${meta_path}"} / 1024 / 1024 / 1024 < 10
FE Service Availability:
count(up{group="fe",job="$job_name"}) < 3
BE Service Availability:
node_info{type="be_node_num",job="$job_name",state="dead"} > 1
FE JVM Heap Usage:
sum(jvm_heap_size_bytes{job="$job_name",type="used"}) * 100 / sum(jvm_heap_size_bytes{job="$job_name",type="max"}) > 90
Compaction Failures:
increase(starrocks_be_engine_requests_total{job="$job_name",status="failed",type="cumulative_compaction"}[1m]) > 3
(and similarly with type="base_compaction" for base compaction).
Compaction Pressure:
starrocks_fe_max_tablet_compaction_score{job="$job_name",instance="$fe_leader"} > 100
Tablet Version Count:
starrocks_be_max_tablet_rowset_num{job="$job_name"} > 700
Query Failure Rate:
sum by (job,instance)(starrocks_fe_query_err_rate{job="$job_name"}) * 100 > 10
QPS / Connection Spike:
abs((sum by (exported_job)(rate(starrocks_fe_query_total{process="FE",job="$job_name"}[3m])) - sum by (exported_job)(rate(starrocks_fe_query_total{process="FE",job="$job_name"}[3m] offset 1m))) / sum by (exported_job)(rate(starrocks_fe_query_total{process="FE",job="$job_name"}[3m]))) * 100 > 100
User Connection Limit:
sum(starrocks_fe_connection_total{job="$job_name"}) by (user) > 90
Query Latency P95:
starrocks_fe_query_latency_ms{job="$job_name",quantile="0.95"} > 5000
Routine Load Lag:
sum by (job_name)(starrocks_fe_routine_load_max_lag_of_partition{job="$job_name",instance="$fe_master"}) > 300000
Write Failure Rate:
rate(starrocks_fe_txn_failed{job="$job_name",instance="$fe_master"}[5m]) * 100 > 5
Running Transactions per DB:
sum(starrocks_fe_txn_running{job="$job_name"}) by (db) > 900
Materialized View Refresh Failures:
increase(starrocks_fe_mv_refresh_total_failed_jobs[5m]) > 0
Schema Change Failures:
increase(starrocks_be_engine_requests_total{job="$job",type="schema_change",status="failed"}[1m]) > 1
Alert Handling Procedures
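Before any manual triage, each expression should be wired into an alerting rule so the on-call is paged automatically. A minimal Prometheus rule-file sketch for the BE memory threshold (group and alert names are illustrative, and label matching between StarRocks and node-exporter metrics may need an on(instance) clause in your setup):

```yaml
groups:
  - name: starrocks-be          # illustrative group name
    rules:
      - alert: StarRocksBEMemoryHigh
        expr: starrocks_be_process_mem_bytes > 0.9 * node_memory_MemTotal_bytes
        for: 5m                 # require 5 minutes of sustained breach before firing
        labels:
          severity: critical
        annotations:
          summary: "BE {{ $labels.instance }} is using more than 90% of host memory"
```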
Identify the affected metric using Grafana panels or curl to query the /metrics endpoint of the relevant BE/FE node.
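For example, a single BE metric can be pulled from the BE HTTP port (8040 by default) and converted to GiB. The snippet below runs against a captured sample instead of a live endpoint so it is reproducible:

```shell
# A captured sample of `curl -s http://<be_host>:8040/metrics` (live output has many more lines)
cat > /tmp/be_metrics.txt <<'EOF'
# TYPE starrocks_be_process_mem_bytes gauge
starrocks_be_process_mem_bytes 17179869184
EOF

# Extract the gauge and print it in GiB
awk '/^starrocks_be_process_mem_bytes/ { printf "%.0f GiB\n", $2 / 1024 / 1024 / 1024 }' /tmp/be_metrics.txt
# prints "16 GiB"
```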
For CPU or memory overload, check show proc '/current_queries' on the FE to locate heavy queries, then use KILL query_id or adjust pipeline_dop and thread pool sizes (thrift_server_max_worker_threads, number_tablet_writer_threads, flush_thread_num_per_store).
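In SQL, that triage might look like the following sketch (the query ID is a placeholder, and exact KILL syntax can vary by StarRocks version):

```sql
-- List currently running queries with their IDs and resource usage
SHOW PROC '/current_queries';

-- Terminate the offending query (placeholder ID)
KILL QUERY '<query_id>';

-- Lower per-query parallelism cluster-wide; 0 lets the engine choose automatically
SET GLOBAL pipeline_dop = 8;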
When disk I/O is high, use iotop, iostat, or du -h to find offending processes or files. Reduce concurrent imports or pause large batch jobs.
For BE service hangs, examine BE logs for messages such as "load tablets encounter failure" or "there is failure when scan rockdb tablet metas". If necessary, set ignore_load_tablet_failure=true in be.conf and restart the BE.
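If the failure messages point at corrupted tablets blocking startup, the switch mentioned above goes into be.conf. Use it with care: the failed tablets are skipped on this BE and must be repaired from healthy replicas afterward.

```
# be.conf
# Allow the BE to finish starting even when some tablets fail to load
ignore_load_tablet_failure = true
```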
FE service issues (e.g., clock drift, BDB metadata shortage) require synchronizing system time, ensuring at least 5 GB free on the metadata disk, or adding metadata_journal_skip_bad_journal_ids to skip corrupt journals.
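The journal-skip option mentioned above is an fe.conf setting; the IDs to skip come from the FE failure logs (values below are placeholders):

```
# fe.conf
# Comma-separated BDB JE journal IDs to skip on startup (placeholders)
metadata_journal_skip_bad_journal_ids = <bad_journal_id1>,<bad_journal_id2>
```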
Compaction failures: locate failing tablets via BE logs (grep -E 'compaction' be.INFO), mark the replica as BAD with ADMIN SET REPLICA STATUS PROPERTIES("tablet_id"="$tablet_id","backend_id"="$backend_id","status"="bad"), or increase tablet_max_versions if the version count is high.
High query latency or error rate: identify problematic queries via show proc '/current_queries' or the audit tables (SELECT stmt FROM starrocks_audit_db__.starrocks_audit_tbl__ WHERE state='ERR'). Consider adding the query to the SQL blacklist or increasing the FE JVM heap.
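The blacklist is managed with ADD SQLBLACKLIST, which takes a regular expression matching the statements to reject (the pattern below is an example):

```sql
-- Reject full-table count scans matching this regex
ADD SQLBLACKLIST "select count\\(\\*\\) from .+";

-- Inspect current entries and remove one by its index
SHOW SQLBLACKLIST;
DELETE SQLBLACKLIST 1;
```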
Routine Load lag: verify job state with show routine load from $db. If jobs are PAUSED, inspect ReasonOfStateChanged and TrackingSQL. Increase desired_concurrent_number or Kafka partitions to match workload.
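After diagnosing, a paused job can be tuned and resumed in place (database and job names below are placeholders; ALTER requires the job to be in the PAUSED state):

```sql
-- Inspect job state, lag, and the reason for any state change
SHOW ROUTINE LOAD FOR example_db.example_job;

-- Raise consumer parallelism (bounded by BE count and Kafka partition count)
ALTER ROUTINE LOAD FOR example_db.example_job
PROPERTIES ("desired_concurrent_number" = "5");

-- Resume the paused job
RESUME ROUTINE LOAD FOR example_db.example_job;
```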
Materialized view refresh failures: query information_schema.materialized_views for inactive views, manually refresh with REFRESH MATERIALIZED VIEW $mv_name, or set the view ACTIVE.
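A sketch of that check-and-repair flow (the view name is a placeholder; verify the information_schema column names against your StarRocks version):

```sql
-- Find views that are no longer active, and why
SELECT TABLE_NAME, IS_ACTIVE, INACTIVE_REASON
FROM information_schema.materialized_views
WHERE IS_ACTIVE = 'false';

-- Reactivate and refresh (placeholder name)
ALTER MATERIALIZED VIEW example_mv ACTIVE;
REFRESH MATERIALIZED VIEW example_mv;
```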
Schema change memory limits: adjust memory_limitation_per_thread_for_schema_change in BE configuration (default 2 GB) and restart the BE.
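That limit also lives in be.conf; the unit is gigabytes (the value below is an example):

```
# be.conf
# Per-thread memory ceiling for schema-change jobs, in GB (default 2)
memory_limitation_per_thread_for_schema_change = 8
```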
Emergency Actions
Immediately reduce incoming traffic and restart overloaded FE or BE nodes.
For persistent resource saturation, scale out the cluster by adding BE nodes or increasing disk capacity.
When alerts indicate critical failures (e.g., FE node count below quorum, BE dead nodes > 1), isolate the affected node from load balancers before restarting.
References
StarRocks Monitoring Documentation: https://docs.starrocks.io/zh/docs/administration/management/monitoring/Monitor_and_Alert/
Audit Loader Plugin: https://docs.starrocks.io/zh/docs/administration/management/audit_loader/
FlameGraph Tool for profiling: https://github.com/brendangregg/FlameGraph
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.