How Grab Supercharged Spark Observability 10× with StarRocks – Inside the Iris Architecture

Grab replaced its fragmented Grafana‑Superset stack with a StarRocks‑backed Iris platform, achieving over ten‑fold query speedups, 40% lower resource usage, and a unified real‑time and historical data store for Spark observability across its Southeast Asian super‑app ecosystem.


Grab operates a massive super‑app ecosystem spanning food delivery, rides, and digital finance across eight Southeast Asian countries. To improve the performance of its Spark monitoring platform, Iris, Grab migrated the core storage from a Telegraf‑InfluxDB‑Grafana (TIG) stack to StarRocks, consolidating real‑time and historical metrics.

Iris – Grab’s Spark Observability Platform

Iris collects job‑level metrics and metadata from Spark clusters, providing fine‑grained insight into resource usage, performance patterns, and query behavior. The original setup scattered data across Grafana (real‑time) and Superset (historical), requiring users to switch interfaces and offering limited access control for non‑technical users.

Challenges with the Legacy Architecture

Dispersed user experience and coarse‑grained access control: metrics lived in separate Grafana and Superset dashboards, making unified analysis difficult.

High operational overhead: the offline pipeline involved multiple hops and data transformations.

Data management complexity: synchronising real‑time InfluxDB data with offline data lakes and handling string‑type metadata proved cumbersome.

Architecture Redesign – From TIG to StarRocks

The new Iris architecture replaces InfluxDB with StarRocks as the unified storage layer and eliminates Telegraf by ingesting data directly from Kafka. Key components include:

StarRocks database for both real‑time and historical data, supporting complex queries.

Direct Kafka ingestion via StarRocks Routine Load, removing the Telegraf dependency.

Custom Iris UI (web app) that replaces Grafana dashboards, offering a consolidated and role‑based interface.

Superset integration retained for BI needs, now querying StarRocks directly.

Simplified offline processing: StarRocks periodically backs up data to S3, reducing the data‑lake pipeline.
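As a rough sketch of how the periodic S3 backup might be wired up — the repository name, bucket, region, and table names below are illustrative, not Grab's actual configuration:

```sql
-- Register an S3-backed repository (credentials and bucket are placeholders).
CREATE REPOSITORY iris_backup_repo
WITH BROKER
ON LOCATION "s3a://iris-backups/starrocks"
PROPERTIES (
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.endpoint"   = "s3.ap-southeast-1.amazonaws.com"
);

-- Snapshot selected tables into the repository on a schedule.
BACKUP SNAPSHOT iris.snapshot_20250101
TO iris_backup_repo
ON (job_runs, worker_metrics);
```

Because StarRocks writes the snapshot itself, a separate export pipeline into the data lake is no longer needed.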

Data Model and Ingestion

For each Spark cluster, Iris captures three groups of tables:

Cluster metadata (e.g., report date, platform, worker UUID, cluster ID, job ID).

Worker metrics (CPU cores, memory, heap usage, etc.).

Spark job metrics (read/write counts, bytes, task numbers, etc.).
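A minimal sketch of what one of these tables could look like — column names, types, and keys are illustrative, not Grab's actual schema:

```sql
-- Hypothetical worker-metrics table; a DUPLICATE KEY table suits
-- append-only metric streams where rows are never updated in place.
CREATE TABLE iris.worker_metrics (
    report_date   DATE          NOT NULL,
    cluster_id    VARCHAR(64)   NOT NULL,
    worker_uuid   VARCHAR(64)   NOT NULL,
    event_time    DATETIME      NOT NULL,
    cpu_cores     INT,
    memory_gb     DOUBLE,
    heap_used_gb  DOUBLE
)
DUPLICATE KEY(report_date, cluster_id, worker_uuid)
PARTITION BY date_trunc('day', report_date)
DISTRIBUTED BY HASH(cluster_id);
```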

Data is streamed from Kafka into StarRocks using the Routine Load feature, which continuously reads from specified topics, parses JSON payloads, and monitors load status via built‑in queries.
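A Routine Load job of this shape might look like the following — the topic name, broker address, and column mapping are assumptions for illustration:

```sql
-- Hypothetical Routine Load job; StarRocks consumes the Kafka topic
-- continuously and maps JSON fields onto the target table's columns.
CREATE ROUTINE LOAD iris.load_worker_metrics ON worker_metrics
COLUMNS (report_date, cluster_id, worker_uuid, event_time,
         cpu_cores, memory_gb, heap_used_gb)
PROPERTIES (
    "format" = "json",
    "desired_concurrent_number" = "3"
)
FROM KAFKA (
    "kafka_broker_list" = "<broker>:9092",
    "kafka_topic" = "iris-worker-metrics",
    "property.kafka_default_offsets" = "OFFSET_END"
);

-- Inspect consumption lag and task state for the running job.
SHOW ROUTINE LOAD FOR iris.load_worker_metrics;
```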

Unified Real‑Time & Historical Data Processing

Real‑time ingestion: Routine Load brings data from Kafka into Iris tables within seconds, ensuring up‑to‑date monitoring.

Historical storage: StarRocks retains metadata and metrics for over 30 days (TTL), enabling fast ad‑hoc analysis without the latency of data‑lake queries.

Materialized views: Pre‑aggregated views accelerate UI queries by avoiding costly joins across large tables.

Compared with the InfluxDB‑based system, the new setup delivers dramatically faster query response times and simplifies the data pipeline.

Query Performance & Optimization

StarRocks supports both synchronous (SYNC) and asynchronous (ASYNC) materialized views. Grab primarily uses ASYNC views because they allow multi‑table joins, essential for summarising job runs. Partition TTL (typically 33 days) and selective partition refresh keep storage costs low while preserving performance.
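One way to express a 33‑day retention window, assuming daily range partitions (the values are illustrative; expression‑partitioned tables in newer StarRocks versions can use `partition_live_number` instead):

```sql
-- Keep roughly 33 days of daily partitions; StarRocks creates
-- upcoming partitions and drops expired ones automatically.
ALTER TABLE iris.job_runs SET (
    "dynamic_partition.enable"    = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start"     = "-33",
    "dynamic_partition.end"       = "3",
    "dynamic_partition.prefix"    = "p"
);
```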

CREATE MATERIALIZED VIEW job_run_summaries_001
    REFRESH ASYNC EVERY(INTERVAL 1 DAY)
    AS
    SELECT platform,
           job_id,
           COUNT(DISTINCT run_id)                     AS count_run,
           CEIL(PERCENTILE_APPROX(total_instances, 0.95)) AS p95_total_instances,
           CEIL(PERCENTILE_APPROX(worker_instances, 0.95)) AS p95_worker_instances,
           PERCENTILE_APPROX(job_hour, 0.95)          AS p95_job_hour,
           PERCENTILE_APPROX(machine_hour, 0.95)      AS p95_machine_hour,
           PERCENTILE_APPROX(cpu_hour, 0.95)          AS p95_cpu_hour,
           PERCENTILE_APPROX(worker_gc_hour, 0.95)    AS p95_worker_gc_hour,
           CEIL(PERCENTILE_APPROX(driver_cpus, 0.95)) AS p95_driver_cpus,
           CEIL(PERCENTILE_APPROX(worker_cpus, 0.95))  AS p95_worker_cpus,
           CEIL(PERCENTILE_APPROX(driver_memory_gb, 0.95)) AS p95_driver_memory_gb,
           CEIL(PERCENTILE_APPROX(worker_memory_gb, 0.95)) AS p95_worker_memory_gb,
           PERCENTILE_APPROX(driver_cpu_utilization, 0.95) AS p95_driver_cpu_utilization,
           PERCENTILE_APPROX(worker_cpu_utilization, 0.95) AS p95_worker_cpu_utilization,
           PERCENTILE_APPROX(driver_memory_utilization, 0.95) AS p95_driver_memory_utilization,
           PERCENTILE_APPROX(worker_memory_utilization, 0.95) AS p95_worker_memory_utilization,
           PERCENTILE_APPROX(total_gb_read, 0.95)      AS p95_gb_read,
           PERCENTILE_APPROX(total_gb_written, 0.95)   AS p95_gb_written,
           PERCENTILE_APPROX(total_memory_gb_spilled, 0.95) AS p95_memory_gb_spilled,
           PERCENTILE_APPROX(disk_spilled_rate, 0.95) AS p95_disk_spilled_rate
    FROM iris.job_runs
    WHERE report_date >= CURRENT_DATE - INTERVAL 30 DAY
    GROUP BY platform, job_id;

Frontend Integration

The backend is built in Go, handling authentication, permission checks, and queries to StarRocks. The Iris UI, written in a custom web stack, presents task lists, status, and metadata, allowing users to quickly assess Spark job health and resource utilisation.

Advanced Analytics & Recommendations

Aggregated materialized views enable historical run analysis (e.g., 30‑day job summaries). Based on trend analysis, a recommendation API suggests optimisations such as scaling down under‑utilised clusters. These suggestions are surfaced directly in the Iris UI and integrated into Grab’s SpellVault Slackbot for instant access.
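A scale‑down recommendation of this kind can reduce to a query over the 30‑day summary view — the thresholds below are illustrative, not Grab's actual rules:

```sql
-- Flag jobs whose p95 worker CPU utilization stayed low across the
-- window: candidates for scaling down their clusters.
SELECT platform,
       job_id,
       p95_worker_cpus,
       p95_worker_cpu_utilization
FROM job_run_summaries_001
WHERE p95_worker_cpu_utilization < 0.30
  AND count_run >= 10           -- enough runs to trust the trend
ORDER BY p95_worker_cpus DESC;  -- largest potential savings first
```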

Migration & Rollout

The rollout proceeded in stages:

Migrate real‑time CPU/memory charts from Grafana to the new Iris UI.

Decommission Grafana dashboards after migration.

Retain Superset for specialised BI queries.

Lessons Learned & Future Plans

Unified storage in StarRocks dramatically improves query performance and simplifies architecture.

Materialized views provide fast, pre‑aggregated insights.

Dynamic partitioning automatically manages data lifecycle, keeping storage efficient.

Direct Kafka ingestion reduces latency and pipeline complexity.

The relational model of StarRocks enables richer queries compared with time‑series‑only InfluxDB.

Going forward, Grab aims to enhance the recommendation engine, deepen advanced analytics, broaden integration with internal tools, explore machine‑learning‑driven insights, and continue scaling the platform to handle growing data volumes.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: observability, StarRocks, Kafka, Data Platform, Materialized Views, Spark
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
