How StarRocks Boosted Suixingfu’s Real‑Time Data Platform: 3× Faster Queries & 10× Faster Analytics

Suixingfu rebuilt its payment data pipeline by replacing a fragmented Lambda stack with a unified Porter CDC + StarRocks + Elasticsearch architecture, achieving three‑fold query speed, ten‑fold analytics efficiency, 20% storage reduction, and sub‑second data‑capture latency across high‑concurrency, ad‑hoc, and batch workloads.

StarRocks

Jul 1, 2025

Overview

Suixingfu rebuilt its data analytics platform to replace a heterogeneous Lambda architecture (traditional DB + Hive + Elasticsearch + Kudu + HBase) with a unified real‑time stack based on Porter CDC for change‑data capture, StarRocks as the analytical engine, and Elasticsearch for secondary indexing.

Business Challenges

Poor data timeliness: Offline and real‑time clusters were isolated, causing long pipelines and slow cross‑cluster queries.

High data redundancy: The same data was stored in multiple systems, inflating storage cost and synchronization overhead.

Complex and unstable architecture: Numerous heterogeneous components increased coupling, and the stack could not guarantee stability during traffic spikes.

Why StarRocks

Benchmarking against Doris, ClickHouse and other OLAP engines showed that StarRocks delivered the lowest query latency, the highest throughput and the richest feature set (vectorized execution, primary‑key model, multi‑table joins, cost‑based optimizer). These advantages made it the core engine for the new stack.

Real‑Time Architecture

The new pipeline eliminates the legacy “DB + Hive + Elasticsearch + Kudu + HBase” stack. Data changes are captured by Porter CDC, streamed through Flink, and ingested into StarRocks. Elasticsearch provides secondary indexes for multi‑condition filtering. The overall flow supports high‑concurrency point queries, ad‑hoc aggregations and heavy batch calculations.

Data Collection Enhancements

Dynamic field updates: Porter extracts only the changed columns from the binlog, reducing write pressure on the source DB. The Flink connectors were extended to support column‑level updates, and CDC logs are pre‑merged before they are written, cutting I/O by ~30%.

Configuration‑driven multi‑table sync: A single Flink task can synchronize dozens of tables. Real‑time rate limiting and task‑status monitoring are built into the job configuration.

Offline sync acceleration: StarRocks EXPORT is used to dump data to a data lake, running more than 10× faster than the equivalent Hive job. The INSERT INTO FILES command provides a unified export interface.

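-- Broker properties and credentials are omitted here; EXPORT writes CSV files.
EXPORT TABLE orders TO "s3a://datalake/orders/" WITH BROKER;
-- INSERT INTO FILES takes its target as key-value properties and can write Parquet.
INSERT INTO FILES ("path" = "s3://datalake/orders/", "format" = "parquet", "compression" = "snappy") SELECT * FROM orders;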
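
The column‑level updates described under "Dynamic field updates" can be approximated with the open‑source StarRocks Flink connector, which supports partial updates on primary‑key tables. The following is a minimal sketch, not Suixingfu's extended connector: the table, columns, host names, database and credentials are all hypothetical placeholders.

```sql
-- Hypothetical Flink SQL sink; every name, host and credential is a placeholder.
CREATE TABLE orders_status_sink (
    order_id BIGINT,
    status   STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://fe-host:9030',
    'load-url' = 'fe-host:8030',
    'database-name' = 'pay',
    'table-name' = 'orders',
    'username' = 'flink_user',
    'password' = 'change-me',
    -- Update only the columns declared above; other columns of the target row stay untouched.
    'sink.properties.partial_update' = 'true'
);
```

Some connector versions may also require an explicit column list through 'sink.properties.columns'; check the connector documentation for the release in use.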

Query Optimizations

Workloads are classified by query type, and each type is tuned separately.

High‑concurrency detail queries (e.g., historical refund, merchant lookup):

Primary‑key tables stored on SSD for low‑latency point lookups.

Asynchronous materialized views refresh on a schedule, delivering millisecond‑level responses on 30 TB tables (≈20× faster than full scans).

Elasticsearch secondary indexes complement StarRocks for multi‑condition filters.
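
The first two techniques above might look roughly like this in StarRocks DDL. This is an illustrative sketch only: the schema, names, bucket count and refresh interval are assumptions, and the SSD property assumes the backend nodes are configured with SSD storage paths.

```sql
-- Hypothetical schema; table names, columns and sizing are placeholders.
CREATE TABLE refund_detail (
    refund_id   BIGINT NOT NULL,
    merchant_id BIGINT NOT NULL,
    refund_time DATETIME NOT NULL,
    amount      DECIMAL(18, 2)
)
PRIMARY KEY (refund_id)
DISTRIBUTED BY HASH (refund_id) BUCKETS 32
PROPERTIES ("storage_medium" = "SSD");

-- Asynchronous materialized view refreshed on a fixed schedule.
CREATE MATERIALIZED VIEW refund_daily_by_merchant
DISTRIBUTED BY HASH (merchant_id)
REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)
AS
SELECT merchant_id,
       date_trunc('day', refund_time) AS refund_day,
       count(*)    AS refund_count,
       sum(amount) AS refund_amount
FROM refund_detail
GROUP BY merchant_id, date_trunc('day', refund_time);
```

Queries that match the view's shape can be answered from the precomputed result instead of scanning the base table, at the cost of results being as stale as the refresh interval.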

Ad‑hoc aggregation queries (e.g., risk‑control dashboards):

HDD‑based StarRocks clusters balance cost and latency.

Metrics are pre‑aggregated at write time to reduce runtime computation.

Colocate joins keep fact and dimension tables on the same shards, improving join speed by ~3×.

The cost‑based optimizer (CBO) cuts plan generation time by 80%.
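
As a sketch of how write‑time pre‑aggregation and colocate joins can be combined, the example below uses a StarRocks aggregate table and a colocation group. Aggregate tables are one feature that fits the description above; the source does not say which mechanism Suixingfu uses. All table names, columns and bucket counts are hypothetical.

```sql
-- Hypothetical tables; names, columns and bucket counts are placeholders.
-- Aggregate table: rows sharing the same key are rolled up as data is loaded.
CREATE TABLE txn_metrics (
    txn_date    DATE NOT NULL,
    merchant_id BIGINT NOT NULL,
    txn_count   BIGINT SUM DEFAULT "0",
    txn_amount  DECIMAL(18, 2) SUM DEFAULT "0"
)
AGGREGATE KEY (txn_date, merchant_id)
DISTRIBUTED BY HASH (merchant_id) BUCKETS 16
PROPERTIES ("colocate_with" = "merchant_group");

-- Dimension table in the same colocation group:
-- same bucketing column type and the same bucket count.
CREATE TABLE merchant_dim (
    merchant_id   BIGINT NOT NULL,
    merchant_name VARCHAR(128),
    region        VARCHAR(64)
)
DUPLICATE KEY (merchant_id)
DISTRIBUTED BY HASH (merchant_id) BUCKETS 16
PROPERTIES ("colocate_with" = "merchant_group");

-- A join on the bucketing column can then run locally on each node.
SELECT d.region, sum(m.txn_amount) AS total_amount
FROM txn_metrics AS m
JOIN merchant_dim AS d ON m.merchant_id = d.merchant_id
WHERE m.txn_date = '2025-06-30'
GROUP BY d.region;
```

Tables in one colocation group must agree on bucketing column types, bucket count and replica count; otherwise the join falls back to shuffling or broadcasting data across nodes.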

Batch processing (periodic reports, dimension table calculations):

Dynamic partition pruning and predicate push‑down reduce scanned data by 73%.

Multi‑tenant resource isolation and tuned compaction parameters improve stability under concurrent jobs.

Grafana dashboards consume StarRocks monitoring metrics; alerts fire within 5 seconds.
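
Multi‑tenant resource isolation of the kind listed above is typically expressed with StarRocks resource groups. The sketch below is an assumption about how such a group might be defined: the group name, user and limits are hypothetical, and parameter names vary across StarRocks versions, so check the documentation for your release.

```sql
-- Hypothetical resource group; the user name and limits are placeholders.
CREATE RESOURCE GROUP rg_batch_reports
TO (user = 'batch_user', query_type IN ('select'))
WITH (
    'cpu_core_limit' = '8',
    'mem_limit' = '30%',
    'concurrency_limit' = '10'
);
```

Queries from the matching user are classified into this group, which caps their share of CPU and memory so that batch jobs cannot starve interactive workloads.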

Key Benefits

Queries run about three times faster, with sub‑second responses over billions of detail rows.

Report generation time shortened by more than 3 hours, yielding a tenfold increase in analytics efficiency.

Data redundancy cut by ~20% after removing Kudu, HBase and ClickHouse.

End‑to‑end capture latency (P99) reduced to under 1 second, giving second‑level visibility into new data.

Future Plans

StarRocks’ federated query and external catalog features will be used to query heterogeneous stores (Hudi, Elasticsearch, etc.) without moving data. The goal is a “lake‑warehouse‑service” platform that abstracts underlying sources and delivers consistent, high‑performance analytics for all downstream applications.
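
A federated query of this kind might look like the following sketch, which registers a Hudi external catalog and joins it with a native table. The catalog name, metastore URI, databases and tables are hypothetical; only `default_catalog`, the built‑in name for StarRocks' internal catalog, is taken from the product.

```sql
-- Hypothetical catalog; the name and metastore URI are placeholders.
CREATE EXTERNAL CATALOG hudi_lake
PROPERTIES (
    "type" = "hudi",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Join lake data with a native StarRocks table without copying it.
SELECT d.region, count(*) AS archived_orders
FROM hudi_lake.archive_db.orders_history AS o
JOIN default_catalog.pay.merchant_dim AS d ON o.merchant_id = d.merchant_id
GROUP BY d.region;
```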
