
How We Unified Real‑Time and Batch Features with StarRocks in Financial Risk Control

This article analyzes the challenges of building real‑time and batch risk‑control features, compares Lambda and Kappa architectures, evaluates storage‑unified and compute‑unified solutions, and details how StarRocks was selected, validated, and deployed to achieve high‑performance, low‑latency feature serving in a financial context.

Didi Tech

Background

Financial risk control relies on key features, such as the timestamp of a user's most recent credit submission, to assess and mitigate risk. Real‑time feature computation is a classic big‑data scenario that demands both low latency (seconds) and high concurrency (thousands of QPS).

Initial Architecture Choices

The team initially adopted industry‑standard Lambda and Kappa architectures: Kappa for features within a 7‑day window and Lambda for longer‑term data. In Lambda, raw data is ingested from a single source, then split into a real‑time stream (written to a fast store) and an offline batch path (archived to Hive). While this separates concerns, it introduces storage redundancy, duplicated development effort for stream and batch logic, and complex debugging.

Kappa simplifies Lambda by removing the batch layer and using a message queue as the sole storage. This reduces development overhead but is limited by queue capacity and slower reprocessing throughput, making it unsuitable for large‑scale back‑fills.
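The split‑logic problem can be made concrete with a toy version of the last‑credit‑submission feature: under Lambda, the speed layer and the batch layer each implement the same "max timestamp per user" logic and must be kept in sync by hand. A minimal sketch (the event shape and function names are hypothetical):

```python
from datetime import datetime

# Hypothetical event shape: (uid, credit_submission_timestamp)
Event = tuple[str, datetime]

def stream_update(state: dict[str, datetime], event: Event) -> None:
    """Speed layer: maintain the running max timestamp per uid as events arrive."""
    uid, ts = event
    if uid not in state or ts > state[uid]:
        state[uid] = ts

def batch_recompute(events: list[Event]) -> dict[str, datetime]:
    """Batch layer: recompute the same feature over the full archive.

    In a real Lambda deployment this logic lives in a separate system
    (e.g., Hive SQL), so the two implementations drift unless maintained
    together -- the duplication the article sets out to eliminate.
    """
    out: dict[str, datetime] = {}
    for uid, ts in events:
        out[uid] = max(out.get(uid, ts), ts)
    return out
```

Both paths must produce identical results for the feature to be trustworthy; unifying stream and batch removes the need to prove that equivalence per feature.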

Goal: Unified Stream‑Batch Processing

To eliminate the split‑logic problem, the team explored two unification directions:

Compute‑unified: a single engine (e.g., Flink) that can execute both streaming and batch jobs.

Storage‑unified: a data lake where both real‑time and offline data coexist, enabling a single query layer.

Hologres was examined for its real‑time write, update, and analytical capabilities, but its closed‑source nature and lack of internal adoption ruled it out.

Evaluating OLAP Engines

Two high‑performance analytical databases were compared: StarRocks and ClickHouse. Key criteria included learning curve, join support, QPS capacity, data‑change handling, SSB benchmark results, import speed, and operational overhead. The comparison highlighted StarRocks' MySQL compatibility, superior batch import speed, and better support for high‑concurrency point‑lookup scenarios, leading to its selection.

Validation Experiments

Validation focused on three aspects:

Data import: batch loading of a 20 GB, billion‑row credit table completed in 11 minutes.

Real‑time ingestion: streamed updates landed with ~1 second latency.

Query performance: through the company's query service, prefix‑index lookups averaged 20‑30 ms, secondary‑index queries 70‑90 ms, and non‑indexed scans 300‑400 ms. Concurrency testing on a 3‑FE + 5‑BE cluster reached 1500 QPS before the query service throttled; the observed production peak was 1109 QPS.
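A concurrency test like the one above can be sketched as a thread‑pool load generator that fires point lookups and reports achieved QPS and mean latency. The sketch below stubs out the actual StarRocks call (a real test would issue the lookup through a MySQL driver against the FE nodes); all names here are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def point_lookup(uid: str) -> dict:
    # Stand-in for a StarRocks prefix-index lookup over the MySQL protocol;
    # a real client call via any MySQL driver would go here.
    return {"uid": uid, "last_submit_ts": None}

def run_load_test(uids: list[str], workers: int = 50) -> dict:
    """Fire lookups concurrently; report achieved QPS and mean latency (ms)."""
    latencies: list[float] = []

    def timed(uid: str) -> None:
        t0 = time.perf_counter()
        point_lookup(uid)
        latencies.append(time.perf_counter() - t0)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(timed, uids))
    elapsed = time.perf_counter() - start
    return {
        "qps": len(uids) / elapsed,
        "avg_ms": 1000 * sum(latencies) / len(latencies),
    }
```

Ramping `workers` and the request count upward until latency degrades is how a throttling ceiling like the 1500 QPS figure is typically found.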

Key performance numbers (summarized):

Batch import time: 11 min
Real‑time latency: ~1 s
Prefix‑index latency: 20‑30 ms
Secondary‑index latency: 70‑90 ms
Full‑scan latency: 300‑400 ms
Peak QPS tested: 1500 (query service throttled beyond this point)
Production max QPS: 1109

Implementation Details

The final architecture adds two layers to the existing stack:

Batch layer: full load of historical data into StarRocks.

Stream layer: real‑time change data captured and written to the same StarRocks tables.

Mirror tables are created for business‑source tables, using a primary‑key model that supports fast delete‑and‑insert updates. Partitioning and bucketing are applied on high‑cardinality query keys (e.g., uid or bizId) to prune data efficiently. Only one bucketing column is used to avoid metadata bloat.
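The mirror‑table pattern can be sketched as StarRocks DDL. Everything below is illustrative, not the team's actual schema: table and column names are hypothetical, and the exact partition clause would depend on the deployment. The points carried over from the text are the primary‑key model (fast delete‑and‑insert updates), partitioning for pruning, and a single hash‑bucketing column on the high‑cardinality query key:

```python
def mirror_table_ddl(table: str, key_col: str = "uid") -> str:
    """Render an illustrative StarRocks primary-key mirror-table DDL string."""
    return f"""CREATE TABLE {table} (
    {key_col}      BIGINT NOT NULL,
    dt             DATE NOT NULL,
    biz_id         BIGINT,
    last_submit_ts DATETIME,
    update_time    DATETIME
)
PRIMARY KEY ({key_col}, dt)                  -- primary-key model: delete-and-insert updates
PARTITION BY RANGE (dt) ()                   -- date partitions for pruning (filled dynamically)
DISTRIBUTED BY HASH ({key_col}) BUCKETS 32"""  # one bucketing column to avoid metadata bloat
```

Bucketing on the lookup key (uid or bizId) means a point query touches a single tablet, which is what makes the prefix‑index latencies in the validation section possible.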

Service‑side changes expose StarRocks queries via the existing MySQL‑compatible client, making the new feature layer transparent to upstream applications.
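Because StarRocks speaks the MySQL protocol, a feature lookup is an ordinary parameterized SELECT that any MySQL driver can execute against the FE nodes. The sketch below only builds the statement (table and column names are hypothetical):

```python
# Executable by any MySQL driver's cursor.execute(sql, params) against
# StarRocks; no StarRocks-specific client code is needed.
FEATURE_SQL = (
    "SELECT uid, last_submit_ts "
    "FROM credit_mirror "
    "WHERE uid = %s"  # point lookup on the prefix-indexed bucketing key
)

def feature_query(uid: int) -> tuple[str, tuple]:
    """Return (sql, params) for a single-user feature lookup."""
    return FEATURE_SQL, (uid,)
```

This is why the new feature layer is transparent to upstream applications: they keep using the existing MySQL‑compatible client.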

Service Performance

After deployment, 34 real‑time features were exercised in a dry‑run, yielding the following latency percentiles:

99th‑percentile: 75 ms
95th‑percentile: 55 ms
Average: 39 ms
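Percentiles like these are derived from the raw per‑request latencies of the dry run; a minimal sketch of that summarization using only the standard library:

```python
import statistics

def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Compute average, p95, and p99 from per-request latencies in ms."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(samples_ms),
        "p95": qs[94],  # index 94 holds the 95th percentile
        "p99": qs[98],  # index 98 holds the 99th percentile
    }
```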

Update‑timeliness monitoring (polling select now() - max(update_time) on each table) showed that 36.79% of updates landed within 1 s, 92.53% within 2 s, and 99.96% within 5 s, with a maximum observed delay of 8 seconds.
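The bucketing behind those percentages is straightforward: sample the lag repeatedly and report the fraction of observations at or under each threshold. A sketch over lag samples in seconds (the thresholds mirror the ones reported above):

```python
def lag_distribution(lags_s: list[float], thresholds=(1, 2, 5)) -> dict:
    """Fraction of lag observations at or under each threshold, plus the max."""
    n = len(lags_s)
    dist = {f"<= {t}s": sum(1 for x in lags_s if x <= t) / n for t in thresholds}
    dist["max_s"] = max(lags_s)
    return dist
```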

Benefits Achieved

Development cycle reduced from an average of 5 days (including offline batch scheduling) to near‑real‑time configuration.

Feature accuracy improved by unifying logic across real‑time and batch pipelines.

Support for table‑join features, delete operations, and a rich set of SQL functions.

Lower learning curve thanks to MySQL‑compatible SQL.

Resource consumption only occurs during query execution, avoiding waste from unused feature definitions.

Future Plans

Upcoming work includes adding standby clusters for higher availability, exploring tiered feature management, and investigating alternatives for log‑oriented data sources where the current StarRocks‑based solution is insufficient.

Tags: big data, feature engineering, real-time analytics, StarRocks, data warehouse, financial risk
Written by Didi Tech, the official Didi technology account.