Big Data 16 min read

How Alibaba Uses Flink to Power Massive Real‑Time Risk Control

This article explains how Alibaba leverages Flink to handle over 40 billion events per second across all business units, detailing risk‑control concepts, rule types, architectural stages, resource tuning, dynamic CEP, shared computing, and the FY23 roadmap for large‑scale streaming risk management.

ITPUB

Aug 13, 2022

How Alibaba Uses Flink to Power Massive Real‑Time Risk Control

01 Build Risk Control System with Flink

Flink currently serves all business units of the group, reaching a peak capacity of 4 billion events per second during Double‑11, handling more than 30 000 jobs and over 1 million CPU cores. It underpins data‑mid‑platform, AI‑mid‑platform, risk‑mid‑platform, real‑time O&M, and recommendation services.

Risk‑control basics

Risk control is divided into a 3 × 2 matrix: the "2" dimension distinguishes rule‑based versus algorithm/model‑based approaches, while the "3" dimension covers pre‑risk, in‑risk, and post‑risk.

Three risk‑control business types

Pre‑risk stores trained models or pre‑computed data in Redis, MongoDB, etc., and can use rule engines like Sidden, Groovy, Drools to fetch data. In‑risk and post‑risk are asynchronous, with typical latency around 200 ms, suitable for RPC/HTTP calls. Flink is used for asynchronous risk requests with latencies of 1–2 seconds, enabling machine‑assisted decisions.

Why Flink is the best choice for rule‑based risk

Event‑driven processing

Millisecond‑level latency

Unified stream‑batch capabilities

Three elements of rule‑based risk

Facts : raw risk events from business or logs, serving as input.

Rules : business‑defined conditions that must be satisfied.

Thresholds : severity levels associated with rules.

Flink rule expression enhancements

Flink supports stateless and stateful rules. Stateless rules perform simple ETL or vectorization for downstream TensorFlow models. Stateful rules are the core of Flink risk control and include:

Statistical rules : e.g., more than 100 visits within 5 minutes triggers risk.

Sequence rules : detect specific event sequences such as click → add‑to‑cart → delete, which may indicate malicious behavior.

Hybrid rules : combination of statistical and sequence logic.

02 Alibaba Risk Control Practice

The practice is organized into three modules: perception, disposition, and insight.

Perception : detect anomalies and early signals, such as unusual data distributions or sudden spikes in product clicks.

Disposition : execute rules across hour‑level, real‑time, and offline layers, improving accuracy by correlating recent user behavior.

Insight : uncover hidden risk patterns that cannot be expressed by static rules, using high‑dimensional feature projection and time‑aware analysis.

Stage 1: SQL Real‑time Correlation & Statistics

Simple SQL aggregates (e.g., SUM(amount) > 50) are used for real‑time thresholds. Each distinct threshold requires a separate Flink job, leading to resource waste.

Stage 2: Broadcast Stream

BroadcastStream allows dynamic distribution of rule updates without restarting jobs. New thresholds are broadcast to all operators, enabling on‑the‑fly adjustments (e.g., changing a 1‑minute 10‑visit limit to 20 during a promotion).

Stage 3: Dynamic CEP

Dynamic CEP decouples rule storage (RDS, Hologres) from execution, allowing rules to be updated or replaced at runtime. This also exposes rule authoring to business users via a rule‑center.

Stage 4: Shared Computing

Shared Computing lets multiple business domains share a single Flink CEP job, separating the engine, platform, and business layers and reducing duplicated resource consumption.

Stage 5: Separation of Business Development and Platform Construction

Business teams define a DSL for risk rules, which the platform submits via an Open API, enabling a unified middle‑platform for ~100 BUs.

03 Large‑Scale Risk Control Technical Challenges

3.1 Fine‑grained Resource Adjustment

Flink slots allow independent scaling of source and CEP operators. For example, a job may run 2 source parallelisms but 2000 CEP parallelisms, saving resources. Heterogeneous TaskManagers (TM) let CEP nodes have larger CPU/memory than source nodes, reducing cost by ~20% compared to homogeneous deployments.

3.2 Stream‑Batch Unity & Adaptive Batch Scheduler

Unified execution ensures CEP rules produce identical results in batch mode, avoiding separate batch implementations. An adaptive batch scheduler dynamically adjusts parallelism based on CEP output volume, providing elastic batch processing.

3.3 Merged Reads to Reduce Common‑Layer Pressure

Instead of each DWS table reading the same DWD source, a merged source reads the DWD once and materializes multiple DWS tables downstream, dramatically lowering load on the shared data layer.

3.4 KV‑Separated State Backend

State storage (e.g., RocksDB) is split into KV and value components. Alibaba Cloud Flink uses the Gemini backend, delivering at least a 100% performance boost for CEP workloads.

3.5 Dimensional Data Partition Loading

Large dimension tables (TB‑scale) are partitioned via Shuffle, ensuring each dimension node loads only its relevant partition, making massive historical‑behavior joins feasible.

04 Alibaba Cloud Flink FY23 Evolution Plan

The FY23 roadmap focuses on four enhancements:

Expression power

Observability

Execution capability

Performance

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Alibaba CEP Big Data Flink Streaming risk control

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.