How Alibaba Uses Flink to Power Massive Real‑Time Risk Control
This article explains how Alibaba uses Flink to handle a peak of 4 billion events per second across all business units, covering risk‑control concepts, rule types, architectural stages, resource tuning, dynamic CEP, shared computing, and the FY23 roadmap for large‑scale streaming risk management.
01 Build Risk Control System with Flink
Flink currently serves all business units of the group. During Double‑11 it reached a peak capacity of 4 billion events per second, running more than 30 000 jobs on over 1 million CPU cores. It underpins the data mid‑platform, AI mid‑platform, risk mid‑platform, real‑time O&M, and recommendation services.
Risk‑control basics
Risk control is divided into a 3 × 2 matrix: the "2" dimension distinguishes rule‑based versus algorithm/model‑based approaches, while the "3" dimension covers pre‑risk, in‑risk, and post‑risk.
Three risk‑control business types
Pre‑risk is synchronous: trained models or pre‑computed data are stored in Redis, MongoDB, etc., and a rule engine such as Sidden, Groovy, or Drools fetches that data to return a decision. Latency is typically around 200 ms, which suits a synchronous RPC/HTTP call. In‑risk and post‑risk are asynchronous. Flink serves these asynchronous risk requests with latencies of 1–2 seconds, enabling machine‑assisted decisions.
Why Flink is the best choice for rule‑based risk
Event‑driven processing
Millisecond‑level latency
Unified stream‑batch capabilities
Three elements of rule‑based risk
Facts : raw risk events from business or logs, serving as input.
Rules : business‑defined conditions that must be satisfied.
Thresholds : severity levels associated with rules.
Flink rule expression enhancements
Flink supports stateless and stateful rules. Stateless rules perform simple ETL or vectorization for downstream TensorFlow models. Stateful rules are the core of Flink risk control and include:
Statistical rules : e.g., more than 100 visits within 5 minutes triggers risk.
Sequence rules : detect specific event sequences such as click → add‑to‑cart → delete, which may indicate malicious behavior.
Hybrid rules : combination of statistical and sequence logic.
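The two stateful rule types above can be sketched in plain Python. This is not Flink or FlinkCEP code: the class names, the per‑key in‑memory state, and the assumption that events arrive in timestamp order are all illustrative choices. It only shows what state each kind of rule has to keep.

```python
from collections import defaultdict, deque


class StatisticalRule:
    """Fires when a key has more than `limit` events inside a sliding window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = defaultdict(deque)  # per-key state: recent event times

    def on_event(self, key, ts):
        q = self.timestamps[key]
        q.append(ts)
        # Drop events that have fallen out of the window (assumes in-order input).
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q) > self.limit


class SequenceRule:
    """Fires when the most recent events of a key equal `pattern` exactly."""

    def __init__(self, pattern):
        self.pattern = tuple(pattern)
        # Per-key state: only the last len(pattern) actions are kept.
        self.recent = defaultdict(lambda: deque(maxlen=len(self.pattern)))

    def on_event(self, key, action):
        buf = self.recent[key]
        buf.append(action)
        return tuple(buf) == self.pattern


# More than 100 visits within 5 minutes (300 s): 101 visits, one per second.
visits = StatisticalRule(limit=100, window_seconds=300)
flags = [visits.on_event("user-1", ts) for ts in range(101)]

# click -> add-to-cart -> delete
cart = SequenceRule(["click", "add_to_cart", "delete"])
matches = [cart.on_event("user-1", a) for a in ["click", "add_to_cart", "delete"]]
```

A hybrid rule would combine the two, for example by feeding the matches of a `SequenceRule` into a `StatisticalRule` as its events.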
02 Alibaba Risk Control Practice
The practice is organized into three modules: perception, disposition, and insight.
Perception : detect anomalies and early signals, such as unusual data distributions or sudden spikes in product clicks.
Disposition : execute rules across hour‑level, real‑time, and offline layers, improving accuracy by correlating recent user behavior.
Insight : uncover hidden risk patterns that cannot be expressed by static rules, using high‑dimensional feature projection and time‑aware analysis.
Stage 1: SQL Real‑time Correlation & Statistics
Simple SQL aggregates (e.g., SUM(amount) > 50) are used for real‑time thresholds. Each distinct threshold requires a separate Flink job, leading to resource waste.
Stage 2: Broadcast Stream
BroadcastStream allows rule updates to be distributed dynamically without restarting jobs. New thresholds are broadcast to all parallel operator instances, enabling on‑the‑fly adjustments (e.g., raising a limit of 10 visits per minute to 20 during a promotion).
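The broadcast idea can be illustrated with a small simulation in plain Python rather than the Flink API. `RuleOperator` and `BroadcastJob` are hypothetical names, window expiry is left out, and CRC32 stands in for keyed routing. The point is that events go to one instance each, while a rule update reaches every instance.

```python
import zlib
from collections import defaultdict


class RuleOperator:
    """One parallel instance holding a local copy of the broadcast threshold."""

    def __init__(self, limit):
        self.limit = limit
        self.visits = defaultdict(int)  # visits per user (window expiry omitted)

    def on_rule_update(self, new_limit):
        self.limit = new_limit  # applied in place, no restart

    def on_event(self, user):
        self.visits[user] += 1
        return self.visits[user] > self.limit


class BroadcastJob:
    def __init__(self, parallelism, limit):
        self.instances = [RuleOperator(limit) for _ in range(parallelism)]

    def broadcast(self, new_limit):
        # Rule stream: every instance receives the same update.
        for op in self.instances:
            op.on_rule_update(new_limit)

    def process(self, user):
        # Event stream: keyed routing, a user always reaches the same instance.
        index = zlib.crc32(user.encode("utf-8")) % len(self.instances)
        return self.instances[index].on_event(user)


job = BroadcastJob(parallelism=4, limit=10)
before = [job.process("user-1") for _ in range(11)]  # the 11th visit exceeds 10
job.broadcast(20)                                     # promotion starts
after = job.process("user-1")                         # 12th visit, limit is now 20
```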
Stage 3: Dynamic CEP
Dynamic CEP decouples rule storage (RDS, Hologres) from execution, allowing rules to be updated or replaced at runtime. This also exposes rule authoring to business users via a rule‑center.
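A minimal sketch of that decoupling, in plain Python and under stated assumptions: `RuleStore` is an in‑memory stand‑in for the external rule table, the version counter is a hypothetical change‑detection mechanism, and patterns are matched as contiguous runs on a single unkeyed stream. It does not reflect how Alibaba's Dynamic CEP is implemented internally.

```python
class RuleStore:
    """Stand-in for an external rule table (the article names RDS and Hologres)."""

    def __init__(self):
        self.version = 0
        self.rules = {}

    def upsert(self, rule_id, pattern):
        self.rules[rule_id] = list(pattern)
        self.version += 1  # lets the engine detect that something changed


class DynamicEngine:
    """Matches events against whatever rules the store currently holds."""

    def __init__(self, store):
        self.store = store
        self.loaded_version = -1
        self.rules = {}
        self.history = []  # unbounded in this sketch

    def refresh(self):
        # Reload only when the store has changed; the engine keeps running.
        if self.store.version != self.loaded_version:
            self.rules = {rid: list(p) for rid, p in self.store.rules.items()}
            self.loaded_version = self.store.version

    def on_event(self, action):
        self.refresh()
        self.history.append(action)
        return sorted(
            rid for rid, p in self.rules.items() if self.history[-len(p):] == p
        )


store = RuleStore()
engine = DynamicEngine(store)
store.upsert("r1", ["click", "delete"])
first = engine.on_event("click")
second = engine.on_event("delete")
store.upsert("r2", ["delete", "pay"])  # added while the engine is running
third = engine.on_event("pay")
```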
Stage 4: Shared Computing
Shared Computing lets multiple business domains share a single Flink CEP job, separating the engine, platform, and business layers and reducing duplicated resource consumption.
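As a rough illustration, the sketch below evaluates rules from several business domains in a single pass over one event stream. It is plain Python with stateless predicates. The `SharedJob` class and the domain and rule names are made up for the example.

```python
from collections import defaultdict


class SharedJob:
    """One job that reads each event once and serves rules from many domains."""

    def __init__(self):
        self.rules = []       # (domain, rule_id, predicate)
        self.events_read = 0

    def register(self, domain, rule_id, predicate):
        self.rules.append((domain, rule_id, predicate))

    def process(self, event):
        self.events_read += 1  # the event is read once, whatever the rule count
        alerts = defaultdict(list)
        for domain, rule_id, predicate in self.rules:
            if predicate(event):
                alerts[domain].append(rule_id)
        return dict(alerts)


shared = SharedJob()
shared.register("payments", "large_amount", lambda e: e["amount"] > 50)
shared.register("ads", "burst_click", lambda e: e["type"] == "click" and e["qps"] > 100)
alerts = shared.process({"type": "click", "amount": 80, "qps": 500})
```

With one job per domain, the same event would be read and deserialized once per job. Here the read cost stays constant as domains are added.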
Stage 5: Separation of Business Development and Platform Construction
Business teams define a DSL for risk rules, which the platform submits via an Open API, enabling a unified middle‑platform for ~100 BUs.
03 Large‑Scale Risk Control Technical Challenges
3.1 Fine‑grained Resource Adjustment
Fine‑grained slot resources allow source and CEP operators to be scaled independently. For example, a job may run its source at a parallelism of 2 and its CEP operator at a parallelism of 2000, so that neither is over‑provisioned. Heterogeneous TaskManagers (TMs) give CEP nodes more CPU and memory than source nodes, reducing cost by ~20% compared to homogeneous deployments.
3.2 Stream‑Batch Unity & Adaptive Batch Scheduler
Unified execution ensures CEP rules produce identical results in batch mode, avoiding separate batch implementations. An adaptive batch scheduler dynamically adjusts parallelism based on CEP output volume, providing elastic batch processing.
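The core of volume‑based parallelism selection can be written as one small function. This is a simplified sketch: the function and parameter names are hypothetical, and a production scheduler applies further constraints that are not modeled here.

```python
import math


def decide_parallelism(input_bytes, bytes_per_task, min_parallelism, max_parallelism):
    """Pick a downstream parallelism from the volume the upstream stage produced."""
    wanted = math.ceil(input_bytes / bytes_per_task)
    # Clamp to the configured bounds.
    return max(min_parallelism, min(max_parallelism, wanted))


GIB = 1024 ** 3
small = decide_parallelism(10 * GIB, 1 * GIB, 1, 128)    # 10 tasks
empty = decide_parallelism(0, 1 * GIB, 1, 128)           # falls back to the minimum
large = decide_parallelism(5000 * GIB, 1 * GIB, 1, 128)  # capped at the maximum
```

Because the decision is made after the upstream stage finishes, a day with little CEP output uses few tasks and a promotion day scales up without manual tuning.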
3.3 Merged Reads to Reduce Common‑Layer Pressure
Instead of each DWS table reading the same DWD source, a merged source reads the DWD once and materializes multiple DWS tables downstream, dramatically lowering load on the shared data layer.
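The difference in load on the shared layer can be shown with a small in‑memory model. `DwdSource` is a hypothetical stand‑in that counts how many rows it serves, and the two DWS "tables" are simple per‑key sums. None of this is the platform's actual code.

```python
from collections import defaultdict


class DwdSource:
    """In-memory stand-in for a shared DWD table that counts rows served."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_served = 0

    def scan(self):
        for row in self.rows:
            self.rows_served += 1
            yield row


def aggregate(rows, key_field):
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_field]] += row["amount"]
    return dict(totals)


def build_separately(source, key_fields):
    # One full scan of the DWD table per DWS table.
    return {f: aggregate(source.scan(), f) for f in key_fields}


def build_merged(source, key_fields):
    # A single scan feeds every DWS aggregation.
    tables = {f: defaultdict(float) for f in key_fields}
    for row in source.scan():
        for f in key_fields:
            tables[f][row[f]] += row["amount"]
    return {f: dict(t) for f, t in tables.items()}


rows = [
    {"user": "u1", "shop": "s1", "amount": 10.0},
    {"user": "u2", "shop": "s1", "amount": 5.0},
    {"user": "u1", "shop": "s2", "amount": 7.0},
]
separate_src, merged_src = DwdSource(rows), DwdSource(rows)
separate = build_separately(separate_src, ["user", "shop"])
merged = build_merged(merged_src, ["user", "shop"])
```

Both variants produce the same tables, but the separate build reads each DWD row once per DWS table while the merged build reads it once in total.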
3.4 KV‑Separated State Backend
State storage (e.g., RocksDB) is split into separate key and value components. Alibaba Cloud Flink uses the Gemini backend, which delivers at least a 100% performance improvement for CEP workloads.
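The general idea of key‑value separation can be sketched as a small index that points into an append‑only value log. This is a generic illustration in plain Python, not a description of Gemini's internals, which the article does not detail. The class and method names are hypothetical.

```python
class KVSeparatedStore:
    """Keys live in a compact index; values live in an append-only log."""

    def __init__(self):
        self.value_log = bytearray()
        self.index = {}  # key -> (offset, length) into value_log

    def put(self, key, value):
        offset = len(self.value_log)
        self.value_log += value  # the value is written once, sequentially
        self.index[key] = (offset, len(value))  # only the small pointer moves

    def get(self, key):
        location = self.index.get(key)
        if location is None:
            return None
        offset, length = location
        return bytes(self.value_log[offset:offset + length])

    def live_bytes(self):
        # Bytes still referenced by the index; the rest is reclaimable garbage.
        return sum(length for _, length in self.index.values())


kv = KVSeparatedStore()
kv.put("user-1", b"x" * 1000)  # a large value, e.g. a long event buffer
kv.put("user-1", b"yy")        # overwrite: the index entry is replaced
latest = kv.get("user-1")
```

Keeping large values out of the key structure means that reorganizing keys does not require rewriting the values, which is the usual motivation for this layout.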
3.5 Dimensional Data Partition Loading
Large dimension tables (TB‑scale) are partitioned via Shuffle, ensuring each dimension node loads only its relevant partition, making massive historical‑behavior joins feasible.
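A sketch of the partitioning scheme, assuming events and dimension rows are routed by the same hash of the join key. CRC32 and the names `partition_of`, `DimJoinSubtask`, and `run_join` are illustrative choices, and the dimension table is a small in‑memory dict rather than a TB‑scale store.

```python
import zlib


def partition_of(key, parallelism):
    """Deterministic partition for a join key (stand-in for the shuffle hash)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


class DimJoinSubtask:
    def __init__(self, index, parallelism, dim_rows):
        self.index = index
        # Load only the slice of the dimension table that this subtask owns.
        self.cache = {
            k: v for k, v in dim_rows.items()
            if partition_of(k, parallelism) == index
        }

    def join(self, event):
        return {**event, "profile": self.cache.get(event["user"])}


def run_join(events, dim_rows, parallelism):
    subtasks = [DimJoinSubtask(i, parallelism, dim_rows) for i in range(parallelism)]
    # Shuffle: each event goes to the subtask that holds its dimension rows.
    joined = [subtasks[partition_of(e["user"], parallelism)].join(e) for e in events]
    return joined, subtasks


dim_rows = {f"user-{i}": {"level": i % 3} for i in range(20)}
events = [{"user": f"user-{i}", "action": "click"} for i in range(20)]
joined, subtasks = run_join(events, dim_rows, parallelism=4)
```

Each dimension row is held by exactly one subtask, so the memory needed per node shrinks as parallelism grows instead of every node caching the full table.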
04 Alibaba Cloud Flink FY23 Evolution Plan
The FY23 roadmap focuses on four enhancements:
Expression power
Observability
Execution capability
Performance
ITPUB
