
Real-Time Anti-Fraud Streaming System Based on Flink: Architecture, Challenges, and Optimizations

The article describes a Flink-based real-time anti-fraud streaming system built around a risk-control platform and configurable, YAML-driven pipelines. State handling is optimized with early event-time triggers, micro-batch caching, and coarse-grained key reduction. The system computes multi-dimensional features, supports rapid strategy updates and simulation filtering, and writes results to ClickHouse, Hive, and Redis for both instant monitoring and offline analysis.

Baidu Tech Salon

This article presents a comprehensive real-time anti‑fraud streaming system built on Apache Flink, designed to handle high‑traffic scenarios with complex feature computation, rapid strategy updates, simulation filtering, and multi‑warehouse integration.

It first classifies anti-fraud systems into online, real-time, and offline types, highlighting the trade-off between latency and data richness. Online anti-fraud responds within milliseconds, offline anti-fraud provides deep analysis at much higher latency, and real-time anti-fraud sits between the two.

The core challenges addressed include:

Complex multi‑dimensional feature calculation across various time windows (seconds, minutes, days) and dimensions (user, device, IP).

Data disorder caused by network latency, requiring robust out‑of‑order handling.

State‑backend pressure under high concurrency, especially when RocksDB is accessed for each event.

Frequent strategy updates and the configuration-management risks that come with them.

Need for simulation filtering to validate new strategies before production deployment.

Requirement for seamless data output to multiple warehouses (ClickHouse, Hive, Redis) for real‑time monitoring and offline analysis.
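The out-of-order challenge listed above is commonly handled with event-time watermarks that tolerate a bounded amount of disorder. The following is a plain-Python sketch of that idea, not Flink API and not necessarily the authors' implementation; `BoundedDelayWatermark`, `split_late`, and the 2-second tolerance are hypothetical names and values chosen for illustration.

```python
class BoundedDelayWatermark:
    """Tracks a watermark that trails the largest event time seen so far."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms   # assumed tolerance for disorder
        self.max_ts = None                 # largest event timestamp observed

    def current(self):
        """Return the watermark, or None before the first event."""
        if self.max_ts is None:
            return None
        return self.max_ts - self.max_delay_ms

    def observe(self, ts):
        """Advance the watermark; it never moves backwards."""
        if self.max_ts is None or ts > self.max_ts:
            self.max_ts = ts


def split_late(timestamps, max_delay_ms):
    """Separate events that arrive within the tolerance from those too late."""
    wm = BoundedDelayWatermark(max_delay_ms)
    on_time, late = [], []
    for ts in timestamps:
        current = wm.current()
        if current is not None and ts < current:
            late.append(ts)        # older than the watermark: side-output or drop
        else:
            on_time.append(ts)     # still inside the tolerated disorder
        wm.observe(ts)
    return on_time, late


# Arrival order differs from event-time order (timestamps in milliseconds).
on_time, late = split_late([1000, 3000, 2000, 9000, 3500], max_delay_ms=2000)
```

In this run the event at 2000 ms arrives after 3000 ms but is still accepted, while 3500 ms arrives after the watermark has already advanced to 7000 ms and is classified as late. A larger tolerance accepts more disorder at the cost of later final results.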

The system architecture consists of three main modules: a risk-control platform that distributes configurations; a Flink real-time job that parses those configurations and performs ETL, feature computation, and rule matching; and a storage layer (ClickHouse, Hive, Redis, message queues) that persists results for downstream consumption.
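The flow through the real-time job can be illustrated with a small plain-Python sketch. It is a toy model under stated assumptions, not the authors' Flink code: the `uid,device,ip,ts` input format, the function names, and the in-memory lists standing in for ClickHouse, Hive, and Redis are all hypothetical, and the counts are unwindowed for brevity.

```python
from collections import defaultdict


def etl(raw_line):
    """Parse a raw 'uid,device,ip,ts' line into a typed event (format is assumed)."""
    uid, device, ip, ts = raw_line.strip().split(",")
    return {"uid": uid, "device": device, "ip": ip, "ts": int(ts)}


def compute_features(event, counters):
    """Update and return running event counts per dimension (user, device, IP)."""
    features = {}
    for dim in ("uid", "device", "ip"):
        counters[(dim, event[dim])] += 1
        features[dim + "_count"] = counters[(dim, event[dim])]
    return features


def match_rules(features, rules):
    """Return the names of all rules whose threshold is exceeded."""
    return [r["name"] for r in rules if features[r["feature"]] > r["threshold"]]


def run_job(raw_lines, rules, sinks):
    """ETL -> feature computation -> rule matching -> fan-out to every sink."""
    counters = defaultdict(int)
    for line in raw_lines:
        event = etl(line)
        features = compute_features(event, counters)
        record = {**event, **features, "hits": match_rules(features, rules)}
        for sink in sinks.values():
            sink.append(record)


# Rules would come from the risk-control platform; this one is made up.
rules = [{"name": "ip_burst", "feature": "ip_count", "threshold": 2}]
sinks = {"clickhouse": [], "hive": [], "redis": []}
run_job(["u1,d1,10.0.0.1,1", "u2,d2,10.0.0.1,2", "u3,d3,10.0.0.1,3"], rules, sinks)
```

Three different users share one IP here, so only the third event pushes `ip_count` past the threshold and is tagged with `ip_burst`. Every sink receives the same enriched records.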

Key technical optimizations for windowed feature aggregation are detailed:

Early Trigger: Custom event‑time based triggers emit partial results before window closure, achieving second‑ or minute‑level latency.

Batch Updates & Memory Cache: Micro‑batch processing reduces RocksDB read/write frequency; an in‑memory cache serves most accesses, cutting state backend calls by ~90%.

Key Reduction (Coarse‑grained KeyBy): Modulo‑based partitioning (e.g., uid % 100) dramatically lowers the number of triggers and state entries, while preserving per‑UID accuracy.
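How the three optimizations might fit together can be sketched in plain Python. This is an illustration under assumptions, not the authors' Flink operator: the window length, early-fire interval, batch size, and the `BucketOperator` class are hypothetical, and a dict stands in for RocksDB. Per-UID accuracy is kept by storing a per-uid map inside each coarse bucket.

```python
from collections import defaultdict

NUM_BUCKETS = 100        # coarse key from the text: uid % 100
WINDOW_MS = 60_000       # assumed 1-minute tumbling window
EARLY_FIRE_MS = 10_000   # assumed early-trigger interval (event time)
BATCH_SIZE = 3           # assumed number of events buffered before a state write


class BucketOperator:
    """Handles every uid that maps to one coarse key (uid % NUM_BUCKETS)."""

    def __init__(self):
        self.state = {}          # stand-in for RocksDB: window -> {uid: count}
        self.state_writes = 0    # number of writes to the stand-in backend
        self.cache = defaultdict(lambda: defaultdict(int))  # unflushed deltas
        self.buffered = 0
        self.next_fire = {}      # window start -> next early-fire timestamp

    def on_event(self, uid, ts):
        window = ts - ts % WINDOW_MS
        self.cache[window][uid] += 1     # served from memory, no backend call
        self.next_fire.setdefault(window, window + EARLY_FIRE_MS)
        self.buffered += 1
        if self.buffered >= BATCH_SIZE:
            self.flush()

    def flush(self):
        """Merge cached deltas into state with one write per window."""
        for window, deltas in self.cache.items():
            counts = self.state.setdefault(window, {})
            for uid, delta in deltas.items():
                counts[uid] = counts.get(uid, 0) + delta
            self.state_writes += 1
        self.cache.clear()
        self.buffered = 0

    def on_watermark(self, watermark):
        """Return (window, uid, count, is_final) rows due at this watermark."""
        self.flush()
        out = []
        for window in sorted(self.next_fire):
            if watermark >= window + WINDOW_MS:
                # Window closed: emit final counts and release its state.
                counts = self.state.pop(window)
                out += [(window, u, c, True) for u, c in sorted(counts.items())]
                del self.next_fire[window]
            elif watermark >= self.next_fire[window]:
                # Early trigger: emit partial counts, keep the window open.
                counts = self.state[window]
                out += [(window, u, c, False) for u, c in sorted(counts.items())]
                steps = (watermark - self.next_fire[window]) // EARLY_FIRE_MS + 1
                self.next_fire[window] += steps * EARLY_FIRE_MS
        return out


operators = defaultdict(BucketOperator)
for uid, ts in [(101, 1_000), (201, 2_000), (101, 3_000), (101, 12_000)]:
    operators[uid % NUM_BUCKETS].on_event(uid, ts)

op = operators[1]                    # uids 101 and 201 share bucket 1
early = op.on_watermark(12_000)      # partial result, 48 s before window end
final = op.on_watermark(60_000)      # final result at window close
```

Both uids land in the same bucket, so one operator and one set of timers serve them, yet the emitted rows are still per uid. Four events cause two backend writes in this run, and the first result appears at the 12-second watermark instead of at the 60-second window end. Late events arriving after a window has closed are not handled in this sketch.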

The system is highly configuration‑driven. Engineering configurations (input, output, parallelism) and strategy configurations (field extraction, feature definitions, dictionary tables, model paths, rule thresholds) are expressed in YAML files, enabling rapid strategy iteration without code changes.
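A minimal sketch of what configuration-driven strategies can look like follows. The configuration is shown as a Python dict, as it might appear after loading a YAML file; every key, name, and threshold in it is hypothetical and not taken from the authors' schema. The point is that a strategy update changes only data, while `validate`, `build_rules`, and `evaluate` stay untouched.

```python
import copy
import operator

# Hypothetical configuration, as it might look after parsing a YAML file.
CONFIG = {
    "engineering": {"input": "events_topic", "output": ["clickhouse", "hive"],
                    "parallelism": 4},
    "strategy": {
        "fields": ["uid", "ip"],
        "features": [{"name": "ip_count_1m", "key": "ip", "window_s": 60}],
        "rules": [{"name": "ip_burst", "feature": "ip_count_1m",
                   "op": ">", "threshold": 100}],
    },
}

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def validate(config):
    """Return a list of problems so a bad config is rejected before deployment."""
    strategy = config["strategy"]
    errors = []
    fields = set(strategy["fields"])
    feature_names = {f["name"] for f in strategy["features"]}
    for f in strategy["features"]:
        if f["key"] not in fields:
            errors.append(f"feature {f['name']}: unknown field {f['key']}")
    for r in strategy["rules"]:
        if r["feature"] not in feature_names:
            errors.append(f"rule {r['name']}: unknown feature {r['feature']}")
        if r["op"] not in OPS:
            errors.append(f"rule {r['name']}: unsupported operator {r['op']}")
    return errors


def build_rules(config):
    """Turn rule entries into (name, predicate) pairs."""
    rules = []
    for r in config["strategy"]["rules"]:
        compare = OPS[r["op"]]
        rules.append((r["name"],
                      lambda feats, r=r, compare=compare:
                      compare(feats[r["feature"]], r["threshold"])))
    return rules


def evaluate(rules, features):
    """Return the names of the rules that fire for one feature vector."""
    return [name for name, predicate in rules if predicate(features)]


errors = validate(CONFIG)
hits_v1 = evaluate(build_rules(CONFIG), {"ip_count_1m": 80})

# A strategy update: only the threshold in the configuration changes.
updated = copy.deepcopy(CONFIG)
updated["strategy"]["rules"][0]["threshold"] = 50
hits_v2 = evaluate(build_rules(updated), {"ip_count_1m": 80})
```

With the original threshold of 100 the feature value 80 triggers nothing; after the configuration update to 50 the same value triggers `ip_burst`. Validating before rebuilding the rules is one way to limit the configuration-management risk that frequent updates introduce.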

Simulation filtering is implemented by reusing the same pipeline with isolated test configurations. For offline data, an HDFS Parquet source with file‑level sorting ensures deterministic ordering, achieving ~99% validation accuracy.
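Why a fixed ordering matters for replay can be shown with a small plain-Python sketch. It assumes that "file-level sorting" means reading files in name order and rows in timestamp order, which the source does not spell out; the dict of lists stands in for Parquet files on HDFS, and the counting rule, file names, and reference decisions are made up for illustration.

```python
def ordered_records(files):
    """Yield records in a reproducible order: files by name, rows by timestamp."""
    for name in sorted(files):
        # uid as a tie-breaker keeps rows with equal timestamps deterministic.
        for rec in sorted(files[name], key=lambda r: (r["ts"], r["uid"])):
            yield rec


def simulate(files, threshold):
    """Replay records through a toy per-uid counting rule under test."""
    counts, decisions = {}, []
    for rec in ordered_records(files):
        counts[rec["uid"]] = counts.get(rec["uid"], 0) + 1
        flagged = counts[rec["uid"]] > threshold
        decisions.append((rec["uid"], rec["ts"], flagged))
    return decisions


def agreement(simulated, reference):
    """Fraction of positions at which two decision lists are identical."""
    same = sum(1 for s, r in zip(simulated, reference) if s == r)
    return same / len(reference)


# Hypothetical input: files listed out of order, rows unsorted inside a file.
files = {
    "part-00001": [{"uid": "u1", "ts": 30}, {"uid": "u2", "ts": 20}],
    "part-00000": [{"uid": "u1", "ts": 10}, {"uid": "u1", "ts": 5}],
}

first_run = simulate(files, threshold=2)
second_run = simulate(files, threshold=2)

# Hypothetical decisions recorded by the online job for the same events.
reference = [("u1", 5, False), ("u1", 10, False), ("u2", 20, False), ("u1", 30, False)]
rate = agreement(first_run, reference)
```

Because the order is fixed, repeated runs give identical decisions, so any difference from the reference can be attributed to the strategy under test rather than to replay order. In this toy input the simulated rule flags the third `u1` event while the reference does not, giving an agreement rate of 0.75.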

Data analysis capabilities are provided through both real‑time ClickHouse tables for dashboards and alerts, and offline Hive tables for deep mining and model training. Integration with a self‑service analytics platform (TDA) allows business users to run SQL queries and visualizations without developer assistance.

In summary, the Flink‑based real‑time anti‑fraud system delivers low‑latency, high‑throughput risk detection, robust state management, and flexible configuration, significantly improving detection efficiency and operational stability.

Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
