How a Traditional Finance Firm Tackles Real‑Time Analytics with Flink
This article details a financial company's exploration of Apache Flink for real‑time processing, covering its unique business constraints, end‑to‑end data pipeline, single‑table and multi‑table use cases, implementation challenges, code snippets, data initialization, testing strategies, and lessons learned.
Business Background
Traditional financial institutions process form‑based business records that change frequently, have long update cycles (days to months), require recomputation of historical statistics when dimension tables change, and demand high monetary accuracy.
Real‑time Data Processing Flow
Data are captured via Canal (reading the MySQL binlog) and Flume, then written to Kafka. Flink SQL defines the source and sink tables, processes the streams, and writes results to MySQL, Elasticsearch, HBase, or back to Kafka for downstream jobs.
Two data sources: business data (MySQL → Canal → Kafka) and log data (Flume → Kafka).
Define Flink SQL source/sink tables linking Kafka topics to target stores.
Optionally upload compiled Flink JARs to the platform.
Create and run Flink jobs on YARN.
Sink final results to MySQL/ES/HBase or Kafka.
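As a rough illustration of the "define source/sink tables" step, the sketch below holds three Flink SQL statements as Java strings. It is not the firm's actual configuration: the table names, column names, topic, hosts and database are hypothetical, and the connector option names assume Flink 1.11 or later (older releases such as 1.10 used a different option style).

```java
import java.util.List;

final class PipelineDdlSketch {

    // Hypothetical Kafka-backed source table fed by Canal (option names assume Flink 1.11+).
    static final String SOURCE_DDL =
          "CREATE TABLE plan_cust_src (\n"
        + "  apply_no STRING,\n"
        + "  term_no INT,\n"
        + "  due_amount DECIMAL(18, 2)\n"
        + ") WITH (\n"
        + "  'connector' = 'kafka',\n"
        + "  'topic' = 'plan_cust_binlog',\n"
        + "  'properties.bootstrap.servers' = 'kafka-host:9092',\n"
        + "  'properties.group.id' = 'flink-plan-cust',\n"
        + "  'scan.startup.mode' = 'earliest-offset',\n"
        + "  'format' = 'canal-json'\n"
        + ")";

    // Hypothetical MySQL sink; the primary key lets the JDBC connector upsert rows.
    static final String SINK_DDL =
          "CREATE TABLE plan_cust_sink (\n"
        + "  apply_no STRING,\n"
        + "  term_no INT,\n"
        + "  due_amount DECIMAL(18, 2),\n"
        + "  PRIMARY KEY (apply_no, term_no) NOT ENFORCED\n"
        + ") WITH (\n"
        + "  'connector' = 'jdbc',\n"
        + "  'url' = 'jdbc:mysql://mysql-host:3306/report_db',\n"
        + "  'table-name' = 'plan_cust_result'\n"
        + ")";

    // The job itself: a plain projection from source to sink.
    static final String INSERT_SQL =
        "INSERT INTO plan_cust_sink SELECT apply_no, term_no, due_amount FROM plan_cust_src";

    static List<String> statements() {
        return List.of(SOURCE_DDL, SINK_DDL, INSERT_SQL);
    }

    public static void main(String[] args) {
        // Print the statements in submission order; a real job would hand them to the platform.
        for (String sql : statements()) {
            System.out.println(sql + ";\n");
        }
    }
}
```

Keeping the statements as data makes it easy to review the source, sink and job definition side by side before they are submitted.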
Use Case 1: Single‑Table Real‑time Calculation
Goal: from a repayment‑plan table, compute per‑order metrics such as first‑month payment, first repayment date, first overdue date, current period, paid amount, and remaining amount.
The implementation uses the Flink DataStream API. Key steps:
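    import java.util.List;
    import java.util.Properties;
    import org.apache.flink.api.common.typeinfo.TypeHint;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.runtime.state.StateBackend;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    // KafkaData, CustPlanSource, RepayPlanCustOut, ResultInfo and the helper classes used
    // below are the project's own types; kafkaIPs and topic are defined elsewhere in the job.
    public static void main(String[] args) throws Exception {
        // Execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 10 minutes (interval is given in milliseconds)
        env.enableCheckpointing(600_000);
        // RocksDB state backend, checkpointing to HDFS
        String chkpointPath = "hdfs://ha/xxx/checkpoint";
        StateBackend backend = new RocksDBStateBackend(chkpointPath);
        env.setStateBackend(backend);
        // Use event time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Kafka consumer
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", kafkaIPs);
        props.setProperty("group.id", "group.cyb.1");
        FlinkKafkaConsumer<KafkaData<List<CustPlanSource>>> consumer =
            new FlinkKafkaConsumer<>(topic, new KafkaDeserializationScheme(), props);
        consumer.assignTimestampsAndWatermarks(new PeriodicTsAndWmarks());
        // Start from a given timestamp (epoch milliseconds, passed as the first argument)
        long startTs = Long.parseLong(args[0]);
        consumer.setStartFromTimestamp(startTs);
        // Pipeline: keep only non-DDL changes to plan_cust, key by order, aggregate per 20 s window
        SingleOutputStreamOperator<ResultInfo> ds = env
            .addSource(consumer)
            .filter(f -> f.getTable().equalsIgnoreCase("plan_cust") && !f.getIsDdl())
            .flatMap(new DataFlatMap())
            .keyBy(RepayPlanCustOut::getApply_no)
            .window(TumblingEventTimeWindows.of(Time.seconds(20)))
            .process(new CustProcessWindowFun())
            .returns(new TypeHint<ResultInfo>(){})
            .setParallelism(15);
        ds.writeUsingOutputFormat(new SinkOutFormat()).setParallelism(15);
        env.execute("planMain-1");
    }

The pipeline keeps the latest binlog record for each order in Flink state, merges it with the HBase detail tables when a window closes, and writes the aggregated results back to an HBase result table.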
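The per-order logic can be sketched in plain Java, without Flink, to show the two ideas involved: keeping only the newest row per period, and deriving the metrics once a window closes. This is a sketch, not the article's `CustProcessWindowFun`. All class and field names are hypothetical, and the metric definitions are assumptions: "first overdue date" is the due date of the earliest period that is unpaid and past due, "current period" is the first period whose due date has not passed, and due dates are assumed to increase with the period number. `BigDecimal` is used because the amounts must be exact.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.TreeMap;

final class RepayPlanMetricsSketch {

    /** One repayment-plan row; all field names are hypothetical. */
    static final class Installment {
        final int period;
        final LocalDate dueDate;
        final BigDecimal dueAmount;
        final BigDecimal paidAmount;
        final long version; // e.g. binlog timestamp; higher means newer

        Installment(int period, LocalDate dueDate, String due, String paid, long version) {
            this.period = period;
            this.dueDate = dueDate;
            this.dueAmount = new BigDecimal(due);
            this.paidAmount = new BigDecimal(paid);
            this.version = version;
        }
    }

    /** Per-order metrics produced when a window closes. */
    static final class Metrics {
        BigDecimal firstPayment;
        LocalDate firstRepayDate;
        LocalDate firstOverdueDate; // stays null when nothing is overdue
        int currentPeriod;
        BigDecimal paidAmount = BigDecimal.ZERO;
        BigDecimal remainingAmount = BigDecimal.ZERO;
    }

    /** Stand-in for keyed state: the newest row seen for each period of one order. */
    static final class OrderState {
        private final TreeMap<Integer, Installment> latestByPeriod = new TreeMap<>();

        void apply(Installment row) {
            // Keep the row with the higher version so late or replayed records cannot overwrite newer data.
            latestByPeriod.merge(row.period, row,
                (old, incoming) -> incoming.version >= old.version ? incoming : old);
        }

        Metrics compute(LocalDate asOf) {
            if (latestByPeriod.isEmpty()) {
                throw new IllegalStateException("no rows for this order");
            }
            Metrics m = new Metrics();
            Installment first = latestByPeriod.firstEntry().getValue();
            m.firstPayment = first.dueAmount;
            m.firstRepayDate = first.dueDate;
            m.currentPeriod = latestByPeriod.lastKey(); // fallback: every due date has passed
            boolean currentFound = false;
            for (Installment i : latestByPeriod.values()) { // iterates in period order
                m.paidAmount = m.paidAmount.add(i.paidAmount);
                m.remainingAmount = m.remainingAmount.add(i.dueAmount.subtract(i.paidAmount));
                boolean unpaid = i.paidAmount.compareTo(i.dueAmount) < 0;
                if (unpaid && i.dueDate.isBefore(asOf) && m.firstOverdueDate == null) {
                    m.firstOverdueDate = i.dueDate;
                }
                if (!currentFound && !i.dueDate.isBefore(asOf)) {
                    m.currentPeriod = i.period;
                    currentFound = true;
                }
            }
            return m;
        }
    }

    public static void main(String[] args) {
        OrderState state = new OrderState();
        state.apply(new Installment(1, LocalDate.of(2024, 1, 15), "100.10", "100.10", 2L));
        state.apply(new Installment(1, LocalDate.of(2024, 1, 15), "100.10", "0.00", 1L)); // stale, ignored
        state.apply(new Installment(2, LocalDate.of(2024, 2, 15), "100.10", "0.00", 1L));
        Metrics m = state.compute(LocalDate.of(2024, 3, 1));
        System.out.println("paid=" + m.paidAmount + " remaining=" + m.remainingAmount
            + " firstOverdue=" + m.firstOverdueDate);
    }
}
```

In the real job the map would live in Flink keyed state (keyed by order number), so that it is covered by the checkpoints configured above.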
Data Initialization and Testing
Historical MySQL data are first synchronized to HBase using Hive external tables. The same Hive scripts generate expected statistical results, which are then compared with the Flink output via a Python verification script.
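The comparison step might look like the sketch below. The article's verification script is written in Python; Java is used here only to match the other code in this article, and the class name, method name and the choice of one amount per order number are assumptions made for illustration. Amounts are compared with `compareTo` so that `100.10` and `100.1` count as equal.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ResultVerifierSketch {

    /** Returns one message per order whose expected (Hive) and actual (Flink) values disagree. */
    static List<String> diff(Map<String, BigDecimal> expected, Map<String, BigDecimal> actual) {
        List<String> problems = new ArrayList<>();
        Set<String> keys = new TreeSet<>(expected.keySet()); // sorted for stable reports
        keys.addAll(actual.keySet());
        for (String key : keys) {
            BigDecimal e = expected.get(key);
            BigDecimal a = actual.get(key);
            if (e == null) {
                problems.add(key + ": unexpected in Flink output");
            } else if (a == null) {
                problems.add(key + ": missing from Flink output");
            } else if (e.compareTo(a) != 0) { // numeric comparison, ignores trailing zeros
                problems.add(key + ": expected " + e + " but got " + a);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, BigDecimal> expected = new HashMap<>();
        expected.put("A001", new BigDecimal("100.10"));
        expected.put("A002", new BigDecimal("50.00"));
        Map<String, BigDecimal> actual = new HashMap<>();
        actual.put("A001", new BigDecimal("100.1"));
        actual.put("A002", new BigDecimal("49.99"));
        diff(expected, actual).forEach(System.out::println);
    }
}
```

Reporting missing and unexpected keys separately from value mismatches helps distinguish a lost record from a wrong calculation.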
Use Case 2: Multi‑Table Join Real‑time Calculation
Four solution attempts were evaluated to join a main table with several dimension tables (contract, customer, product, etc.) in real time.
Solution 1 (Flink 1.5 + HBase dimension tables): Sync each dimension table to HBase and join in Flink. Simple, but it introduces latency because dimension updates must propagate to HBase first.
Solution 2 (Flink 1.10 + MySQL dimension tables): Use MySQL directly as dimension tables, eliminating the HBase lag. However, updates to dimension tables do not trigger recomputation of the join.
Solution 3 (All tables in Kafka): Stream every table into Kafka, and let Flink read the main table and join it with dimension tables stored in MySQL. Any table update triggers recomputation, but the result table is rewritten entirely each time, which is inefficient.
Solution 4 (Result table as incremental dimension): Keep the result table as a dimension table and update only the fields that changed per task. This reduces write overhead while still supporting real‑time updates from any source.
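The "update only the fields that changed" idea in Solution 4 can be illustrated with a small field-level diff. This is a sketch under stated assumptions, not the firm's implementation: rows are modelled as plain maps, and the class, method and column names are hypothetical. In a real job the returned delta would be written as a partial update to the result table, for example by writing only those columns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class PartialUpdateSketch {

    /**
     * Compares the fields one task wants to write with the current result row
     * and returns only those that are new or different.
     */
    static Map<String, Object> changedFields(Map<String, Object> current, Map<String, Object> incoming) {
        Map<String, Object> delta = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : incoming.entrySet()) {
            String field = e.getKey();
            boolean isNew = !current.containsKey(field);
            if (isNew || !Objects.equals(current.get(field), e.getValue())) {
                delta.put(field, e.getValue());
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        Map<String, Object> resultRow = new LinkedHashMap<>();
        resultRow.put("apply_no", "A001");
        resultRow.put("customer_name", "Li");
        resultRow.put("product_name", "Loan-12");

        // The customer task only knows about customer fields.
        Map<String, Object> fromCustomerTask = new LinkedHashMap<>();
        fromCustomerTask.put("apply_no", "A001");
        fromCustomerTask.put("customer_name", "Li Wei");

        // Only customer_name differs, so only that field would be written.
        System.out.println(changedFields(resultRow, fromCustomerTask));
    }
}
```

Because each task touches only its own fields, an update to one dimension table no longer forces a rewrite of the whole result row.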
The experiments illustrate trade‑offs among simplicity, latency, and write efficiency when applying Flink to traditional finance workloads.
Conclusion
Flink provides a fast, event‑time, stateful stream‑processing platform that integrates well with Kafka, MySQL, HBase and Elasticsearch. In financial scenarios with long update cycles and strict accuracy requirements, practical challenges such as state management, checkpointing, and triggering on dimension‑table updates require deep Flink expertise and careful architectural design.
dbaplus Community
