Big Data 18 min read

How a Bank Built a Near‑Real‑Time Data Platform with Kafka, Flink & Hudi

An in‑depth case study of a Chinese bank’s near‑real‑time data platform reveals its evolution from a monolithic CDC pipeline to a split architecture featuring a real‑time data lake and a data‑service bus, detailing component choices, schema‑registry integration, SDK development, observability, and future roadmap.

dbaplus Community
dbaplus Community
dbaplus Community
How a Bank Built a Near‑Real‑Time Data Platform with Kafka, Flink & Hudi

Near‑Real‑Time Data Platform Overview

Initiated in 2016, the platform follows a layered design: a source layer (Kafka), a standardization layer, and a publishing layer. Data ingestion consists of three main paths:

Change‑data‑capture (CDC) from relational databases via Oracle GoldenGate (OGG) → Kafka source topic.

Latency‑sensitive asynchronous exports written directly with the Kafka Producer API.

Complex computation results from Flink or Spark jobs written to the publishing layer.

Standardization jobs (Spark Streaming or Flink) apply common business logic, then write the processed records to a Kafka standard topic. Downstream applications consume from this standard layer.

Architecture Evolution – Real‑Time Data Lake

The platform was split into two sub‑platforms: Real‑Time Data Lake and Data Service Bus.

Real‑Time Data Lake uses Kafka for ingestion and Apache Hudi on HDFS for storage, delivering minute‑level data visibility. Two import modes are supported:

Batch import of existing lake data via Spark.

Real‑time ingestion from the Data Service Bus using Flink, writing directly into Hudi tables.

Data flow:

Source layer: CDC or API writes to Kafka.

Flink streaming jobs read from the source, transform records, and write to Hudi.

Hudi provides incremental pull‑up and snapshot queries; Presto serves OLAP queries.

Export layer supports three scenarios:

Batch export via Spark/Hive for stable, low‑latency downstream data.

Second‑level real‑time export via Flink.

Interactive OLAP queries via Presto.

This architecture achieves minute‑level latency, bridging the gap between day‑level lake delays and second‑level bus delays.

Architecture Evolution – Data Service Bus

The Data Service Bus centers on Confluent's open‑source Schema Registry to decouple data schemas from payloads.

Schemas are stored in a dedicated Kafka topic ( _schemas) and cached in memory for fast reads.

A RESTful API provides schema registration and retrieval.

High‑availability is achieved through multi‑instance deployment, client‑side caching, and cross‑domain replication.

Security integrates Kerberos, a custom RBAC plugin, and audit logging.

A custom Kafka SDK encapsulates producer/consumer logic, offering version stability, simplified upgrades, disaster‑recovery switching, and fine‑grained data‑face control.

Observability improvements include detailed Kafka metrics (e.g., request‑thread busy rate, consumer latency) and a control console for tenant‑level schema management (create, update, delete).

Schema Registry High‑Availability & Cross‑Domain Replication

Each Schema Registry instance writes registration requests to the _schemas Kafka topic; read requests are served from an in‑memory cache.

Client HA: schemas are cached locally; if a server is unavailable, only new schema registrations are affected.

Server HA: multiple independent instances each maintain their own cache and persist to the same Kafka topic, avoiding single‑point failures.

Cross‑domain replication: a primary instance in domain A writes to _schemas; a secondary instance in domain B reads from the same topic and serves read requests, enabling schema access across data centers.

Disaster‑recovery: scripts can switch the primary node and update client configurations automatically.

Consumption Customization

The SDK integrates with an internal Schema Manager service. Downstream consumers can request only the fields they need (e.g., order status vs. logistics). The SDK fetches the appropriate schema from the Schema Registry and the custom schema from Schema Manager, applies compatibility handling, and delivers filtered records.

Observability & Monitoring

Kafka metrics are collected via Prometheus, including request‑thread busy rate, consumer lag, and per‑topic throughput. Alerts trigger on threshold breaches. A web console visualizes tenant‑level schemas and operational metrics.

Future Plans

Continue migrating all minute‑level source scenarios to the Real‑Time Data Lake and explore lake‑warehouse and unified streaming‑batch architectures.

Enhance the Data Service Bus: implement client disaster‑recovery, extend the Flink connector, enrich the management console, and build long‑term Xinchu (信创) clusters.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Flinkreal-time dataData LakeBig Data ArchitectureSchema Registry
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.