
Real-time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents JD's real-time dimension modeling case using Flink SQL, detailing two complex streaming scenarios, the difficulties of handling historical data and state management, and a component‑based solution that leverages external KV stores and optimized Flink operators to improve performance and scalability.

Big Data Technology & Architecture

Introduction: This article shares JD's real-time dimension modeling case built on Flink SQL, covering the problem, the challenges, the solution, and future plans.

Problem : Two scenarios are hard to handle: (1) a real‑time multi‑stream full join, where one stream may have to be joined against the full history of another; (2) real‑time aggregation over a whole group, such as row_number and min. In simple SQL, they look like this:

select * from A full join B on A.name = B.name;
select id, name, val, row_number() over (partition by name order by val) as rn from A;
select name, min(val) from A group by name;

Running these statements directly raises three issues: the job has no historical data when it starts, the state kept in memory or RocksDB grows large, and performance becomes a concern.
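The missing-history issue can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Flink code: `running_min`, the `seed` parameter, and the sample rows are all hypothetical names chosen for illustration. It mimics a per-name min aggregation whose state starts either empty or pre-loaded with historical values.

```python
def running_min(events, seed=None):
    """Emit (name, current_min) after every event, like a streaming min per name."""
    state = dict(seed or {})  # per-key state: name -> smallest val seen so far
    out = []
    for name, val in events:
        state[name] = min(val, state.get(name, val))
        out.append((name, state[name]))
    return out

# Rows the job actually receives after it starts (hypothetical sample data).
live = [("a", 5), ("b", 6)]

# Cold start: state is empty, so earlier rows such as ("a", 3) are unknown.
cold = running_min(live)

# Warm start: state is seeded with minima taken from historical data.
warm = running_min(live, seed={"a": 3, "b": 7})

print(cold)  # [('a', 5), ('b', 6)]
print(warm)  # [('a', 3), ('b', 6)]
```

The cold run reports 5 as the minimum for `a` even though a smaller value existed before the job started, which is the kind of discrepancy that historical data is meant to remove.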

Challenges : (1) Obtaining historical data means reading it from an external KV store (e.g., HBase, Redis) and keeping the time ranges of historical and live data consistent; (2) improving performance means reducing I/O and raising concurrency through async I/O, parallelism, and caching.
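One way to picture how async I/O, bounded concurrency, and caching combine is the sketch below. It uses only Python's `asyncio` and an in-memory dict standing in for the KV store; `CachedAsyncLookup`, `fake_kv_get`, and `max_in_flight` are hypothetical names, and nothing here reflects the actual HBase or Redis client APIs or the operators used in the article.

```python
import asyncio

class CachedAsyncLookup:
    """Async KV lookup with a local cache and a cap on concurrent requests."""

    def __init__(self, fetch, max_in_flight=4):
        self._fetch = fetch    # coroutine function: key -> value
        self._tasks = {}       # key -> task; finished tasks double as the cache
        self._sem = asyncio.Semaphore(max_in_flight)
        self.store_reads = 0   # how many requests actually reached the store

    async def _guarded_fetch(self, key):
        async with self._sem:  # limit simultaneous requests to the store
            self.store_reads += 1
            return await self._fetch(key)

    async def get(self, key):
        task = self._tasks.get(key)
        if task is None:       # first request for this key: start one fetch
            task = asyncio.ensure_future(self._guarded_fetch(key))
            self._tasks[key] = task
        return await task      # later requests share the same result

async def demo():
    store = {"a": 3, "b": 7}   # stand-in for an external KV store

    async def fake_kv_get(key):
        await asyncio.sleep(0.01)  # simulated network latency
        return store.get(key)

    lookup = CachedAsyncLookup(fake_kv_get, max_in_flight=2)
    keys = ["a", "b", "a", "c", "b"]
    values = await asyncio.gather(*(lookup.get(k) for k in keys))
    return values, lookup.store_reads

values, store_reads = asyncio.run(demo())
print(values, store_reads)
```

Five lookups result in only three store reads, because repeated keys reuse the in-flight or completed request. The sketch never evicts entries and also caches failures; a real deployment would need expiry and error handling.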

Solution : A component‑based design (RDM Building) is proposed, where users configure dimension‑modeling components that are parsed into Flink operators. Components such as HisRows handle fetching historical data, keyBy, window aggregation, and merging streams (A_all, B_all, C_all) for downstream Flink SQL computation.
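The idea of turning configured components into a processing chain can be sketched as follows. This is an illustrative guess at the shape of such a design, not RDM Building's implementation: the `his_rows` function, the `COMPONENTS` registry, the config layout, and the sample data are all assumptions, and plain Python generators stand in for Flink operators.

```python
def his_rows(stream, key_field, kv_history):
    """Hypothetical HisRows-like step: the first time a key appears,
    emit its historical rows from the KV store, then pass live rows through."""
    seen = set()
    for row in stream:
        key = row[key_field]
        if key not in seen:
            seen.add(key)
            yield from kv_history.get(key, [])
        yield row

# Registry mapping configured component types to their implementations.
COMPONENTS = {"his_rows": his_rows}

def build(config, stream, kv_history):
    """Parse a list of component configs into a chained pipeline."""
    for step in config:
        component = COMPONENTS[step["type"]]
        stream = component(stream, step["key"], kv_history)
    return stream

# Hypothetical user configuration and data.
config = [{"type": "his_rows", "key": "name"}]
kv_history = {"a": [{"name": "a", "val": 3}]}
live_a = [
    {"name": "a", "val": 5},
    {"name": "b", "val": 6},
    {"name": "a", "val": 4},
]

# A_all: live stream A enriched with its history, ready for downstream SQL.
a_all = list(build(config, live_a, kv_history))
print(a_all)
```

Here the historical row for `a` is emitted once, ahead of the first live row for that key, so a downstream join or aggregation sees both old and new data for each key.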

Images illustrate the processing flow and component execution.

Planning : Future work includes a visual configuration UI, support for multiple streaming engines (Spark Streaming, etc.), and broader KV store integration beyond HBase and Redis.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
