Real-time Dimension Modeling with Flink SQL: Challenges and Solutions
This article presents a JD.com case study on applying Flink SQL to real-time dimension modeling. It details two complex streaming scenarios, full join of multiple streams and full-history group aggregation, examines the associated challenges of historical data handling, state management, and performance optimization, and proposes a component-based architectural solution.
Introduction: JD.com BI engineer Yang Shang'ang shares a case of using Flink SQL for real‑time dimension modeling, covering problem definition, challenges, solution, and future planning.
Problem
Two difficult scenarios are examined: (1) real‑time multi‑stream full‑join where a stream may need to join with full historical data; (2) real‑time stream full‑group aggregation (e.g., ranking or min calculation) that requires both current and historical records. Simple SQL statements appear straightforward:
select * from A full join B on A.name = B.name;

select id, name, val,
       row_number() over (partition by name order by val) as rn
from A;

select name, min(val) from A group by name;

However, practical use faces issues such as late arrival of historical data, large state size, and memory-intensive RocksDB storage.
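To make the state requirement behind the first query concrete, full-join semantics can be simulated outside Flink. The following is a minimal Python sketch, not JD's implementation; the dict-based storage stands in for Flink operator state, and illustrates why both sides must retain every row seen so far:

```python
# Simulation of `select * from A full join B on A.name = B.name`.
# Each side's state must keep ALL rows seen so far -- this is why
# multi-stream full joins accumulate unbounded state.

def full_join(rows_a, rows_b):
    """Full outer join of two row lists on the 'name' field."""
    by_name_b = {}
    for b in rows_b:
        by_name_b.setdefault(b["name"], []).append(b)
    matched = set()
    out = []
    for a in rows_a:
        bs = by_name_b.get(a["name"])
        if bs:
            matched.add(a["name"])
            for b in bs:
                out.append({**a, **{"b_" + k: v for k, v in b.items()}})
        else:
            # unmatched A row: null-pad the B side
            out.append({**a, "b_name": None, "b_val": None})
    for name, bs in by_name_b.items():
        if name not in matched:
            for b in bs:
                # unmatched B row: null-pad the A side
                out.append({"name": None, "val": None,
                            **{"b_" + k: v for k, v in b.items()}})
    return out

A = [{"name": "x", "val": 1}, {"name": "y", "val": 2}]
B = [{"name": "x", "val": 10}, {"name": "z", "val": 30}]
rows = full_join(A, B)
```

In a streaming setting the same matching must happen incrementally, and a late-arriving historical row can retroactively turn an earlier null-padded result into a matched one, which is exactly the retraction problem the article raises.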
Difficulty
Key difficulties include acquiring complete historical data for joins, managing state efficiently, and achieving acceptable performance. Real‑time dimension modeling requires external KV stores (e.g., HBase, Redis) to persist full history, careful handling of message ordering, and strategies to reduce I/O and increase parallelism.
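The lookup pattern described above can be sketched as follows. This is an illustrative simulation, with a plain dict standing in for an external KV store such as HBase or Redis; the class and method names are assumptions, not JD's API:

```python
# Sketch: enrich a stream by looking up full history in an external KV
# store. The dict `kv` simulates HBase/Redis. The per-key local cache
# reduces I/O, which is only safe if the same key always lands on the
# same parallel instance -- the guarantee Flink's keyBy provides.

class HistoryLookup:
    def __init__(self, kv):
        self.kv = kv      # external KV store (simulated)
        self.cache = {}   # local cache, valid under consistent key routing

    def enrich(self, record):
        key = record["name"]
        if key not in self.cache:
            self.cache[key] = self.kv.get(key, [])  # one I/O per cold key
        history = self.cache[key]
        history.append(record)
        return {**record, "history_count": len(history)}

kv = {"x": [{"name": "x", "val": 0}]}
lookup = HistoryLookup(kv)
r1 = lookup.enrich({"name": "x", "val": 1})
r2 = lookup.enrich({"name": "x", "val": 2})
```

Raising parallelism then scales both the cache and the I/O budget, since each instance only ever sees its own key range.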
Solution
A component‑based architecture called RDM Building is proposed. Users configure dimension‑modeling components, which are compiled into Flink operators (keyBy, flatMap, etc.). The “HisRows Component” fetches historical records from KV storage, merges them with incoming streams (A, B, C), and produces enriched streams (A_all, B_all, C_all) that can be processed with ordinary Flink SQL.
The design ensures that each key is routed to the same slot, improving cache hit rates and consistency, while supporting both insert and delete tuples for change‑data‑capture scenarios.
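Supporting both insert and delete tuples means aggregates must handle retractions. A minimal sketch, assuming a "+"/"-" op flag per tuple (the class below is illustrative, not the article's implementation), shows why multiset semantics are needed for something like min(val):

```python
# Sketch: maintaining min(val) per key over a retract stream of
# insert ("+") and delete ("-") tuples, as in CDC scenarios.
# A multiset is kept so retracting the current minimum can
# recompute the aggregate from the remaining values.

from collections import defaultdict

class RetractMin:
    def __init__(self):
        self.values = defaultdict(list)  # key -> multiset of values

    def apply(self, op, key, val):
        if op == "+":
            self.values[key].append(val)
        elif op == "-":
            self.values[key].remove(val)  # retract one occurrence
        vs = self.values[key]
        return min(vs) if vs else None

agg = RetractMin()
agg.apply("+", "x", 5)
agg.apply("+", "x", 3)
low = agg.apply("-", "x", 3)  # retracting the current min
```

Keeping only the running minimum would be insufficient here: once the minimum is deleted, the aggregate must fall back to the next-smallest retained value.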
Planning
Future work includes building a visual front‑end for component configuration, extending support to other streaming engines (e.g., Spark Streaming), and adding more KV back‑ends beyond HBase and Redis to lower the learning curve for users.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.