Real-time Dimension Modeling with Flink SQL: Challenges and Solutions
This article presents a JD.com case study on applying Flink SQL to real-time dimension modeling. It details two complex streaming scenarios, full join of multiple streams and full-history group aggregation, examines the associated challenges of historical data handling, state management, and performance optimization, and proposes a component-based architectural solution.
Introduction: JD.com BI engineer Yang Shang'ang shares a case of using Flink SQL for real‑time dimension modeling, covering problem definition, challenges, solution, and future planning.
Problem
Two difficult scenarios are examined: (1) real‑time multi‑stream full‑join where a stream may need to join with full historical data; (2) real‑time stream full‑group aggregation (e.g., ranking or min calculation) that requires both current and historical records. Simple SQL statements appear straightforward:
select * from A full join B on A.name = B.name;

select id, name, val,
       row_number() over (partition by name order by val) as rn
from A;

select name, min(val) from A group by name;

However, practical use faces issues such as late arrival of historical data, large state size, and memory-intensive RocksDB storage.
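To make the state requirement behind the first query concrete, full-join semantics can be simulated outside Flink. The following is a minimal Python sketch, not JD's implementation; the dict-based storage stands in for Flink operator state, and illustrates why both sides must retain every row seen so far:

```python
# Simulation of `select * from A full join B on A.name = B.name`.
# Each side's state must keep ALL rows seen so far -- this is why
# multi-stream full joins accumulate unbounded state.

def full_join(rows_a, rows_b):
    """Full outer join of two row lists on the 'name' field."""
    by_name_b = {}
    for b in rows_b:
        by_name_b.setdefault(b["name"], []).append(b)
    matched = set()
    out = []
    for a in rows_a:
        bs = by_name_b.get(a["name"])
        if bs:
            matched.add(a["name"])
            for b in bs:
                out.append({**a, **{"b_" + k: v for k, v in b.items()}})
        else:
            # unmatched A row: null-pad the B side
            out.append({**a, "b_name": None, "b_val": None})
    for name, bs in by_name_b.items():
        if name not in matched:
            for b in bs:
                # unmatched B row: null-pad the A side
                out.append({"name": None, "val": None,
                            **{"b_" + k: v for k, v in b.items()}})
    return out

A = [{"name": "x", "val": 1}, {"name": "y", "val": 2}]
B = [{"name": "x", "val": 10}, {"name": "z", "val": 30}]
rows = full_join(A, B)
```

In a streaming setting the same matching must happen incrementally, and a late-arriving historical row can retroactively turn an earlier null-padded result into a matched one, which is exactly the retraction problem the article raises.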
Difficulty
Key difficulties include acquiring complete historical data for joins, managing state efficiently, and achieving acceptable performance. Real‑time dimension modeling requires external KV stores (e.g., HBase, Redis) to persist full history, careful handling of message ordering, and strategies to reduce I/O and increase parallelism.
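The lookup pattern described above can be sketched as follows. This is an illustrative simulation, with a plain dict standing in for an external KV store such as HBase or Redis; the class and method names are assumptions, not JD's API:

```python
# Sketch: enrich a stream by looking up full history in an external KV
# store. The dict `kv` simulates HBase/Redis. The per-key local cache
# reduces I/O, which is only safe if the same key always lands on the
# same parallel instance -- the guarantee Flink's keyBy provides.

class HistoryLookup:
    def __init__(self, kv):
        self.kv = kv      # external KV store (simulated)
        self.cache = {}   # local cache, valid under consistent key routing

    def enrich(self, record):
        key = record["name"]
        if key not in self.cache:
            self.cache[key] = self.kv.get(key, [])  # one I/O per cold key
        history = self.cache[key]
        history.append(record)
        return {**record, "history_count": len(history)}

kv = {"x": [{"name": "x", "val": 0}]}
lookup = HistoryLookup(kv)
r1 = lookup.enrich({"name": "x", "val": 1})
r2 = lookup.enrich({"name": "x", "val": 2})
```

Raising parallelism then scales both the cache and the I/O budget, since each instance only ever sees its own key range.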
Solution
A component‑based architecture called RDM Building is proposed. Users configure dimension‑modeling components, which are compiled into Flink operators (keyBy, flatMap, etc.). The “HisRows Component” fetches historical records from KV storage, merges them with incoming streams (A, B, C), and produces enriched streams (A_all, B_all, C_all) that can be processed with ordinary Flink SQL.
The design ensures that each key is routed to the same slot, improving cache hit rates and consistency, while supporting both insert and delete tuples for change‑data‑capture scenarios.
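Supporting both insert and delete tuples means aggregates must handle retractions. A minimal sketch, assuming a "+"/"-" op flag per tuple (the class below is illustrative, not the article's implementation), shows why multiset semantics are needed for something like min(val):

```python
# Sketch: maintaining min(val) per key over a retract stream of
# insert ("+") and delete ("-") tuples, as in CDC scenarios.
# A multiset is kept so retracting the current minimum can
# recompute the aggregate from the remaining values.

from collections import defaultdict

class RetractMin:
    def __init__(self):
        self.values = defaultdict(list)  # key -> multiset of values

    def apply(self, op, key, val):
        if op == "+":
            self.values[key].append(val)
        elif op == "-":
            self.values[key].remove(val)  # retract one occurrence
        vs = self.values[key]
        return min(vs) if vs else None

agg = RetractMin()
agg.apply("+", "x", 5)
agg.apply("+", "x", 3)
low = agg.apply("-", "x", 3)  # retracting the current min
```

Keeping only the running minimum would be insufficient here: once the minimum is deleted, the aggregate must fall back to the next-smallest retained value.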
Planning
Future work includes building a visual front‑end for component configuration, extending support to other streaming engines (e.g., Spark Streaming), and adding more KV back‑ends beyond HBase and Redis to lower the learning curve for users.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.