
Real‑time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents a JD.com BI engineer's case study on applying Flink SQL to real‑time dimension modeling, detailing two complex streaming scenarios, the technical difficulties of handling historical data and performance, and a component‑based solution architecture with future roadmap considerations.

DataFunTalk

In this talk, JD.com BI engineer Yang Shang'ang presents a practical case of using Flink SQL for real‑time dimension modeling, organized into four parts: problem description, difficulty analysis, solution design, and future planning.

Problem 1: a real‑time multi‑stream full join in which each stream must be joined against the other stream's complete history, not just the records currently in flight. Example SQL: select * from A full join B on A.name = B.name;
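Why this requires full history can be seen in a minimal Python sketch (not Flink code): every arriving record must be matched against everything previously seen on the other side of the join. The function and variable names here are illustrative, not part of the article's system.

```python
from collections import defaultdict

def full_join_stream(events):
    """Simulate a streaming full join on `name`.

    events: iterable of (side, name, value), where side is 'A' or 'B'.
    Each record is matched against the *entire* history of the other
    stream -- this unbounded history is exactly the hard part.
    Emits (name, a_value, b_value) tuples as matches appear.
    """
    hist = {"A": defaultdict(list), "B": defaultdict(list)}
    out = []
    for side, name, value in events:
        other = "B" if side == "A" else "A"
        hist[side][name].append(value)          # state grows without bound
        for match in hist[other][name]:         # join against full history
            pair = (value, match) if side == "A" else (match, value)
            out.append((name, *pair))
    return out
```

Note that the per-key history never shrinks, which is why the article later turns to external KV stores for state.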

Problem 2: full‑history aggregation over a real‑time stream, such as row_number and min calculations whose groups keep changing as new records arrive. Example SQL snippets: select id, name, val, row_number() over (partition by name order by val) as rn from A; select name, min(val) from A group by name;
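The difficulty with such aggregations is that earlier results may be invalidated by later records, so the operator must emit updates rather than a single answer. A minimal Python sketch of the min case (illustrative names, not the article's code):

```python
def incremental_min(rows):
    """Simulate `select name, min(val) from A group by name` on a stream.

    rows: iterable of (name, val). Emits (name, new_min) only when a
    group's minimum changes -- mirroring how a streaming engine must
    retract and update previously emitted results.
    """
    mins = {}
    updates = []
    for name, val in rows:
        if name not in mins or val < mins[name]:
            mins[name] = val                 # previous result is now stale
            updates.append((name, val))      # emit the corrected value
    return updates
```

row_number over a changing partition is harder still: one new row can shift the rank of every row behind it, forcing a cascade of updates.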

The challenges include acquiring historical data for streams, managing large state in memory or external KV stores (e.g., HBase, Redis), ensuring correct event ordering, and achieving acceptable performance by reducing I/O and increasing concurrency.
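One common tactic for the I/O challenge is a read-through cache in front of the external KV store, so repeated lookups for hot keys never hit the backend. A hedged Python sketch, assuming a hypothetical `kv_get` backend function (e.g., wrapping an HBase or Redis client):

```python
class CachedKVLookup:
    """Read-through cache over an external KV store.

    `kv_get` is a hypothetical backend lookup (e.g., a Redis GET).
    Keeping hot keys in local memory cuts per-record network I/O,
    one of the performance levers the article mentions.
    """

    def __init__(self, kv_get):
        self.kv_get = kv_get
        self.cache = {}
        self.backend_calls = 0   # tracked here only to show the savings

    def get(self, key):
        if key not in self.cache:
            self.backend_calls += 1
            self.cache[key] = self.kv_get(key)   # single backend round trip
        return self.cache[key]
```

A real implementation would also need an eviction policy and invalidation on writes, which this sketch omits.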

The proposed solution adopts a component‑based design called RDM Building, consisting of three layers: Components Config, RDDM Component Builder, and RDDM Component Parse. User‑defined configurations generate component objects that are translated into Flink SQL operators. The HisRows Component fetches historical data from KV storage, merges it with incoming streams (A, B, C), and produces enriched streams (A_all, B_all, C_all) for downstream SQL computation.
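The HisRows idea can be sketched in a few lines of Python: the first time a key appears on the live stream, pull that key's historical rows from KV storage and emit them ahead of the live record, so downstream SQL sees the full dataset (A becomes A_all). The `kv_fetch` callable is an assumed stand-in for the real KV lookup.

```python
def his_rows(stream, kv_fetch):
    """Enrich a keyed stream with historical rows from KV storage.

    stream:   iterable of (key, row) live records.
    kv_fetch: hypothetical backend lookup returning a list of
              historical rows for a key (empty list if none).
    Yields history first, then live records, per key.
    """
    seen = set()
    for key, row in stream:
        if key not in seen:
            seen.add(key)                      # fetch history only once per key
            for hist_row in kv_fetch(key):
                yield (key, hist_row)
        yield (key, row)
```

Ordering matters here: emitting history before the first live record is what lets the downstream full join or aggregation treat A_all as if it had consumed the table from the beginning.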

Key processing steps: keyBy on join keys, window aggregation to coalesce multiple binlog messages, conversion to unified tuple format (including insert and delete records), union of streams, flatMap to load historical data from cache or KV store, and final SQL execution on the enriched streams.
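Two of these steps can be illustrated in plain Python (a sketch of the logic, not Flink operators): coalescing multiple binlog messages per key within a window so only the latest survives, and converting each change into unified tuples, where an UPDATE is represented as a delete of the old row image plus an insert of the new one. Field names like `op`, `before`, and `after` follow common binlog conventions and are assumptions here.

```python
def coalesce_window(binlog_msgs):
    """Keep only the last binlog message per key within a window,
    reducing the volume of work sent downstream."""
    latest = {}
    for msg in binlog_msgs:
        latest[msg["key"]] = msg       # later messages overwrite earlier ones
    return list(latest.values())

def to_tuples(msg):
    """Convert a binlog message to unified (op, key, row) change tuples."""
    op, key = msg["op"], msg["key"]
    if op == "INSERT":
        return [("insert", key, msg["after"])]
    if op == "DELETE":
        return [("delete", key, msg["before"])]
    # UPDATE: retract the old row image, then add the new one
    return [("delete", key, msg["before"]), ("insert", key, msg["after"])]
```

Expressing updates as delete+insert pairs is what allows a single union of streams to carry every kind of change through the rest of the pipeline.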

Future planning includes building a visual front‑end for component configuration, extending support to other real‑time engines such as Spark Streaming, and broadening KV storage options beyond HBase and Redis.

Tags: real-time, Flink, SQL, streaming, dimension modeling
Written by DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
