Dimension Table Join Strategies in Apache Flink: Preload, Distributed Cache, Hot Storage, Broadcast, and Temporal Table Function
The article explains various dimension‑table join approaches in Apache Flink, including preloading tables into memory, using distributed cache, leveraging hot storage with async I/O, broadcasting state, and temporal table function joins, and compares their trade‑offs for different data volumes and update frequencies.
1. Background
Fact tables are usually stored in Kafka, while dimension tables reside in external systems such as MySQL or HBase. For each streaming record, a join with an external dimension source can enrich the data, but Flink SQL currently only supports joining with the current snapshot of the dimension table (processing‑time semantics), not with the snapshot corresponding to the fact table’s event time.
2. Dimension Table Join
Preload Dimension Table
The whole dimension table is loaded into memory by implementing a RichFlatMapFunction; the open() method reads the dimension database and caches the data, then the probe stream performs the join against the in‑memory data.
Advantages: simple to implement. Disadvantages: the entire dimension table must fit in memory, and any update requires a job restart, which makes updates slow and costly. Suitable for small dimensions that change infrequently.
Improvement: create a background thread in open() to periodically reload the dimension, avoiding manual restarts.
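A minimal sketch of this pattern, assuming String-keyed dimension records and a placeholder loadDimFromMySQL() standing in for a real JDBC query; the scheduled executor implements the periodic-reload improvement:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class PreloadDimJoin extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<Long, String>> {

    // Snapshot of the dimension table; swapped atomically on each reload.
    private transient volatile Map<String, String> dim;
    private transient ScheduledExecutorService refresher;

    @Override
    public void open(Configuration parameters) throws Exception {
        dim = loadDimFromMySQL();
        // Improvement from the text: reload periodically instead of restarting the job.
        refresher = Executors.newSingleThreadScheduledExecutor();
        refresher.scheduleAtFixedRate(() -> {
            try {
                dim = loadDimFromMySQL();
            } catch (Exception e) {
                // On failure, keep serving the previous snapshot.
            }
        }, 10, 10, TimeUnit.MINUTES);
    }

    @Override
    public void flatMap(Tuple2<String, Long> fact, Collector<Tuple2<Long, String>> out) {
        // Probe the in-memory dimension with the fact record's key.
        String dimValue = dim.get(fact.f0);
        if (dimValue != null) {
            out.collect(Tuple2.of(fact.f1, dimValue));
        }
    }

    @Override
    public void close() {
        if (refresher != null) {
            refresher.shutdownNow();
        }
    }

    // Placeholder: in practice, SELECT the whole dimension table via JDBC.
    private Map<String, String> loadDimFromMySQL() {
        return new ConcurrentHashMap<>();
    }
}
```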
Distributed Cache
Uses Flink’s distributed cache to distribute a local dimension file to each TaskManager and load it into memory.
Register the file with env.registerCachedFile.
Implement a RichFunction and obtain the cached file in open() via RuntimeContext.
Parse and use the file data.
Because the data must fit in memory, it is only suitable for small, rarely updated dimension files such as static code tables or configuration files.
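A minimal sketch of the three steps above, assuming a comma-separated key,value file; the HDFS path and the cache name "dimFile" are illustrative:

```java
import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DistributedCacheDimJoin {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: register the dimension file; Flink ships it to every TaskManager.
        env.registerCachedFile("hdfs:///path/to/dim.csv", "dimFile");

        env.fromElements("k1", "k2")
           .map(new RichMapFunction<String, Tuple2<String, String>>() {
               private transient Map<String, String> dim;

               @Override
               public void open(Configuration parameters) throws Exception {
                   // Step 2: obtain the cached file via the RuntimeContext.
                   File file = getRuntimeContext().getDistributedCache().getFile("dimFile");
                   // Step 3: parse it into an in-memory map.
                   dim = new HashMap<>();
                   for (String line : Files.readAllLines(file.toPath())) {
                       String[] kv = line.split(",", 2);
                       dim.put(kv[0], kv[1]);
                   }
               }

               @Override
               public Tuple2<String, String> map(String key) {
                   return Tuple2.of(key, dim.getOrDefault(key, "unknown"));
               }
           })
           .print();

        env.execute("distributed-cache-dim-join");
    }
}
```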
Hot Storage Association
Dimension data is imported into a hot store such as Redis, Tair, or HBase, then queried asynchronously with an optional cache layer (e.g., Guava Cache) and eviction policies, reducing pressure on the hot store.
Asynchronous I/O can issue many parallel requests, increasing throughput and lowering latency, but may shift bottlenecks to the external store, so caching helps mitigate that.
This approach avoids loading the whole dimension into memory, works for larger dimensions, but introduces cache expiration latency and depends on external hot‑storage resources.
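A sketch of the async lookup with a Guava cache in front of the hot store; queryHotStore() is a placeholder for a real asynchronous client call (Redis, HBase, and so on):

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncDimLookup extends RichAsyncFunction<String, Tuple2<String, String>> {

    private transient Cache<String, String> cache;

    @Override
    public void open(Configuration parameters) {
        // Guava cache with size- and TTL-based eviction to offload the hot store.
        cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .build();
    }

    @Override
    public void asyncInvoke(String key, ResultFuture<Tuple2<String, String>> resultFuture) {
        String cached = cache.getIfPresent(key);
        if (cached != null) {
            resultFuture.complete(Collections.singleton(Tuple2.of(key, cached)));
            return;
        }
        // Cache miss: query the hot store without blocking the task thread.
        queryHotStore(key).thenAccept(value -> {
            cache.put(key, value);
            resultFuture.complete(Collections.singleton(Tuple2.of(key, value)));
        });
    }

    // Placeholder for a real async client (e.g., a Redis or HBase asynchronous get).
    private CompletableFuture<String> queryHotStore(String key) {
        return CompletableFuture.supplyAsync(() -> "dim-value-for-" + key);
    }
}
```

It is wired into the pipeline with AsyncDataStream.unorderedWait(stream, new AsyncDimLookup(), 1000, TimeUnit.MILLISECONDS, 100), where the last argument caps the number of in-flight requests.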
Broadcast Dimension Table
The dimension stream is broadcast to downstream tasks using Broadcast State.
Send the dimension data to Kafka as a broadcast source stream S1.
Define a MapStateDescriptor and call S1.broadcast(descriptor) to obtain the BroadcastStream S2.
Connect the non‑broadcast stream S3 with S2 via connect(), producing BroadcastConnectedStream S4.
Implement the join logic in a KeyedBroadcastProcessFunction or BroadcastProcessFunction and invoke S4.process().
Broadcast joins provide timely updates but still keep the dimension in memory, so they are suitable for dimensions that can be represented as a real‑time stream and are not extremely large.
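A compact sketch of steps 2 through 4, with String keys and values standing in for real dimension records:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastDimJoin {

    // Step 2: the descriptor that names the broadcast state.
    static final MapStateDescriptor<String, String> DIM_DESCRIPTOR =
            new MapStateDescriptor<>("dim-table", Types.STRING, Types.STRING);

    public static DataStream<Tuple2<String, String>> join(
            DataStream<Tuple2<String, String>> dimStream,   // S1: dimension updates from Kafka
            DataStream<String> mainStream) {                // S3: the fact stream

        // Step 2: broadcast the dimension stream -> S2.
        BroadcastStream<Tuple2<String, String>> broadcast = dimStream.broadcast(DIM_DESCRIPTOR);

        // Steps 3-4: connect the streams and run the join logic.
        return mainStream
                .connect(broadcast)
                .process(new BroadcastProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>>() {

                    @Override
                    public void processElement(String key, ReadOnlyContext ctx,
                                               Collector<Tuple2<String, String>> out) throws Exception {
                        // Probe the broadcast state with the fact record's key.
                        String dimValue = ctx.getBroadcastState(DIM_DESCRIPTOR).get(key);
                        if (dimValue != null) {
                            out.collect(Tuple2.of(key, dimValue));
                        }
                    }

                    @Override
                    public void processBroadcastElement(Tuple2<String, String> dimRecord, Context ctx,
                                                        Collector<Tuple2<String, String>> out) throws Exception {
                        // Each dimension update overwrites the broadcast state entry.
                        ctx.getBroadcastState(DIM_DESCRIPTOR).put(dimRecord.f0, dimRecord.f1);
                    }
                });
    }
}
```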
Temporal Table Function Join
A Temporal Table is a view that can return the state of a changing table at a specific point in time, either from a changelog stream or a materialized external table.
The join is expressed by calling this function as a UDTF from the probe side, which returns the dimension version valid at the given time; it is appropriate when the dimension data arrives as a changelog stream and versioned joins are required.
Define a TemporalTableFunction on the changelog stream, specifying a time attribute and the join key.
Register the TemporalTableFunction under a name in the TableEnvironment so that queries can reference it.
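A sketch of the two steps in the Table API, assuming a registered changelog table rates(currency, rate, update_time) where update_time is a time attribute, and an orders table on the probe side; note that temporal table functions are registered through the legacy registerFunction call:

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.functions.TemporalTableFunction;

public class TemporalTableJoinExample {

    public static void register(StreamTableEnvironment tEnv) {
        Table rates = tEnv.from("rates");

        // Step 1: define the function over the time attribute and the join key.
        TemporalTableFunction ratesFn =
                rates.createTemporalTableFunction($("update_time"), $("currency"));

        // Step 2: register it under a name (the legacy API is still required
        // for temporal table functions).
        tEnv.registerFunction("rates_at", ratesFn);

        // Probe side: a LATERAL TABLE join returns the dimension version
        // valid at each order's event time.
        Table result = tEnv.sqlQuery(
                "SELECT o.order_id, o.amount * r.rate AS converted " +
                "FROM orders AS o, LATERAL TABLE (rates_at(o.order_time)) AS r " +
                "WHERE o.currency = r.currency");
    }
}
```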
Dimension Table Join Solutions Comparison
Preload dimension table: simplest to build; the whole table must fit in memory; best for small, infrequently updated dimensions.
Distributed cache: file-based variant of preloading; best for small, essentially static files such as code tables or configuration.
Hot storage association: no full in-memory copy, so it scales to larger dimensions; adds cache-expiration latency and an external storage dependency.
Broadcast dimension table: updates propagate promptly via the dimension stream; the dimension still lives in memory, so its size is bounded.
Temporal table function join: versioned joins against a changelog stream; the right choice when point-in-time correctness is required.
3. Dual Stream Join
Batch processing offers Sort‑Merge Join and Hash Join. In dual‑stream scenarios, joins must continuously update as new records arrive.
Regular Join retains the state of both streams indefinitely, suitable only for bounded streams due to state size constraints.
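For example, a regular join in Flink SQL (table and column names illustrative, assuming a StreamTableEnvironment tEnv with both tables registered) keeps both sides in state indefinitely unless a state TTL such as table.exec.state.ttl is configured:

```java
// Regular (unbounded) join: Flink must retain all rows of both inputs,
// since a future row on either side may still find a match.
Table joined = tEnv.sqlQuery(
        "SELECT o.order_id, p.product_name " +
        "FROM orders AS o " +
        "JOIN products AS p ON o.product_id = p.id");
```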
Interval Join adds a time window constraint, keeping only records whose timestamps fall within a defined interval and share the same join key, allowing state cleanup.
Interval Join supports both processing‑time and event‑time; processing‑time uses system clocks, while event‑time relies on watermarks for windowing and state cleanup.
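A DataStream sketch of an interval join (the DataStream intervalJoin operates on event time only, so watermarks must be assigned on both inputs; the stream shapes are illustrative):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinExample {

    // Keep a payment only if its timestamp lies between 5 minutes before and
    // 10 minutes after the matching order; the bound also lets Flink expire state.
    public static DataStream<String> join(DataStream<Tuple2<String, String>> orders,
                                          DataStream<Tuple2<String, String>> payments) {
        return orders
                .keyBy(o -> o.f0)
                .intervalJoin(payments.keyBy(p -> p.f0))
                .between(Time.minutes(-5), Time.minutes(10))
                .process(new ProcessJoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                    @Override
                    public void processElement(Tuple2<String, String> order,
                                               Tuple2<String, String> payment,
                                               Context ctx,
                                               Collector<String> out) {
                        out.collect(order.f0 + ": " + order.f1 + " matched " + payment.f1);
                    }
                });
    }
}
```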
Window Join matches records with the same key that fall into the same window, similar to an inner join, requiring both key equality and window alignment.
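The corresponding DataStream sketch, assuming event-time tumbling windows of one minute:

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowJoinExample {

    // Records join only when they share a key AND fall into the same
    // 1-minute window: inner-join semantics plus window alignment.
    public static DataStream<String> join(DataStream<Tuple2<String, String>> left,
                                          DataStream<Tuple2<String, String>> right) {
        return left
                .join(right)
                .where(l -> l.f0)
                .equalTo(r -> r.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
                    @Override
                    public String join(Tuple2<String, String> l, Tuple2<String, String> r) {
                        return l.f0 + ": " + l.f1 + " <-> " + r.f1;
                    }
                });
    }
}
```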