How to Pick Real-Time Dimension & Result Tables for Cloud‑Native Big Data
This article examines the evolution of big‑data architectures toward cloud‑native, real‑time processing, and provides a detailed comparison of dimension‑table and result‑table options—including MySQL, Redis, and Alibaba Cloud Tablestore—along with their performance, cost, and scalability characteristics for Flink SQL workloads.
Preface
Traditional big‑data technologies trace back to Google's papers on GFS, MapReduce, and Bigtable, which inspired the open‑source HDFS, Hadoop MapReduce, and HBase. Early deployments relied on on‑premises Hadoop clusters for offline batch processing and large‑scale storage.
With growing data volume, diverse use cases, and the rise of cloud services, classic architectures can no longer meet requirements. Modern big‑data systems now emphasize three trends: massive scale, real‑time processing, and cloud‑native design.
Real‑Time Computing in Big‑Data Architecture
Typical real‑time scenarios include:
Real‑time data warehouses for PV/UV, transaction, and sales statistics.
Real‑time recommendation engines powered by AI.
Streaming ETL for data synchronization and preprocessing.
Real‑time fraud detection and monitoring in finance.
Apache Flink has become the de‑facto engine for low‑latency stream processing, offering a SQL front‑end that abstracts away the underlying execution details.
A Flink SQL job works with three kinds of tables:
Source tables – e.g., Kafka, MQ, or CDC streams.
Result tables – destinations such as MySQL, HBase, etc.
Dimension tables – stores of static or slowly changing reference data (e.g., MySQL, Redis).
Below is a typical Flink SQL job that calculates per‑minute GMV by joining a Kafka source with a Redis dimension table and writing the result to MySQL:
-- Source table: user order events
CREATE TEMPORARY TABLE user_action_source (
  `timestamp` BIGINT,
  `user_id` BIGINT,
  `item_id` BIGINT,
  `price` DOUBLE,
  -- TUMBLE needs a time attribute; this assumes `timestamp` holds epoch milliseconds
  `ts` AS TO_TIMESTAMP_LTZ(`timestamp`, 3),
  -- The 5-second out-of-order tolerance is an illustrative value
  WATERMARK FOR `ts` AS `ts` - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = '<your_topic>',
  'properties.bootstrap.servers' = 'your_kafka_server:9092',
  'properties.group.id' = '<your_consumer_group>',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

-- Dimension table: item details
CREATE TEMPORARY TABLE item_detail_dim (
  id STRING,
  category STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'redis',
  'host' = '<your_redis_host>',
  'port' = '<your_redis_port>',
  'password' = '<your_redis_password>',
  'dbNum' = '<your_db_num>'
);

-- Result table: GMV per minute and category
CREATE TEMPORARY TABLE gmv_output (
  time_minute STRING,
  category STRING,
  gmv DOUBLE,
  PRIMARY KEY (time_minute, category) NOT ENFORCED
) WITH (
  -- Options written with quoted keys, matching the two tables above
  'connector' = 'rds',
  'url' = '<your_jdbc_mysql_url_with_database>',
  'tableName' = '<your_table>',
  'userName' = '<your_mysql_database_username>',
  'password' = '<your_mysql_password>'
);

INSERT INTO gmv_output
SELECT
  CAST(TUMBLE_START(s.ts, INTERVAL '1' MINUTE) AS STRING) AS time_minute,
  d.category,
  SUM(s.price) AS gmv  -- price comes from the order stream, not the dimension table
FROM user_action_source AS s
JOIN item_detail_dim FOR SYSTEM_TIME AS OF PROCTIME() AS d
  ON CAST(s.item_id AS STRING) = d.id  -- item_id is BIGINT, the dimension key is STRING
GROUP BY TUMBLE(s.ts, INTERVAL '1' MINUTE), d.category;
Dimension Table Design
A dimension table store must serve high‑throughput, low‑latency point reads, integrate tightly with the compute engine, and scale its compute resources elastically. Three common solutions are evaluated:
MySQL – widely used but suffers from limited storage elasticity, high cost at scale, and poor performance for massive QPS.
Redis – offers millisecond‑level reads and easy horizontal scaling, yet memory cost becomes prohibitive for large dimension datasets.
Tablestore (Alibaba Cloud) – a cloud‑native NoSQL store with storage‑compute separation, serverless operation, high‑throughput writes, multi‑model indexing, and built‑in CDC (Tunnel Service) for seamless Flink integration.
Cost‑performance comparisons across four workload patterns (high/low storage × high/low compute) show Tablestore delivering the best balance for large‑scale, high‑throughput scenarios, while Redis excels in low‑storage, high‑compute cases.
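As a rough illustration of how the Tablestore option could plug into a job like the one above, the sketch below declares a Tablestore‑backed dimension table and uses it in a lookup join. It assumes the Tablestore connector of Alibaba Cloud Realtime Compute for Apache Flink (`'connector' = 'ots'`). The option names, the table name `item_detail_ots_dim`, and all placeholders are assumptions to verify against the connector documentation, not details taken from the original article.

```sql
-- Hypothetical Tablestore dimension table; connector options are assumptions
CREATE TEMPORARY TABLE item_detail_ots_dim (
  id BIGINT,          -- Tablestore primary key, typed to match item_id in the stream
  category STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'ots',
  'endPoint' = '<your_tablestore_endpoint>',
  'instanceName' = '<your_instance_name>',
  'tableName' = '<your_table>',
  'accessId' = '<your_access_key_id>',
  'accessKey' = '<your_access_key_secret>'
);

-- Enrich each order event with its item category via a lookup join
SELECT s.user_id, s.item_id, d.category, s.price
FROM user_action_source AS s
JOIN item_detail_ots_dim FOR SYSTEM_TIME AS OF PROCTIME() AS d
  ON s.item_id = d.id;
```

Because the lookup join syntax stays the same, moving from Redis to Tablestore in this sketch only changes the table declaration, not the query logic.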
Result Table Design
Result tables store the output of real‑time jobs and must support:
Petabyte‑scale storage.
Rich query and aggregation capabilities (OLAP, search, MPP).
High‑throughput writes.
Data derivation via CDC.
Cloud‑native, storage‑compute separation.
Candidate technologies include:
MySQL – familiar but limited by storage cost, query capability, and write throughput.
HBase – scalable, high‑throughput writes, but weak query support, no CDC, and high operational overhead.
HBase + Elasticsearch – adds search capabilities but doubles infrastructure and still lacks CDC.
Tablestore – provides high‑throughput writes, multi‑type indexing, CDC, and a serverless, cost‑effective cloud‑native model.
For a scenario that stores billions of e‑commerce orders (≈1 TB), sustains 1,000 writes/sec, and serves occasional analytical queries, the comparison finds that Tablestore meets the performance requirements at the lowest estimated monthly cost among the candidates above.
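For comparison with the MySQL result table in the earlier job, here is a minimal sketch of the same GMV output declared against Tablestore. As before, the `'ots'` connector name, its option names (including `valueColumns`), and the table name `gmv_output_ots` are assumptions based on Alibaba Cloud's Realtime Compute connector and should be checked against its documentation.

```sql
-- Hypothetical Tablestore result table; connector options are assumptions
CREATE TEMPORARY TABLE gmv_output_ots (
  time_minute STRING,
  category STRING,
  gmv DOUBLE,
  -- The two key columns map to the Tablestore primary key
  PRIMARY KEY (time_minute, category) NOT ENFORCED
) WITH (
  'connector' = 'ots',
  'endPoint' = '<your_tablestore_endpoint>',
  'instanceName' = '<your_instance_name>',
  'tableName' = '<your_table>',
  'accessId' = '<your_access_key_id>',
  'accessKey' = '<your_access_key_secret>',
  'valueColumns' = 'gmv'   -- non-key columns written as attribute columns
);
```

The aggregation query itself would not need to change. Only the target of its `INSERT INTO` would point at this table instead of `gmv_output`.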
Conclusion
The article presents a comprehensive analysis of real‑time dimension and result table choices within a cloud‑native big‑data stack. Alibaba Cloud Tablestore stands out for its elasticity, low cost, and tight Flink integration, making it a strong candidate for most large‑scale streaming workloads.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
