Incremental Computation in Big Data: Flink Materialized Table and Paimon
This article explains how Flink 1.20's Materialized Table, combined with Paimon's changelog storage, enables incremental computation that unifies batch and streaming workloads, delivering minute-level latency at lower cost. A materialized-table example illustrates the approach; support is currently streaming-only, with batch extensions planned.
Since the era of Google's "three carriages" papers (GFS, MapReduce, and Bigtable), big-data processing has evolved from offline Hive+Spark pipelines to real-time Flink, yet low-cost near-real-time processing remains challenging.
Incremental computation (also known as Incremental View Maintenance in databases) is gaining attention to unify batch and stream processing. Flink 1.20 introduces Materialized Table (MT) to combine both modes, while Paimon provides Changelog storage.
The article first explains incremental computation concepts, then presents a Flink‑Paimon case study, highlighting current capabilities and limitations.
Key reasons for the renewed interest:
Desire for a unified batch-stream architecture: differing computation models hinder code reuse.
The trade-off between cost and data freshness: pure streaming incurs high resource costs, while batch processing adds latency.
Incremental computation can balance cost and latency by adjusting how frequently increments are computed.
Flink's model is based on changelog-driven incremental computation, which records state changes as a stream of row kinds (+I, -U, +U, -D). In practice, state-TTL assumptions and rollback operations can affect result accuracy.
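The row kinds a query produces can be inspected directly in Flink SQL. A minimal sketch, assuming a hypothetical table t(k, v) already registered in the session:

```sql
-- Show the changelog modes (+I, -U, +U, -D) each operator emits for a query;
-- `t` is a hypothetical table with columns k and v, not from this article.
EXPLAIN CHANGELOG_MODE
SELECT k, SUM(v) AS total FROM t GROUP BY k;
```

The plan output annotates each operator with the row kinds it may produce, which is useful for checking whether a downstream sink must handle retractions.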
Implementation differences:
Databases capture changes via internal mechanisms (e.g., PostgreSQL AFTER triggers) and rewrite queries for incremental results.
Big‑data engines use a generic Changelog model (e.g., Paimon’s Changelog Producer) and let the engine handle incremental logic.
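The database-side approach can be sketched with a PostgreSQL AFTER trigger that captures row changes into a delta table; all table and function names below are illustrative, not from the article:

```sql
-- Hypothetical delta table recording changes to an `orders` table.
CREATE TABLE orders_delta (op TEXT, order_id TEXT, tms_company TEXT);

CREATE OR REPLACE FUNCTION capture_delta() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO orders_delta VALUES ('+I', NEW.order_id, NEW.tms_company);
  ELSIF TG_OP = 'UPDATE' THEN
    -- Record the retraction of the old row and the new image, mirroring -U/+U.
    INSERT INTO orders_delta VALUES ('-U', OLD.order_id, OLD.tms_company),
                                    ('+U', NEW.order_id, NEW.tms_company);
  END IF;
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER orders_capture AFTER INSERT OR UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION capture_delta();
```

A maintenance job can then rewrite the view query to consume only orders_delta, which is the query-rewriting step the database performs internally.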
A concrete Flink‑MT + Paimon example creates a catalog, source table, and a materialized table that aggregates order counts per logistics company. The SQL statements are:
SET 'sql-client.execution.result-mode' = 'tableau';
SET 'table.exec.sink.upsert-materialize' = 'NONE';
SET 'execution.runtime-mode' = 'streaming';
CREATE CATALOG paimon WITH ('type'='paimon','warehouse'='file:/tmp/paimon');
CREATE TABLE tms_source (
order_id STRING PRIMARY KEY NOT ENFORCED,
tms_company STRING NOT NULL
) WITH (
'connector'='paimon',
'path'='file:/tmp/paimon/default.db/tms_source',
'changelog-producer'='lookup',
'scan.remove-normalize'='true'
);
CREATE MATERIALIZED TABLE continues_tms_res (
CONSTRAINT `pk_tms_company` PRIMARY KEY (tms_company) NOT ENFORCED
) WITH (
'format'='debezium-json',
'scan.remove-normalize'='true',
'changelog-producer'='lookup',
'merge-engine'='aggregation',
'fields.order_cnt.aggregate-function'='sum'
) FRESHNESS = INTERVAL '30' SECOND
AS SELECT tms_company, 1 AS order_cnt FROM tms_source;
INSERT INTO tms_source VALUES ('001','ZhongTong');
INSERT INTO tms_source VALUES ('002','ZhongTong');
INSERT INTO tms_source VALUES ('001','YuanTong');
Querying the source and materialized tables shows changelog records (+I, -U, +U) and the aggregated order counts, demonstrating incremental updates without full recomputation.
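In a SQL client session, the result can be observed with plain queries (a sketch; the exact tableau output depends on the session and on when the materialized table refreshes):

```sql
-- Read the changelog-backed source; in streaming mode the client prints
-- the +I/-U/+U records as they arrive (the third insert updates row '001').
SELECT * FROM tms_source;

-- Read the aggregated materialized table; per-company counts are
-- maintained incrementally rather than recomputed from scratch.
SELECT * FROM continues_tms_res;
```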
Although Flink's batch mode does not yet support changelog processing, the streaming mode demonstrates that incremental computation can deliver low-cost near-real-time analytics. Future work may extend this to batch scheduling.
Conclusion: Incremental computation bridges the gap between batch and streaming, offering minute‑level latency at reasonable cost, but pure real‑time scenarios still require dedicated architectures.
DaTaobao Tech
Official account of DaTaobao Technology