Building a Real‑Time Streaming Data Warehouse with Paimon on Kubernetes for Supply‑Chain Logistics
This article presents a step‑by‑step guide on how the logistics provider Haicheng Bangda implemented a streaming data warehouse using Paimon, Flink CDC, and Kubernetes, covering business background, architecture choices, environment setup, SQL examples, troubleshooting tips, and future roadmap for their digital transformation.
The document introduces Haicheng Bangda, a supply‑chain logistics service provider with over 2,000 employees and annual revenue exceeding 12 billion CNY, and explains the need for real‑time monitoring of business processes as the company scales.
The "Business background" section outlines the responsibilities of the Operations & Process Management Department, which tracks order volumes, revenue, and resource usage across multiple regions and business units.
The "Data‑warehouse batch processing architecture" section shows the existing offline batch pipeline and the limitations of the current reporting tools, which motivate the shift to a streaming solution.
The "Big‑data technology pain points and selection" section compares the Hadoop, Lambda, and Kappa architectures and highlights the main drawback of each (lack of real‑time support, operational complexity, and limited offline capabilities, respectively). It also introduces Apache Iceberg as a potential but imperfect alternative.
The "Streaming data warehouse (Kappa continuation)" section describes the adoption of Paimon as the core storage engine, which enables end‑to‑end traceability, layer reuse, and combined batch‑and‑stream processing while reducing storage and compute waste.
The "Production practice" section details the deployment of Flink 1.16.0 on Kubernetes using the StreamPark platform. It includes:
Downloading and extracting Flink binaries.
Configuring flink-conf.yaml with job manager and task manager settings, checkpoint directories, OSS integration, and checkpoint retention.
Creating a Dockerfile that adds the OSS‑FS‑Hadoop plugin.
Building and pushing a custom Flink image to an Alibaba Cloud container registry.
Setting up Kubernetes namespaces, RBAC, secrets, and OSS CSI volumes for persistent checkpoints and savepoints.
Defining a pod template that mounts the checkpoint and savepoint PVCs.
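The checkpoint behaviour configured in the steps above can also be set per job from a Flink SQL session. The following is a minimal sketch, not the article's actual flink-conf.yaml: the interval and mode values are illustrative assumptions. Directory settings such as state.checkpoints.dir and state.savepoints.dir are assumed to stay in flink-conf.yaml, pointing at the OSS‑backed volumes, and are left out here.

```sql
-- Illustrative values only; the article's real settings are not reproduced here.
-- Trigger a checkpoint every 60 seconds (assumed interval).
SET 'execution.checkpointing.interval' = '60s';
-- Use exactly-once checkpointing semantics.
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
-- Keep the last externalized checkpoint when the job is cancelled,
-- so that it can be used to restore the job later.
SET 'execution.checkpointing.externalized-checkpoint-retention' = 'RETAIN_ON_CANCELLATION';
```

Setting these per job keeps the shared image and cluster configuration unchanged while individual pipelines tune their own checkpoint cadence.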
SQL examples illustrate how to create temporary source tables for PostgreSQL, MySQL, and SQL Server using CDC connectors, then create matching Paimon ODS tables, and finally insert data into the ODS, DWD, DWM, and DWS layers with appropriate partitioning, bucket settings, and aggregation properties.
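The source‑to‑ODS pattern described above might look roughly like the following for a MySQL table. This is a sketch under stated assumptions, not the article's own SQL: the catalog name, warehouse path, host, credentials, and the table and column names are all hypothetical, and the bucket count is illustrative.

```sql
-- Hypothetical names and placeholder credentials throughout.
CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'oss://example-bucket/paimon'  -- assumed warehouse path
);
USE CATALOG paimon_catalog;
CREATE DATABASE IF NOT EXISTS ods;

-- Temporary source table that reads a MySQL table through the Flink CDC connector.
CREATE TEMPORARY TABLE orders_source (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(18, 2),
    created_at  TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql.example.internal',
    'port'          = '3306',
    'username'      = 'flink_user',
    'password'      = '******',
    'database-name' = 'logistics',
    'table-name'    = 'orders'
);

-- Matching Paimon ODS table with the same schema and primary key.
CREATE TABLE IF NOT EXISTS ods.ods_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(18, 2),
    created_at  TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'bucket' = '4',                   -- illustrative bucket count
    'changelog-producer' = 'input'    -- reuse the CDC changelog as the table changelog
);

-- Continuously write the change stream into the ODS layer.
INSERT INTO ods.ods_orders
SELECT order_id, customer_id, amount, created_at
FROM orders_source;
```

The table is left unpartitioned to keep the sketch short. In Paimon, a partitioned primary‑key table needs its partition columns included in the primary key, so adding partitioning would also change the key definition.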
The "Problem analysis and solutions" section covers common issues and their fixes:
Inaccurate aggregation in ADS layers, resolved by using 'changelog-producer'='full-compaction' and adjusting changelog-producer.compaction-interval.
Missing updates due to a conflicting 'sequence.field'.
Unsupported retraction for non‑SUM aggregate functions, mitigated by setting 'fields.<field>.ignore-retract'='true'.
Flink task timeouts, solved by increasing akka.ask.timeout and web.timeout.
Checkpoint failures caused by CPU‑intensive processing, addressed by increasing parallelism, task slots, and job manager resources.
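Two of the fixes above are table options, so they can be shown together on one aggregation table. The following is a hedged sketch: the table and column names are hypothetical, the interval value is illustrative, and it assumes a Paimon catalog is the current catalog. The option names are the ones cited in the text, and their availability depends on the Paimon version in use.

```sql
-- Hypothetical table and column names; option values are illustrative.
CREATE TABLE IF NOT EXISTS dws_order_stats (
    region       STRING,
    stat_date    STRING,
    total_amount DECIMAL(18, 2),
    max_amount   DECIMAL(18, 2),
    PRIMARY KEY (region, stat_date) NOT ENFORCED
) WITH (
    -- Pre-aggregate rows that share the same primary key.
    'merge-engine' = 'aggregation',
    'fields.total_amount.aggregate-function' = 'sum',
    'fields.max_amount.aggregate-function'   = 'max',
    -- max cannot process retraction messages, so ignore them for this field.
    'fields.max_amount.ignore-retract' = 'true',
    -- Emit a complete changelog after full compaction so that downstream
    -- layers read consistent aggregated values.
    'changelog-producer' = 'full-compaction',
    'changelog-producer.compaction-interval' = '2 min'
);
```

A shorter compaction interval lowers the latency of the changelog seen downstream at the cost of more frequent compaction work, so the value is a trade‑off to tune per table.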
The "Future planning" section mentions integrating Paimon metadata into the internal BonData platform for data governance, connecting Trino to Doris for a unified offline‑and‑online service, and continuing the stream‑batch unified data‑warehouse construction across the group.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
