How We Built a Scalable Real‑Time Data Architecture for a Complex Supply Chain

This article describes the challenges of a highly complex supply‑chain system, the evolution from early MySQL‑based reporting to a modern real‑time data platform using Flink, Kafka, ClickHouse, Hologres and other cloud services, and the tools and lessons learned to achieve low‑latency, high‑throughput analytics.

Alibaba Cloud Big Data AI Platform

Mar 1, 2023

1 Background

DeWu's supply-chain business is highly complex: it spans JIT spot sales, large-scale warehousing, consignment, brand operations, and intricate reverse logistics. Ensuring timely order fulfillment requires precise monitoring of people, goods, venues, and vehicles, which in turn demands fine-grained data across many dimensions.

1.1 Early Stage

Initially, report queries in the backend management system were slow because of large-table joins and index bloat, and multi-dimensional analysis over massive data volumes was difficult. MySQL, designed for OLTP, could not support the emerging real-time analytical needs.

To address this, the team explored a real‑time data architecture, aiming to unify fulfillment, warehousing, and transportation data.

2.1 Original Phase

2.1.1 Real‑time join with AnalyticDB for MySQL (ADB)

Using Alibaba Cloud DTS, business tables were synchronized to ADB in real time. Because ADB supports MySQL-compatible syntax, reports could use arbitrary SQL joins. This performed well for large single-table queries and simple dimension-table joins, but complex queries consumed a lot of memory and CPU, and concurrency was limited.

2.1.2 Building Wide Tables with Otter

Otter, which builds on the open-source Canal project, captured incremental change logs, and downstream processes used them to build wide tables stored in MySQL, which improved single-table query speed. However, this approach introduced tight coupling, made debugging complex, and pushed ETL logic onto the MySQL replica.

2.2 Real‑time Architecture 1.0

2.2.1 Flink + Kafka + ClickHouse

After evaluating alternatives, the team adopted ClickHouse as the OLAP store, with data streamed in through Flink and Kafka. ClickHouse writes are append-only, its update and delete support is limited, and it needs careful partitioning to maintain performance. The characteristics that mattered most for this workload were:

Append‑only writes generate part files that are periodically merged.

Weak update/delete capabilities hinder atomicity and real‑time guarantees.

ClickHouse excels with large volumes and simple models but struggles with frequent updates.

Compute‑storage coupling means complex queries can degrade write performance.

Because ClickHouse lacks upsert, the team pre-aggregated data in Flink before writing it, so that Flink rather than ClickHouse dealt with the long lifecycle of supply-chain data and its extensive join requirements.
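
As a rough illustration of pre-aggregating before an append-only write, the sketch below folds a micro-batch of change events into one row per key, so the sink only receives the final state. The event fields (`order_id`, `status`, `qty`, `event_time`) and the function name are hypothetical; in the architecture described here this role is played by Flink, not plain Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderEvent:
    # Hypothetical event shape, chosen only for illustration.
    order_id: str
    status: str
    qty: int
    event_time: int  # epoch seconds


def pre_aggregate(events):
    """Collapse a micro-batch to one row per order_id.

    Keeps the status of the newest event and sums the quantities, so an
    append-only sink receives one row per key instead of one row per change.
    """
    latest = {}
    totals = {}
    for ev in events:
        totals[ev.order_id] = totals.get(ev.order_id, 0) + ev.qty
        current = latest.get(ev.order_id)
        if current is None or ev.event_time >= current.event_time:
            latest[ev.order_id] = ev
    return [
        {"order_id": key, "status": latest[key].status, "total_qty": totals[key]}
        for key in sorted(latest)
    ]


batch = [
    OrderEvent("A1", "CREATED", 2, 100),
    OrderEvent("A1", "SHIPPED", 0, 160),
    OrderEvent("B7", "CREATED", 5, 120),
]
rows = pre_aggregate(batch)
```

Here the two events for `A1` collapse into a single row carrying the newest status, which is the property an append-only store needs from its upstream.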

2.3 Real‑time Architecture 2.0

2.3.1 Flink + Kafka + Hologres

The team needed an OLAP database with upsert support, compute-storage separation, decent join capability, robust partitioning, and backup, which led them to evaluate Hologres. Its hybrid row-column storage design matched these expectations.

2.3.2 Challenges Encountered

Multi‑time Segment Key

When a table has several time fields, choosing an appropriate segment_key (an ordered time field) reduces the number of files scanned and improves query performance, even for queries that do not filter on the segment_key explicitly.
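
The effect of an ordered time field on file scanning can be pictured with a simplified model. The sketch below assumes each data file records the minimum and maximum value of the segment key it contains, and skips files whose range cannot overlap the query. The `DataFile` structure and `prune_files` function are hypothetical and are not Hologres internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical file metadata: min/max of the segment key stored in the file.
    name: str
    min_ts: int
    max_ts: int


def prune_files(files, query_start, query_end):
    """Return only the files whose [min_ts, max_ts] range overlaps the query range."""
    return [
        f for f in files
        if f.max_ts >= query_start and f.min_ts <= query_end
    ]


# Files written in time order have narrow, mostly disjoint ranges.
files = [
    DataFile("part-0", 0, 99),
    DataFile("part-1", 100, 199),
    DataFile("part-2", 200, 299),
]
to_scan = [f.name for f in prune_files(files, 150, 210)]
```

With a monotonically increasing field the ranges stay narrow, so most files are skipped. A field that is updated out of order would widen every range and make pruning far less effective.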

Batch‑Stream Fusion

Supply-chain data is long-lived, so it can outlast both Flink's state TTL and Kafka's retention period. To handle this, the solution merged deduplicated offline data with the real-time stream, using last_value to keep the latest event per key and union all + group by as an alternative to joins.
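
The union all + group by pattern can be sketched outside Flink as follows. Rows from the offline snapshot and the real-time stream are padded to one shared schema, concatenated, and then grouped by key, keeping the last non-null value of each column. The field names (`order_id`, `ts`, `warehouse`, `status`) and both helper functions are hypothetical, and the sketch only approximates how last_value behaves in a Flink SQL job.

```python
def union_all(*sources):
    """Concatenate rows from several sources that share one padded schema."""
    merged = []
    for rows in sources:
        merged.extend(rows)
    return merged


def group_last_value(rows, key, time_field, columns):
    """Group by `key` and keep, per column, the last non-null value by time."""
    result = {}
    for row in sorted(rows, key=lambda r: r[time_field]):
        acc = result.setdefault(row[key], {c: None for c in columns})
        for col in columns:
            if row.get(col) is not None:
                acc[col] = row[col]
    return result


# Offline snapshot (already deduplicated) and real-time changes, same schema.
offline = [
    {"order_id": "A1", "ts": 10, "warehouse": "WH-1", "status": "CREATED"},
]
realtime = [
    {"order_id": "A1", "ts": 50, "warehouse": None, "status": "SHIPPED"},
    {"order_id": "B7", "ts": 60, "warehouse": "WH-2", "status": "CREATED"},
]

fused = group_last_value(
    union_all(offline, realtime), "order_id", "ts", ["warehouse", "status"]
)
```

Because every source contributes only the columns it knows and leaves the rest null, the grouped result looks like a joined row without a stateful join operator holding both sides.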

Join Operator Disorder

Flink distributes join input across parallel tasks by hashing the join key. When a detail row's header_id changed, its updates were routed to different parallel tasks and could reach the sink out of order, which led to data loss. The fix was to join header to detail first to obtain a stable detail_id, then re-join on that key.

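INSERT INTO sink
SELECT
    detail.id,
    detail.header_id,
    headerNew.id
FROM detail
LEFT JOIN (
    -- Join header to detail first so each matched row carries the stable detail id.
    SELECT
        detail.id AS detail_id,
        detail.header_id,
        header.id
    FROM header
    INNER JOIN detail
        ON detail.header_id = header.id
) headerNew
    -- Re-join on detail_id, which never changes for a given detail row.
    ON detail.id = headerNew.detail_id;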
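
Why the choice of key matters can be shown with a small simulation. The sketch below uses `key % parallelism` as a deterministic stand-in for hash partitioning (an assumption; Flink's real partitioner differs) and checks which parallel tasks receive successive versions of one detail row. The function names and sample values are hypothetical.

```python
def task_for(key, parallelism):
    """Deterministic stand-in for hash partitioning (assumption: integer keys)."""
    return key % parallelism


def tasks_touched(changes, key_field, parallelism):
    """Return the set of parallel tasks that receive changes for one detail row."""
    return {task_for(change[key_field], parallelism) for change in changes}


# Two successive versions of the same detail row: header_id changes from 10 to 11.
changes = [
    {"detail_id": 1, "header_id": 10},
    {"detail_id": 1, "header_id": 11},
]

by_header = tasks_touched(changes, "header_id", parallelism=4)
by_detail = tasks_touched(changes, "detail_id", parallelism=4)
```

Keyed by header_id, the two versions land on two different tasks, so nothing guarantees which one reaches the sink first. Keyed by detail_id, both versions pass through the same task and keep their order.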

2.3.3 Hologres vs. StarRocks

The team compared Hologres and StarRocks, outlining each system’s strengths and shortcomings for their workload.

3 Development Efficiency Tools

3.1 Flink SQL Code Generator

Inspired by MyBatis Generator, a template engine creates Flink SQL code, enforcing standards and boosting productivity.
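
A minimal sketch of the template idea, using Python's standard `string.Template`, is shown below. The template text, table name, topic, and broker address are all hypothetical; the article does not show the real generator's templates, and a production Kafka source table would likely need more connector options than appear here.

```python
from string import Template

# Hypothetical template; the real generator's templates are not shown in the article.
KAFKA_SOURCE_TEMPLATE = Template(
    "CREATE TABLE $table_name (\n"
    "$columns\n"
    ") WITH (\n"
    "    'connector' = 'kafka',\n"
    "    'topic' = '$topic',\n"
    "    'properties.bootstrap.servers' = '$brokers',\n"
    "    'format' = 'json'\n"
    ");"
)


def render_kafka_source(table_name, columns, topic, brokers):
    """Render a Flink SQL DDL statement for a Kafka source table."""
    column_lines = ",\n".join(
        f"    `{name}` {sql_type}" for name, sql_type in columns
    )
    return KAFKA_SOURCE_TEMPLATE.substitute(
        table_name=table_name, columns=column_lines, topic=topic, brokers=brokers
    )


ddl = render_kafka_source(
    table_name="ods_order_detail",
    columns=[("id", "BIGINT"), ("header_id", "BIGINT"), ("status", "STRING")],
    topic="order_detail",
    brokers="localhost:9092",
)
print(ddl)
```

Keeping naming and connector options in one template is what lets a generator enforce standards: developers supply only the table-specific parts.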

3.2 Visual Configuration Platform

An online UI allows users to compose SQL, generate pages and APIs, and publish with one click, incorporating caching and queueing mechanisms for high‑traffic scenarios.
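
The article does not describe how the platform's cache works. As one possible shape, the sketch below puts a simple time-to-live cache in front of a query function so that repeated identical SQL within the TTL does not hit the database. The `TTLCache` class, the `fake_query` function, and the 30-second TTL are hypothetical.

```python
import time


class TTLCache:
    """Cache query results for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._entries = {}  # sql text -> (expiry time, result)

    def get_or_compute(self, sql, run_query):
        now = self.clock()
        entry = self._entries.get(sql)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the database
        result = run_query(sql)
        self._entries[sql] = (now + self.ttl_seconds, result)
        return result


calls = []


def fake_query(sql):
    """Stand-in for a real database call; records each invocation."""
    calls.append(sql)
    return [("rows for", sql)]


fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
first = cache.get_or_compute("SELECT 1", fake_query)
second = cache.get_or_compute("SELECT 1", fake_query)  # served from cache
fake_now[0] = 31.0
third = cache.get_or_compute("SELECT 1", fake_query)  # expired, recomputed
```

The TTL trades freshness for load: a shorter value keeps dashboards closer to real time, while a longer one shields the OLAP store during traffic spikes.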

4 Future Plans

The current architecture still faces an “impossible triangle.” Future explorations include:

Writing data to Hologres while computing in MaxCompute to offload memory pressure.

Adopting Apache Hudi for lake‑house integration, enabling unified batch‑stream storage and computation with near‑real‑time latency.

These directions aim to further reduce latency, improve scalability, and lower operational costs.
