Building a Real‑Time Data Warehouse with Apache Doris at Shuhai Supply Chain
This article describes how Shuhai Supply Chain upgraded its data warehouse from a complex, high‑cost 1.0 architecture to a streamlined, real‑time solution built around Apache Doris, detailing the motivations, design choices, zero‑code ingestion, metadata management, Flink connector, and the resulting performance gains.
Shuhai Supply Chain, founded in 2011, provides end‑to‑end catering supply‑chain services and generates massive amounts of data across sales, finance, procurement, warehousing, transportation, and more.
In the original 1.0 architecture, data production chains were long, the system was overly complex, and both development and operational costs were high. Problems included redundant data copies, lack of standard SQL support, slow aggregation, high development effort, and poor responsiveness to business data requests.
To address these issues, the team migrated to a 2.0 architecture centered on Apache Doris. Doris was chosen for its rich feature set, MySQL compatibility, online schema change, ease of operation, scalability, and high availability.
The new data‑warehouse pipeline consists of unified data collection (Canal for MySQL binlog, Flume for logs, custom interfaces), message queuing with Kafka for high‑throughput ingestion, Flink for ETL and real‑time statistics, and Doris Stream Load for loading data into the warehouse. Data governance covers metadata, quality, standards, and security.
Data ingestion is simplified into three methods: Routine Load for asynchronous real‑time data, Stream Load for zero‑code business data ingestion, and INSERT INTO for scheduled table generation. This shortens the data chain and improves real‑time capabilities.
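As a rough illustration of the Stream Load path, the sketch below only assembles the HTTP request that a client would send to a Doris frontend; it does not send anything. The host name, credentials, database, and table are hypothetical placeholders, and the article does not show the team's actual loader. The endpoint shape and the `label`, `format`, and `strip_outer_array` headers follow the publicly documented Stream Load interface.

```python
import base64
import json
import uuid


def build_stream_load_request(fe_host, db, table, rows, user, password,
                              http_port=8030, label=None):
    """Assemble (but do not send) a Doris Stream Load request for JSON rows."""
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        # A unique label lets Doris reject an accidental replay of the same batch.
        "label": label or f"{table}_{uuid.uuid4().hex}",
        "format": "json",
        # The body is a JSON array; each element is loaded as one row.
        "strip_outer_array": "true",
    }
    body = json.dumps(rows).encode("utf-8")
    return {"method": "PUT", "url": url, "headers": headers, "body": body}


# Hypothetical host, credentials, and table, for illustration only.
request = build_stream_load_request(
    "fe.example.internal", "ods", "orders",
    [{"id": 1, "status": "PAID"}],
    user="loader", password="secret", label="orders_batch_1",
)
```

Any HTTP client that supports PUT with basic authentication could send the resulting request. Reusing a label for a retried batch is what makes the retry safe.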
Zero‑code ingestion works by using Canal to capture the MySQL binlog, converting DataX output to the same message format, and feeding both into Flink, which writes to Doris via Stream Load. ETL rules are defined in a visual engine and deployed to Flink jobs automatically.
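The conversion step in this flow can be sketched as a small function that turns one Canal change message into rows ready for loading. This is a minimal sketch, assuming Canal's flat JSON message layout (`type`, `isDdl`, `data`); the function name and the `_deleted` marker column are hypothetical and are not taken from the team's implementation.

```python
def canal_to_doris_rows(message, delete_flag="_deleted"):
    """Convert one Canal flat-JSON change message into a list of row dicts.

    Assumption: deletes are carried as ordinary rows with a marker column
    (hypothetical name `_deleted`) so the sink can decide how to apply them.
    """
    if message.get("isDdl"):
        return []  # schema changes are not row data
    op = message.get("type")
    if op not in ("INSERT", "UPDATE", "DELETE"):
        return []
    deleted = 1 if op == "DELETE" else 0
    rows = []
    for record in message.get("data") or []:
        row = dict(record)  # copy so the source message is left untouched
        row[delete_flag] = deleted
        rows.append(row)
    return rows


# Example message in Canal's flat JSON shape; the values are made up.
update_msg = {
    "database": "shop", "table": "orders", "type": "UPDATE", "isDdl": False,
    "data": [{"id": "1", "status": "PAID"}],
    "old": [{"status": "NEW"}],
}
rows = canal_to_doris_rows(update_msg)
```

For an UPDATE, only the post-change image in `data` is forwarded. That fits a target table that keeps the latest row per key, which is an assumption here rather than something the article states.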
Metadata management builds a data map and lineage by parsing Doris audit logs and manually defining relationships in the ODS layer, providing searchable physical and logical models.
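To make the lineage idea concrete, here is a deliberately simple sketch that derives table-level lineage from a SQL statement. It assumes the statement text has already been extracted from an audit log entry (the log format itself is not shown in the article), and it uses regular expressions rather than a real SQL parser, so it will miss CTEs, comments, and other edge cases. The function name and table names are hypothetical.

```python
import re

# Target of an INSERT INTO statement, e.g. "INSERT INTO dws.sales_daily ...".
_TARGET = re.compile(r"\binsert\s+into\s+([\w.`]+)", re.IGNORECASE)
# Tables that directly follow FROM or JOIN; subqueries start with "(" and are skipped.
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.`]+)", re.IGNORECASE)


def extract_lineage(sql):
    """Return (target_table, set_of_source_tables), or None if not an INSERT."""
    target = _TARGET.search(sql)
    if target is None:
        return None
    target_table = target.group(1).replace("`", "")
    sources = {name.replace("`", "") for name in _SOURCE.findall(sql)}
    sources.discard(target_table)  # ignore self-references
    return target_table, sources


statement = (
    "INSERT INTO dws.sales_daily "
    "SELECT s.region, SUM(o.amount) "
    "FROM dwd.orders o JOIN dim.store s ON o.store_id = s.id "
    "GROUP BY s.region"
)
lineage = extract_lineage(statement)
```

Collecting these pairs across many statements yields the edges of a lineage graph. Relationships that never appear in SQL, such as those into the ODS layer, still have to be defined by hand, as the text notes.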
A no‑code API development platform lets analysts create data services without writing code, supporting visual enable/disable, black‑/white‑list access control, rate limiting, and circuit breaking.
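Rate limiting and circuit breaking are the two protective mechanisms named here that are easiest to show in code. The sketch below uses a token bucket and a failure-count circuit breaker, two common techniques for these jobs; the article does not say which algorithms the platform actually uses, and the class names and thresholds are hypothetical. Time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, failure_threshold, reset_after):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.opened_at is None:
            return True
        # After the cool-down, let trial calls through (half-open state).
        return now - self.opened_at >= self.reset_after

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
```

A gateway would typically check the access list first, then the bucket, then the breaker, and call `record` with the outcome of each backend query so that repeated failures stop traffic before it reaches Doris.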
The team also contributed a Flink Doris Connector to the open‑source community, enabling parallel reads and writes with FlinkSQL for large‑scale, low‑latency analytics.
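For readers unfamiliar with how such a connector is used from FlinkSQL, the following sketch renders a table definition that points at a Doris table. The option keys (`connector`, `fenodes`, `table.identifier`, `username`, `password`) are those documented for the Flink Doris Connector; the helper function, table name, columns, host, and credentials are hypothetical, and a real job would pass the resulting statement to a Flink table environment.

```python
def doris_table_ddl(name, columns, fenodes, table_identifier, username, password):
    """Render a FlinkSQL CREATE TABLE statement backed by the Doris connector."""
    column_lines = ",\n".join(f"  `{col}` {col_type}" for col, col_type in columns)
    options = {
        "connector": "doris",
        "fenodes": fenodes,                    # frontend host:http_port
        "table.identifier": table_identifier,  # database.table in Doris
        "username": username,
        "password": password,
    }
    option_lines = ",\n".join(f"  '{key}' = '{value}'" for key, value in options.items())
    return f"CREATE TABLE `{name}` (\n{column_lines}\n) WITH (\n{option_lines}\n)"


# Hypothetical table and connection details, for illustration only.
ddl = doris_table_ddl(
    "orders_sink",
    [("id", "BIGINT"), ("amount", "DECIMAL(18, 2)")],
    fenodes="fe.example.internal:8030",
    table_identifier="ods.orders",
    username="loader",
    password="secret",
)
print(ddl)
```

Once such a table is registered, an `INSERT INTO orders_sink SELECT ...` statement writes to Doris and a `SELECT ... FROM` it reads from Doris, with Flink's job parallelism governing how the work is split.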
In production, a 10‑node Doris cluster serves dozens of business lines with millisecond‑level responses, answers queries on million‑row tables in under a second, and ingests 300,000 to 400,000 rows per second without affecting analytical queries.
Overall, the Doris‑based warehouse reduced development effort, accelerated data onboarding, improved query speed, and provided a stable, scalable platform for real‑time business intelligence.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
