Building a Scalable Big Data Service Platform: Architecture & Low‑Code Orchestration
This article explains the end‑to‑end big data processing pipeline, outlines the diverse challenges of data interfaces, storage and performance, introduces the unified "Three Ones" approach, details a three‑layer service architecture, and shows how low‑code orchestration can streamline API creation and composition.
Introduction
The speaker first describes the typical big‑data processing flow: data collection → ETL (data cleaning) → data service → data visualization. OLAP‑oriented analytics and OLTP‑oriented online business data (e.g., e‑commerce orders, video‑platform logs) are both covered.
Background
Key problems identified include:
Diverse scenarios such as recommendation, marketing, reporting, dashboards, and data products.
Various interface types (API, RPC, real‑time streams, files) with different QPS and latency requirements.
Performance demands ranging from billions of QPS with millisecond latency to low‑throughput reporting.
Multiple storage options (HBase, Redis, MySQL, Doris, Hive, etc.).
Different implementation languages and access methods (Java, C++, Go, client libraries, SQL).
Metric definitions that vary across business contexts.
Solution – The “Three Ones”
To address the above, the platform adopts three unification principles:
OneAPI: a unified data‑service interface that abstracts away differing QPS, latency, and transport protocols (HTTP, RPC, file transfer).
OneSQL: a single language layer that can parse and access multiple storage back‑ends.
OneModel: a unified data model that supports heterogeneous data sources.
Evolution Roadmap
The platform evolves through three stages (illustrated in the accompanying diagram), moving from isolated data pipelines to a fully service‑oriented architecture.
Core Architecture Design
The architecture is divided into three layers:
Data Application Access Layer: external applications connect via HTTP, RPC, client, stream, or file services.
Data Service Parsing Layer: built on Apache Calcite, it provides SQL parsing, validation, routing, optimization, execution, diagnostics, and rate limiting.
Data Storage Layer: abstracts storage engines such as MySQL, Redis, Hive, and HBase, exposing them through unified APIs.
Platform governance features (permission management, monitoring, rate limiting, metadata management, service orchestration) are handled alongside these layers.
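To make the layering concrete, here is a minimal sketch of how a parsing layer might route a request to a storage adapter behind a common interface. All names here (StorageAdapter, InMemoryAdapter, ParsingLayer) are hypothetical and chosen for illustration. The talk does not describe the platform's actual classes, and the real parsing layer is built on Apache Calcite rather than a plain lookup.

```python
from __future__ import annotations

from typing import Any, Protocol


class StorageAdapter(Protocol):
    """Hypothetical interface that every wrapped storage engine exposes."""

    def query(self, statement: str, params: dict[str, Any]) -> list[dict[str, Any]]: ...


class InMemoryAdapter:
    """Stand-in for a real engine (MySQL, HBase, ...); it returns canned rows."""

    def __init__(self, name: str, rows: list[dict[str, Any]]) -> None:
        self.name = name
        self.rows = rows

    def query(self, statement: str, params: dict[str, Any]) -> list[dict[str, Any]]:
        # A real adapter would translate `statement` for its engine and run it.
        return [dict(row, engine=self.name) for row in self.rows]


class ParsingLayer:
    """Routes a request to the adapter registered for the target engine."""

    def __init__(self) -> None:
        self._adapters: dict[str, StorageAdapter] = {}

    def register(self, engine: str, adapter: StorageAdapter) -> None:
        self._adapters[engine] = adapter

    def execute(self, engine: str, statement: str, params: dict[str, Any]) -> list[dict[str, Any]]:
        if engine not in self._adapters:
            raise KeyError(f"no adapter registered for engine {engine!r}")
        return self._adapters[engine].query(statement, params)


layer = ParsingLayer()
layer.register("mysql", InMemoryAdapter("mysql", [{"shop_id": 1}]))
rows = layer.execute("mysql", "select shop_id from shops", {})
```

The access layer only ever talks to ParsingLayer.execute, so adding a new engine in this sketch means registering one more adapter rather than changing callers.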
Data Service Production Workflow
Select a storage engine (e.g., HBase, MySQL).
Configure the query SQL and parameters.
Convert the SQL into an executable form for the chosen engine.
Generate an atomic API service.
-- Total sales (GMV) per shop for a given day
select shop_id, sum(gmv) as total_gmv
from (
select * from table where dt=#{dt}
) t
group by shop_id;
Once generated, the API can be invoked directly or composed with other atomic services. Manual composition quickly becomes costly, however, which motivates a low‑code approach.
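The production workflow above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the platform's implementation: it uses Python's built-in sqlite3 as the storage engine, a hypothetical sales table in place of the article's placeholder table name, and a hypothetical make_atomic_api helper that rewrites #{name} placeholders into bound parameters before exposing the query as a callable.

```python
import re
import sqlite3
from typing import Any, Callable

# Matches template placeholders such as #{dt}.
PLACEHOLDER = re.compile(r"#\{(\w+)\}")


def make_atomic_api(conn: sqlite3.Connection, template: str) -> Callable[..., list]:
    """Turn a SQL template with #{name} placeholders into a callable service."""
    required = set(PLACEHOLDER.findall(template))
    # Rewrite #{dt} as sqlite3's named-parameter form :dt so values are
    # bound by the driver instead of being spliced into the SQL text.
    sql = PLACEHOLDER.sub(r":\1", template)

    def api(**params: Any) -> list:
        missing = required - set(params)
        if missing:
            raise ValueError(f"missing parameters: {sorted(missing)}")
        cursor = conn.execute(sql, params)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]

    return api


# Hypothetical sample data standing in for a real storage engine.
conn = sqlite3.connect(":memory:")
conn.execute("create table sales (shop_id integer, gmv real, dt text)")
conn.executemany(
    "insert into sales values (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (1, 5.0, "2024-01-01"),
     (2, 7.0, "2024-01-01"), (2, 99.0, "2024-01-02")],
)

daily_gmv_by_shop = make_atomic_api(
    conn,
    """
    select shop_id, sum(gmv) as total_gmv
    from (select * from sales where dt = #{dt}) t
    group by shop_id
    order by shop_id
    """,
)

result = daily_gmv_by_shop(dt="2024-01-01")
```

Binding parameters rather than substituting text is a design choice of this sketch. It keeps caller-supplied values out of the SQL string, which matters once the generated API is exposed to other teams.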
Low‑Code Service Orchestration
Three orchestration patterns are demonstrated:
Serial composition: the output of an order service is transformed and fed into a product service (illustrated in the diagram).
Parallel composition: multiple independent services are called concurrently before a downstream service proceeds.
Conditional logic: runtime decisions determine which service branch to execute.
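The three patterns can be expressed as small combinators over services. This is a rough sketch that assumes each atomic service is a function from a dict to a dict. The combinator names (serial, parallel, conditional) and the order and product services are hypothetical; the platform's own orchestration is configured through a low-code interface rather than written by hand.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

Service = Callable[[Dict[str, Any]], Dict[str, Any]]


def serial(*steps: Service) -> Service:
    """Feed each step's output into the next step."""
    def run(payload: Dict[str, Any]) -> Dict[str, Any]:
        for step in steps:
            payload = step(payload)
        return payload
    return run


def parallel(**branches: Service) -> Service:
    """Call independent services concurrently; collect results by branch name."""
    def run(payload: Dict[str, Any]) -> Dict[str, Any]:
        with ThreadPoolExecutor(max_workers=max(1, len(branches))) as pool:
            futures = {name: pool.submit(svc, payload) for name, svc in branches.items()}
            # result() blocks, so the caller proceeds only after every branch finishes.
            return {name: future.result() for name, future in futures.items()}
    return run


def conditional(predicate: Callable[[Dict[str, Any]], bool],
                if_true: Service, if_false: Service) -> Service:
    """Pick a branch at call time based on the payload."""
    def run(payload: Dict[str, Any]) -> Dict[str, Any]:
        return if_true(payload) if predicate(payload) else if_false(payload)
    return run


# Hypothetical atomic services used only for this illustration.
def order_service(req):
    return {"order_id": req["order_id"], "product_id": 42, "amount": 120}

def to_product_request(order):
    return {"product_id": order["product_id"]}

def product_service(req):
    return {"product_id": req["product_id"], "name": "demo product"}

order_to_product = serial(order_service, to_product_request, product_service)
order_and_product = parallel(order=order_service, product=product_service)
by_amount = conditional(
    lambda order: order["amount"] >= 100,
    lambda order: {"tier": "large"},
    lambda order: {"tier": "small"},
)
```

Because every combinator returns another service of the same shape, composed services can themselves be composed, which is the property that makes drag-and-drop orchestration practical.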
Outlook and Summary
Future directions include integrating emerging storage solutions (e.g., IoT‑oriented databases such as TDengine), expanding the platform’s audience to analysts and algorithm engineers, supporting algorithm service orchestration, and enhancing data quality and security mechanisms.
Q&A Highlights
Low‑code platforms can satisfy most data‑service requirements, though highly complex business logic may still need custom code.
Reporting services depend on the chosen transport protocol (HTTP, RPC, etc.).
The technical stack consists of three layers: storage, Apache Calcite‑based parsing, and Spring‑Boot‑based API services.
Stability is ensured through pre‑release testing, comprehensive monitoring, and gradual (gray) rollouts.
SQL diagnostics currently rely on execution time and QPS thresholds.
Supported storages include HBase, MySQL, Redis, and ClickHouse.
API lifecycle management includes permission control and periodic cleanup of unused APIs.
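As an illustration of the threshold-based SQL diagnostics mentioned in the answers above, the sketch below flags an API whose slowest call or observed QPS exceeds configured limits. The Thresholds and diagnose names, the use of the slowest call as the latency signal, and the sample numbers are all assumptions; the talk only states that execution time and QPS thresholds are used.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    max_latency_ms: float  # slowest acceptable single execution
    max_qps: float         # highest acceptable request rate


def diagnose(latencies_ms: List[float], window_seconds: float,
             limits: Thresholds) -> List[str]:
    """Return warnings for one API over one observation window."""
    if not latencies_ms or window_seconds <= 0:
        return []
    warnings = []
    slowest = max(latencies_ms)
    qps = len(latencies_ms) / window_seconds
    if slowest > limits.max_latency_ms:
        warnings.append(f"slow query: {slowest:.0f} ms > {limits.max_latency_ms:.0f} ms")
    if qps > limits.max_qps:
        warnings.append(f"high load: {qps:.1f} QPS > {limits.max_qps:.1f} QPS")
    return warnings


# Three calls observed in a one-second window, one of them slow.
report = diagnose([10.0, 20.0, 900.0], window_seconds=1.0,
                  limits=Thresholds(max_latency_ms=500, max_qps=2))
```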
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
