How a Scalable Data Service Platform Transforms Big Data into APIs
This article outlines the design and implementation of a unified data service platform that standardizes data access, accelerates model processing, provides flexible API construction, and ensures high availability through a service gateway, caching, and monitoring. The result is lower cost and higher efficiency for both consumer‑facing (C‑end) and business‑facing (B‑end) applications.
Background
As the business expands, demand for data grows, and using data efficiently becomes essential. Data servitization (essentially data SaaS) turns data into services delivered as APIs, RPC, or files. The main pain points are diverse data ingestion methods, inconsistent data definitions, lack of sharing, unclear data lineage, and uneven service quality. The platform therefore needs to provide standardized interfaces, a universal data service gateway, manageable data links, observable services, and reusability, while remaining flexible, convenient, and low‑cost.
Architecture Design
The platform, built atop a data warehouse, consists of a data construction layer, data query layer, service interfaces, service gateway, and application ecosystem, plus modules for data standards, security, and monitoring. The core service chain is: data construction → data query → service interface & gateway.
2.1 Data Construction
Warehouse tables (Hive) are transformed into a unified business‑oriented data model layer because raw tables cannot fully express business logic and have high latency. The construction layer serves data producers (warehouse developers) and offers model definition, model acceleration, and API construction.
2.1.1 Model Definition
Using dimensional modeling (star, snowflake, constellation, etc.), the platform supports various logical data models to meet diverse analytical needs.
Model types:
Single model: one fact table (e.g., dws or ads layer).
Star model: one fact table + multiple dimension tables.
Snowflake model: fact table + dimension tables + indirect dimension tables.
Constellation model: multiple fact tables sharing dimensions.
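The four model types can be told apart from the shape of the join graph alone. The following is a minimal sketch of that idea, not the platform's actual metadata format: the `Join` and `LogicalModel` classes and all table names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Join:
    """One join edge: left_table.left_key = right_table.right_key."""
    left_table: str
    left_key: str
    right_table: str
    right_key: str


@dataclass
class LogicalModel:
    """Hypothetical logical model: fact tables plus the joins that attach dimensions."""
    name: str
    fact_tables: list[str]
    joins: list[Join] = field(default_factory=list)

    def model_type(self) -> str:
        if len(self.fact_tables) > 1:
            return "constellation"
        if not self.joins:
            return "single"
        fact = self.fact_tables[0]
        # A join that starts from a dimension table rather than the fact table
        # reaches an indirect dimension, which makes the model a snowflake.
        if any(j.left_table != fact for j in self.joins):
            return "snowflake"
        return "star"


star = LogicalModel(
    "orders_by_user",
    ["fact_orders"],
    [Join("fact_orders", "user_id", "dim_user", "id")],
)
print(star.model_type())
```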
2.1.2 Model Acceleration
Because Hive tables are unsuitable for online APIs, models are accelerated onto hot engines. Two acceleration strategies are offered:
Detail acceleration: Mirrors data from cold to hot engines, preserving full detail for multidimensional analysis.
Pre‑calculation acceleration: Aggregates data to the required granularity before loading into the hot engine, boosting query speed at the cost of flexibility.
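The trade-off between the two strategies can be illustrated with a small sketch. It assumes a simple sum aggregation over in-memory rows; the row layout and the `precalculate` helper are made up for illustration and do not describe the platform's loading jobs.

```python
from collections import defaultdict


def precalculate(detail_rows, dims, metric):
    """Aggregate detail rows to the granularity given by `dims`, summing `metric`."""
    totals = defaultdict(int)
    for row in detail_rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[metric]
    return dict(totals)


# Detail acceleration would load these rows into the hot engine unchanged.
detail = [
    {"day": "2024-01-01", "city": "A", "device": "ios", "views": 3},
    {"day": "2024-01-01", "city": "A", "device": "android", "views": 5},
    {"day": "2024-01-01", "city": "B", "device": "ios", "views": 2},
]

# Pre-calculation acceleration loads only the aggregated result.
by_day_city = precalculate(detail, ["day", "city"], "views")
print(by_day_city)
```

Once only `by_day_city` is stored, a query by `device` can no longer be answered, which is the flexibility cost mentioned above.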
Recommended engine combos per scenario:
Online: Pre‑calc + KV store.
Near‑online: Pre‑calc + TiDB/MySQL.
OLAP: Detail + ClickHouse or Iceberg.
Offline: Direct Hive access.
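These recommendations amount to a lookup from scenario to strategy and candidate engines. A minimal sketch, assuming the scenario names are used as keys (the `RECOMMENDED` table and `recommend` function are illustrative, not part of the platform):

```python
# Scenario -> (acceleration strategy, candidate engines), taken from the list above.
RECOMMENDED = {
    "online": ("pre-calc", ["KV"]),
    "near-online": ("pre-calc", ["TiDB", "MySQL"]),
    "olap": ("detail", ["ClickHouse", "Iceberg"]),
    "offline": ("none", ["Hive"]),  # queried directly, no acceleration
}


def recommend(scenario):
    """Return the recommended (strategy, engines) pair for a scenario name."""
    try:
        return RECOMMENDED[scenario.lower()]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None


print(recommend("OLAP"))
```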
2.1.3 API Construction
API Parameter Definition
Standard elements include API ID, name, method, path, request/response parameters, latency, QPS estimate, and scenario.
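As a rough sketch, these elements could be captured in a single record that is validated at registration time. The field names, the allowed methods, and the validation rules below are assumptions for illustration; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class ApiDefinition:
    """Hypothetical registration record covering the standard elements."""
    api_id: str
    name: str
    method: str
    path: str
    request_params: dict   # parameter name -> type name
    response_params: dict  # field name -> type name
    expected_latency_ms: int
    estimated_qps: int
    scenario: str

    def validate(self):
        """Return a list of problems; an empty list means the definition is usable."""
        problems = []
        if self.method.upper() not in {"GET", "POST"}:  # assumed allowed methods
            problems.append("method must be GET or POST")
        if not self.path.startswith("/"):
            problems.append("path must start with '/'")
        if self.estimated_qps <= 0:
            problems.append("estimated_qps must be positive")
        if self.expected_latency_ms <= 0:
            problems.append("expected_latency_ms must be positive")
        return problems


api = ApiDefinition(
    api_id="api_001", name="daily views", method="GET", path="/views/daily",
    request_params={"day": "string"}, response_params={"views": "number"},
    expected_latency_ms=50, estimated_qps=200, scenario="online",
)
print(api.validate())
```

The QPS estimate recorded here is the same figure the gateway later uses for rate limiting.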
API Data Retrieval Logic
Supports three construction methods:
Custom SQL: Uses MyBatis‑style dynamic SQL (if, foreach, where) to build queries at runtime.
select
    a.field1 as alias_1,
    a.field2 as alias_2,
    a.field3 as alias_3,
    b.field1 as alias_4
from
    fact_table a
left outer join
    table_dim b
    on a.id = b.id
<where>
    a.field = ${ input_1,type = number }
    <if test = 'input_2 != null'>
        and b.field = ${ input_2,type = number }
    </if>
</where>
Model Construction: Visual configuration without writing SQL.
Metric‑Dimension Construction: Advanced mode that auto‑selects models based on configured parameters.
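To make the Custom SQL method concrete, the sketch below imitates what the `<where>` and `<if>` tags do at runtime: optional conditions are dropped when their parameter is absent, and the clause disappears entirely when nothing remains. It is an approximation, not the platform's template engine. The `build_where` function is hypothetical, and binding values through `?` placeholders is an assumption about how `${ ... }` parameters are substituted.

```python
def build_where(conditions, params):
    """Build a WHERE clause from (fragment, param_name, required) triples.

    Returns (sql, bound_values). Optional conditions are skipped when their
    parameter is missing, like an <if test='x != null'> that evaluates to false.
    """
    clauses, bound = [], []
    for fragment, name, required in conditions:
        value = params.get(name)
        if value is None:
            if required:
                raise ValueError(f"missing required parameter: {name}")
            continue
        # Mirrors the 'type = number' declaration in the template.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"parameter {name} must be a number")
        clauses.append(fragment)
        bound.append(value)
    if not clauses:
        return "", []
    return "WHERE " + " AND ".join(clauses), bound


conditions = [
    ("a.field = ?", "input_1", True),
    ("b.field = ?", "input_2", False),
]
print(build_where(conditions, {"input_1": 7}))
print(build_where(conditions, {"input_1": 7, "input_2": 3}))
```

Keeping values out of the SQL text and passing them as bound parameters avoids injection through request parameters.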
2.2 Data Query
The query layer sits between service interfaces and data models, handling atomic and composite calculations.
2.2.1 Atomic Calculation
Processes DSL input through scheduling, translation, and engine execution.
Scheduling: Parses DSL, matches APIs to models, splits tasks, and merges results.
Translation: Converts sub‑task specifications into engine‑specific SQL via a two‑layer AST.
Engine Adapter: Supports KV, TiDB, MySQL, ClickHouse, Iceberg with connection pooling and timeout handling.
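The translation step can be pictured as turning a sub-task's metrics, dimensions, and filters into a SQL string plus bound values. The sketch below is deliberately simplified: it emits generic SQL directly rather than going through the two-layer AST, aggregates every metric with `SUM`, and the `translate` function and its arguments are hypothetical.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_OPS = {"=", "!=", ">", ">=", "<", "<="}


def translate(table, metrics, dims, filters=(), limit=None):
    """Translate one sub-task into (sql, params).

    filters is a sequence of (field, operator, value) triples.
    """
    # Identifiers are interpolated into the SQL text, so restrict them to
    # plain names; values always travel as bound parameters.
    for name in [table, *metrics, *dims, *(f for f, _, _ in filters)]:
        if not _IDENT.match(name):
            raise ValueError(f"invalid identifier: {name!r}")
    for _, op, _ in filters:
        if op not in _OPS:
            raise ValueError(f"unsupported operator: {op!r}")

    select_items = list(dims) + [f"SUM({m}) AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    params = [value for _, _, value in filters]
    if filters:
        sql += " WHERE " + " AND ".join(f"{f} {op} ?" for f, op, _ in filters)
    if dims:
        sql += " GROUP BY " + ", ".join(dims)
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql, params


print(translate("ads_views", ["views"], ["day"], [("city", "=", "A")], limit=10))
```

A real engine adapter would then pick the dialect-specific form of this statement and run it through a pooled connection with a timeout.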
2.2.2 Composite Calculation
Performs secondary processing on atomic results for trends, ratios, funnels, and statistical analyses, supporting custom functions.
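Two of these secondary calculations, a period-over-period ratio and funnel step conversion, can be sketched as plain functions over atomic results. The function names and input shapes are illustrative assumptions, not the platform's built-in functions.

```python
def period_over_period(current, previous):
    """Relative change between two periods; None when the base period is zero."""
    if previous == 0:
        return None
    return (current - previous) / previous


def funnel_conversion(step_counts):
    """Conversion rate of each step relative to the step before it.

    step_counts is an ordered list of (step_name, count) pairs.
    """
    rates = []
    for (_, prev), (name, cur) in zip(step_counts, step_counts[1:]):
        rates.append((name, cur / prev if prev else None))
    return rates


print(period_over_period(120, 100))
print(funnel_conversion([("view", 1000), ("click", 250), ("pay", 50)]))
```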
2.3 Service Gateway and Interface
2.3.1 Service Gateway
Provides unified entry with authentication (appKey/secret), rate limiting based on QPS estimates, and monitoring of request volume, success rate, failures, and security metrics.
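The authentication and rate-limiting checks can be sketched as follows. This is one possible approach, assuming a token bucket sized from each application's QPS estimate; the source does not say which limiting algorithm the gateway uses, and the `Gateway` and `TokenBucket` classes are hypothetical. The clock is passed in explicitly so the behavior is easy to test.

```python
import hmac


class TokenBucket:
    """Token bucket that refills at `qps` tokens per second, up to `qps` tokens."""

    def __init__(self, qps, now):
        self.rate = float(qps)
        self.capacity = float(qps)
        self.tokens = float(qps)
        self.last = now

    def allow(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gateway:
    def __init__(self, apps):
        # apps: appKey -> (secret, estimated_qps)
        self.apps = apps
        self.buckets = {}

    def check(self, app_key, secret, now):
        entry = self.apps.get(app_key)
        # Constant-time comparison avoids leaking the secret through timing.
        if entry is None or not hmac.compare_digest(entry[0].encode(), secret.encode()):
            return "unauthorized"
        bucket = self.buckets.setdefault(app_key, TokenBucket(entry[1], now))
        return "ok" if bucket.allow(now) else "rate_limited"


gw = Gateway({"app1": ("s3cret", 2)})
print([gw.check("app1", "s3cret", 0.0) for _ in range(3)])
```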
2.3.2 Service Interface
Supports synchronous (fast, small results) and asynchronous (large, offline) queries. Interface types include DSL, template, and raw SQL.
DSL Interface: Describes data needs in a domain‑specific language.
message OpenApiReq {
    OsHeader osHeader = 1;
    repeated OperatorVo filters = 2;
    repeated string metrics = 3;
    repeated string dims = 4;
    repeated string orderFields = 5;
    PageVo pageVo = 6;
    repeated OperatorVo metricFilters = 7;
}
Template Query Interface: Fixed computation logic with variable parameters.
message SqlQueryReq {
    OsHeader osHeader = 1;
    repeated OperatorVo filters = 2;
}
SQL Interface: Allows users to submit raw SQL for execution.
message AsyncSqlQueryReq {
    string appKey = 1;
    string secret = 2;
    string engine = 3;
    string sql = 4;
}
General Solutions
3.1 Unified Metrics and Traceability
Standardizes metric definitions and model bindings, automates model export, and links API logic to metric definitions for consistent end‑to‑end data flow.
3.2 Cost Reduction and Efficiency
By extracting common services, the platform reduces duplicate development, shortens data lifecycle from definition to consumption, and cuts API creation time from five days to under one day, achieving roughly 18% cost savings.
3.3 High Availability
Implements service isolation, multi‑region active‑active deployment, and two‑level caching (local and distributed) with version management to ensure resilience and fast response.
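A two-level cache with versioning can be sketched as below. It assumes that "version management" means prefixing cache keys with a data version so that a version bump makes old entries unreachable; the source does not describe the mechanism, and a plain dict stands in for the distributed cache.

```python
class TwoLevelCache:
    """Local cache in front of a shared cache, with versioned keys."""

    def __init__(self, distributed, loader):
        self.local = {}
        self.distributed = distributed  # dict standing in for a shared cache
        self.loader = loader            # called on a miss in both levels
        self.version = 1

    def _key(self, key):
        return f"v{self.version}:{key}"

    def get(self, key):
        k = self._key(key)
        if k in self.local:
            return self.local[k]
        if k in self.distributed:
            value = self.distributed[k]
        else:
            value = self.loader(key)
            self.distributed[k] = value
        self.local[k] = value
        return value

    def bump_version(self):
        """Switch to a new data version; older entries are no longer looked up."""
        self.version += 1
        self.local.clear()


shared = {}
cache = TwoLevelCache(shared, loader=str.upper)
print(cache.get("a"), sorted(shared))
```

Entries written under an old version stay in the shared cache until they expire or are evicted, so a real deployment would pair this with a TTL.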
Implementation Results
After about a year of development, the platform hosts over 500 APIs with daily QPS in the hundreds of thousands, supports major Bilibili events, and has reduced API creation time from five days to one day while cutting production costs by roughly 18%.
Future Plans
Focus areas include improving service stability and robustness, adding intelligent automation to reduce manual metadata registration, and establishing long‑term governance to monitor API health, usage, and reliability.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
