
How a Scalable Data Service Platform Transforms Big Data into APIs

This article outlines the design and implementation of a unified data service platform that standardizes data access, accelerates model processing, provides flexible API construction, and ensures high availability through a gateway, caching, and monitoring, ultimately reducing cost and improving efficiency for both consumer-facing (C-end) and business-facing (B-end) applications.

Data Thinking Notes

Jan 3, 2023

1 Background

As the business expands, demand for data grows, and using that data efficiently becomes essential. Data service-ization, essentially data SaaS, exposes data as services such as APIs, RPC interfaces, or files. Without a shared platform, the pain points are diverse data ingestion methods, inconsistent data definitions, lack of sharing, unclear data lineage, and uneven service quality. The platform must therefore provide standardized interfaces, a universal data service gateway, manageable data links, observable services, and reusability, while remaining flexible, convenient, and low-cost.

2 Architecture Design

The platform, built atop a data warehouse, consists of a data construction layer, data query layer, service interfaces, service gateway, and application ecosystem, plus modules for data standards, security, and monitoring. The core service chain is: data construction → data query → service interface & gateway.

2.1 Data Construction

Warehouse tables (Hive) are transformed into a unified business‑oriented data model layer because raw tables cannot fully express business logic and have high latency. The construction layer serves data producers (warehouse developers) and offers model definition, model acceleration, and API construction.

2.1.1 Model Definition

Using dimensional modeling (star, snowflake, constellation, etc.), the platform supports various logical data models to meet diverse analytical needs.

Model types:

Single model: one fact table (e.g., dws or ads layer).

Star model: one fact table + multiple dimension tables.

Snowflake model: fact table + dimension tables + indirect dimension tables.

Constellation model: multiple fact tables sharing dimensions.
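The four model types differ only in how many fact tables they contain and how dimension tables hang off them. The sketch below expresses that distinction as a small data structure. The class names `DimTable` and `LogicalModel`, the function `model_type`, and the table names are hypothetical and are not taken from the platform.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DimTable:
    name: str
    # Dimensions reachable only through this dimension (indirect dimensions).
    sub_dims: List["DimTable"] = field(default_factory=list)


@dataclass
class LogicalModel:
    fact_tables: List[str]
    dims: List[DimTable] = field(default_factory=list)


def model_type(model: LogicalModel) -> str:
    """Classify a logical model by its shape."""
    if len(model.fact_tables) > 1:
        return "constellation"  # several fact tables sharing dimensions
    if not model.dims:
        return "single"         # one fact table, no joins
    if any(d.sub_dims for d in model.dims):
        return "snowflake"      # at least one indirect dimension
    return "star"               # fact table joined directly to dimensions


star = LogicalModel(["dws_orders"], [DimTable("dim_user"), DimTable("dim_city")])
print(model_type(star))
```

A registry like this would let later stages decide which joins a query needs purely from the model shape.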

2.1.2 Model Acceleration

Because Hive tables are unsuitable for online APIs, models are accelerated onto hot engines. Two acceleration strategies are offered:

Detail acceleration: Mirrors data from cold to hot engines, preserving full detail for multidimensional analysis.

Pre‑calculation acceleration: Aggregates data to the required granularity before loading into the hot engine, boosting query speed at the cost of flexibility.
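The trade-off between the two strategies can be illustrated with a few in-memory rows. This is a minimal sketch, not the platform's loader: `WAREHOUSE_ROWS`, `detail_accelerate`, and `precalc_accelerate` are hypothetical names, and plain Python lists and dicts stand in for the cold and hot engines.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]

# Stand-in for rows read from a cold warehouse table.
WAREHOUSE_ROWS: List[Row] = [
    {"dt": "2023-01-01", "city": "SH", "amount": 10},
    {"dt": "2023-01-01", "city": "BJ", "amount": 20},
    {"dt": "2023-01-02", "city": "SH", "amount": 5},
]


def detail_accelerate(rows: List[Row]) -> List[Row]:
    """Mirror every row unchanged, so any dimension can still be queried."""
    return [dict(r) for r in rows]


def precalc_accelerate(rows: List[Row], dims: Sequence[str], metric: str) -> Dict[Tuple, int]:
    """Aggregate to a fixed granularity; other dimensions are no longer available."""
    totals: Dict[Tuple, int] = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[metric]
    return dict(totals)


hot_detail = detail_accelerate(WAREHOUSE_ROWS)
hot_daily = precalc_accelerate(WAREHOUSE_ROWS, ("dt",), "amount")
print(hot_daily)
```

The pre-calculated result answers "amount per day" with a single lookup, but a question about amount per city can only be answered from the detail copy.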

Recommended engine combos per scenario:

Online: Pre‑calc + KV store.

Near‑online: Pre‑calc + TiDB/MySQL.

OLAP: Detail + ClickHouse or Iceberg.

Offline: Direct Hive access.
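The recommendations above amount to a lookup from scenario to strategy and engine. A minimal sketch of such a table follows; the names `Scenario`, `Recommendation`, and `recommend` are hypothetical, and the strategy label "none" for the offline case is an assumption used here to mean that no acceleration is applied.

```python
from enum import Enum
from typing import NamedTuple, Tuple


class Scenario(Enum):
    ONLINE = "online"
    NEAR_ONLINE = "near-online"
    OLAP = "olap"
    OFFLINE = "offline"


class Recommendation(NamedTuple):
    strategy: str
    engines: Tuple[str, ...]


# Mirrors the list of recommended combinations in the text.
RECOMMENDED = {
    Scenario.ONLINE: Recommendation("pre-calc", ("KV store",)),
    Scenario.NEAR_ONLINE: Recommendation("pre-calc", ("TiDB", "MySQL")),
    Scenario.OLAP: Recommendation("detail", ("ClickHouse", "Iceberg")),
    Scenario.OFFLINE: Recommendation("none", ("Hive",)),  # direct access, no acceleration
}


def recommend(scenario: Scenario) -> Recommendation:
    return RECOMMENDED[scenario]


print(recommend(Scenario.OLAP))
```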

2.1.3 API Construction

API Parameter Definition

Standard elements include API ID, name, method, path, request/response parameters, latency, QPS estimate, and scenario.
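One way to picture these standard elements is as a single validated record per API. The sketch below is an assumption about how such a record might look: the field names, the allowed HTTP methods, and the validation rules are hypothetical and are not specified in the article.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ApiParam:
    name: str
    type: str
    required: bool = True


@dataclass(frozen=True)
class ApiDefinition:
    api_id: str
    name: str
    method: str
    path: str
    request_params: Tuple[ApiParam, ...]
    response_params: Tuple[ApiParam, ...]
    max_latency_ms: int   # latency target
    estimated_qps: int    # can later feed rate limiting at the gateway
    scenario: str

    def __post_init__(self) -> None:
        # Hypothetical sanity checks applied at registration time.
        if self.method not in {"GET", "POST"}:
            raise ValueError(f"unsupported method: {self.method}")
        if not self.path.startswith("/"):
            raise ValueError("path must start with '/'")
        if self.max_latency_ms <= 0 or self.estimated_qps <= 0:
            raise ValueError("latency and QPS estimates must be positive")


order_api = ApiDefinition(
    api_id="api_001",
    name="daily_order_amount",
    method="GET",
    path="/orders/daily",
    request_params=(ApiParam("dt", "string"),),
    response_params=(ApiParam("amount", "number"),),
    max_latency_ms=50,
    estimated_qps=200,
    scenario="online",
)
print(order_api.path)
```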

API Data Retrieval Logic

Supports three construction methods:

Custom SQL: Uses MyBatis‑style dynamic SQL (if, foreach, where) to build queries at runtime.

select
  a.field1 as alias_1,
  a.field2 as alias_2,
  a.field3 as alias_3,
  b.field1 as alias_4
from
  fact_table a
left outer join
  table_dim b
on a.id = b.id
<where>
    a.field = ${ input_1, type = number }
    <if test='input_2 != null'>
    and b.field = ${ input_2, type = number }
    </if>
</where>

Model Construction: Visual configuration without writing SQL.

Metric‑Dimension Construction: Advanced mode that auto‑selects models based on configured parameters.
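To make the Custom SQL method more concrete, the sketch below mimics what the `<where>` and `<if>` template does at runtime: the condition on `input_1` is always present, and the condition on `input_2` is added only when a value is supplied. It is an illustration rather than the platform's renderer. SQLite and `?` bound parameters stand in for the real engine and for the `${ ... }` placeholders, and `render_query` and `demo_connection` are hypothetical helpers.

```python
import sqlite3
from typing import List, Optional, Tuple

BASE_SQL = (
    "SELECT a.field1 AS alias_1, b.field1 AS alias_4 "
    "FROM fact_table a LEFT OUTER JOIN table_dim b ON a.id = b.id"
)


def render_query(input_1: int, input_2: Optional[int] = None) -> Tuple[str, List[int]]:
    """Build the WHERE clause at call time, binding values as parameters."""
    conditions = ["a.field = ?"]        # always applied
    params = [input_1]
    if input_2 is not None:             # same role as <if test='input_2 != null'>
        conditions.append("b.field = ?")
        params.append(input_2)
    where = " AND ".join(conditions)
    return f"{BASE_SQL} WHERE {where} ORDER BY a.id", params


def demo_connection() -> sqlite3.Connection:
    """In-memory tables with made-up rows, used only for this example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE fact_table (id INTEGER, field INTEGER, field1 TEXT);
        CREATE TABLE table_dim (id INTEGER, field INTEGER, field1 TEXT);
        INSERT INTO fact_table VALUES (1, 7, 'f-a'), (2, 7, 'f-b'), (3, 8, 'f-c');
        INSERT INTO table_dim VALUES (1, 100, 'd-a'), (2, 200, 'd-b');
        """
    )
    return conn


conn = demo_connection()
sql, params = render_query(7, 200)
print(conn.execute(sql, params).fetchall())
```

Binding values as parameters instead of splicing them into the SQL string is a deliberate choice here, since API inputs come from callers.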

2.2 Data Query

The query layer sits between service interfaces and data models, handling atomic and composite calculations.

2.2.1 Atomic Calculation

Processes DSL input through scheduling, translation, and engine execution.

Scheduling: Parses DSL, matches APIs to models, splits tasks, and merges results.

Translation: Converts sub‑task specifications into engine‑specific SQL via a two‑layer AST.

Engine Adapter: Supports KV, TiDB, MySQL, ClickHouse, Iceberg with connection pooling and timeout handling.
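The translation step can be sketched as a function from an engine-neutral query description to engine-specific SQL. The example below is a simplified assumption, not the platform's two-layer AST: `Filter`, `QuerySpec`, and `translate` are hypothetical names, the only dialect difference modeled is identifier quoting, and `?` is used as a neutral parameter placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Identifier quoting per dialect (TiDB is treated as MySQL-compatible).
QUOTE = {"mysql": "`", "tidb": "`", "clickhouse": '"'}
ALLOWED_OPS = {"=", "!=", ">", ">=", "<", "<="}
ALLOWED_AGGS = {"sum", "count", "min", "max", "avg"}


@dataclass
class Filter:
    field: str
    op: str
    value: object


@dataclass
class QuerySpec:
    table: str
    metrics: List[Tuple[str, str]]          # (aggregate, column)
    dims: List[str] = field(default_factory=list)
    filters: List[Filter] = field(default_factory=list)


def translate(spec: QuerySpec, dialect: str) -> Tuple[str, list]:
    """Turn an engine-neutral spec into SQL text plus bound parameters."""
    q = QUOTE[dialect]

    def ident(name: str) -> str:
        if not name.isidentifier():         # reject anything that is not a plain name
            raise ValueError(f"invalid identifier: {name!r}")
        return f"{q}{name}{q}"

    select = [ident(d) for d in spec.dims]
    for agg, col in spec.metrics:
        if agg not in ALLOWED_AGGS:
            raise ValueError(f"unsupported aggregate: {agg}")
        select.append(f"{agg.upper()}({ident(col)}) AS {ident(agg + '_' + col)}")
    sql = f"SELECT {', '.join(select)} FROM {ident(spec.table)}"

    params: list = []
    if spec.filters:
        clauses = []
        for f in spec.filters:
            if f.op not in ALLOWED_OPS:
                raise ValueError(f"unsupported operator: {f.op}")
            clauses.append(f"{ident(f.field)} {f.op} ?")
            params.append(f.value)
        sql += " WHERE " + " AND ".join(clauses)
    if spec.dims:
        sql += " GROUP BY " + ", ".join(ident(d) for d in spec.dims)
    return sql, params


spec = QuerySpec("orders", [("sum", "amount")], ["city"], [Filter("dt", ">=", "2023-01-01")])
print(translate(spec, "mysql")[0])
print(translate(spec, "clickhouse")[0])
```

Keeping the spec free of engine details is what would let the scheduler split one request into sub-tasks and send each to a different engine.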

2.2.2 Composite Calculation

Performs secondary processing on atomic results for trends, ratios, funnels, and statistical analyses, supporting custom functions.
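Two of the composite calculations mentioned above, period-over-period change and funnel conversion, can be sketched as plain functions over atomic results. The function names and the sample numbers are hypothetical; the platform's actual function library is not described in the article.

```python
from typing import List, Optional


def period_over_period(series: List[float]) -> List[Optional[float]]:
    """Relative change versus the previous period; None where undefined."""
    changes: List[Optional[float]] = [None]   # the first period has no predecessor
    for prev, cur in zip(series, series[1:]):
        changes.append(None if prev == 0 else (cur - prev) / prev)
    return changes


def funnel_conversion(step_counts: List[int]) -> List[float]:
    """Conversion rate from each funnel step to the next one."""
    return [
        cur / prev if prev else 0.0
        for prev, cur in zip(step_counts, step_counts[1:])
    ]


daily_orders = [100, 120, 90]           # atomic result: one value per day
funnel_users = [1000, 400, 100]         # atomic result: users reaching each step
print(period_over_period(daily_orders))
print(funnel_conversion(funnel_users))
```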

2.3 Service Gateway and Interface

2.3.1 Service Gateway

Provides unified entry with authentication (appKey/secret), rate limiting based on QPS estimates, and monitoring of request volume, success rate, failures, and security metrics.
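The gateway's first two duties, checking the appKey/secret pair and limiting each caller to its estimated QPS, can be sketched as follows. This is a minimal illustration under stated assumptions: the token bucket is one common rate-limiting technique and is not confirmed as the platform's choice, and `TokenBucket`, `Gateway`, and the result strings are hypothetical.

```python
import hmac
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows roughly `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._now = now
        self._last = now()

    def allow(self) -> bool:
        current = self._now()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (current - self._last) * self.rate)
        self._last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Gateway:
    def __init__(self, secrets: Dict[str, str], qps_limits: Dict[str, float],
                 now: Callable[[], float] = time.monotonic):
        self._secrets = secrets
        self._buckets = {k: TokenBucket(q, q, now) for k, q in qps_limits.items()}

    def check(self, app_key: str, secret: str) -> str:
        expected = self._secrets.get(app_key)
        # Constant-time comparison avoids leaking how much of the secret matched.
        if expected is None or not hmac.compare_digest(expected.encode(), secret.encode()):
            return "unauthorized"
        bucket = self._buckets.get(app_key)
        if bucket is not None and not bucket.allow():
            return "rate_limited"
        return "ok"


clock = [0.0]                            # fake clock so the example is deterministic
gateway = Gateway({"app": "s3cret"}, {"app": 2}, now=lambda: clock[0])
print([gateway.check("app", "s3cret") for _ in range(3)])
```

Counting the three outcomes per appKey would also yield the request volume, success rate, and failure metrics that the gateway monitors.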

2.3.2 Service Interface

Supports synchronous (fast, small results) and asynchronous (large, offline) queries. Interface types include DSL, template, and raw SQL.

DSL Interface: Describes data needs in a domain‑specific language.

message OpenApiReq {
    OsHeader osHeader = 1;
    repeated OperatorVo filters = 2;
    repeated string metrics = 3;
    repeated string dims = 4;
    repeated string orderFields = 5;
    PageVo pageVo = 6;
    repeated OperatorVo metricFilters = 7;
}

Template Query Interface: Fixed computation logic with variable parameters.

message SqlQueryReq {
    OsHeader osHeader = 1;
    repeated OperatorVo filters = 2;
}

SQL Interface: Allows users to submit raw SQL for execution.

message AsyncSqlQueryReq {
    string appKey = 1;
    string secret = 2;
    string engine = 3;
    string sql = 4;
}
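The difference between synchronous and asynchronous queries can be sketched as a submit-then-poll pattern: the caller receives a job ID immediately and fetches the result later. The class `AsyncQueryService`, its methods, and the status strings are hypothetical, and a thread pool stands in for whatever offline execution the platform actually uses.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class AsyncQueryService:
    def __init__(self, run_query: Callable[[str], list], max_workers: int = 2):
        self._run_query = run_query
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, sql: str) -> str:
        """Start the query in the background and return a job ID at once."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._run_query, sql)
        return job_id

    def status(self, job_id: str) -> str:
        future = self._jobs[job_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() is not None else "done"

    def result(self, job_id: str, timeout: Optional[float] = None) -> list:
        """Block until the job finishes, then return its rows."""
        return self._jobs[job_id].result(timeout=timeout)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def fake_engine(sql: str) -> list:
    # Stand-in for a slow offline query; it only echoes the statement.
    return [("executed", sql)]


service = AsyncQueryService(fake_engine)
job = service.submit("select 1")
print(service.result(job, timeout=5), service.status(job))
service.close()
```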

3 General Solutions

3.1 Unified Metrics and Traceability

Standardizes metric definitions and model bindings, automates model export, and links API logic to metric definitions for consistent end‑to‑end data flow.

3.2 Cost Reduction and Efficiency

By extracting common services, the platform reduces duplicate development, shortens the path from data definition to consumption, and cuts API creation time from five days to under one day, achieving roughly 18% cost savings.

3.3 High Availability

Implements service isolation, multi‑region active‑active deployment, and two‑level caching (local and distributed) with version management to ensure resilience and fast response.
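A two-level cache with version management might look like the sketch below. It makes several assumptions that the article does not confirm: a plain dict stands in for the distributed cache, the data version is embedded in the cache key so that a new version makes old entries unreachable, and the local level uses LRU eviction. `TwoLevelCache` and its methods are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable, Dict


class TwoLevelCache:
    def __init__(self, remote: Dict[str, object], loader: Callable[[str], object],
                 local_capacity: int = 128):
        self.local: "OrderedDict[str, object]" = OrderedDict()  # level 1: in-process LRU
        self.remote = remote            # level 2: stand-in for a distributed cache
        self.loader = loader            # falls through to the query layer
        self.local_capacity = local_capacity
        self.version = 1

    def _versioned(self, key: str) -> str:
        return f"v{self.version}:{key}"

    def get(self, key: str) -> object:
        vkey = self._versioned(key)
        if vkey in self.local:
            self.local.move_to_end(vkey)        # mark as recently used
            return self.local[vkey]
        if vkey in self.remote:
            value = self.remote[vkey]
        else:
            value = self.loader(key)
            self.remote[vkey] = value
        self.local[vkey] = value
        while len(self.local) > self.local_capacity:
            self.local.popitem(last=False)      # evict the least recently used entry
        return value

    def bump_version(self) -> None:
        """Call after the underlying data is refreshed; old keys are never read again."""
        self.version += 1
        self.local.clear()


loads = []


def load_from_engine(key: str) -> str:
    loads.append(key)
    return key.upper()


shared = {}
cache = TwoLevelCache(shared, load_from_engine, local_capacity=2)
print(cache.get("a"), cache.get("a"), loads)
```

In a real deployment the old versioned entries in the distributed cache would also need an expiry so that they do not accumulate.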

4 Implementation Results

After about a year of development, the platform hosts over 500 APIs with daily QPS in the hundreds of thousands, supports major Bilibili events, and has reduced API creation time from five days to one day while cutting production costs by roughly 18%.

5 Future Plans

Focus areas include improving service stability and robustness, adding intelligent automation to reduce manual metadata registration, and establishing long‑term governance to monitor API health, usage, and reliability.


