Design and Architecture of Hera Data Service for Unified Data Access at Vipshop
The article details the background, architecture, core features, scheduling mechanisms, Lisp‑based query DSL, and Alluxio integration of Vipshop's self‑developed Hera data service, illustrating how it unifies multi‑engine data access, improves SLA, and accelerates large‑scale crowd computing tasks.
Data Service is a critical component of the data middle‑platform, acting as a unified entry point for applications to access the data warehouse as a single DB, providing consistent API control for data inflow and outflow to meet diverse user requirements.
Vipshop began building its own data service, Hera, in 2019; it has since grown from scratch into a platform serving more than 30 business lines with both B‑to‑B and B‑to‑C data services.
Background
Before a unified data service was in place, the warehouse suffered from low efficiency, fragmented interfaces, and inconsistent metric definitions, leading to high latency for large‑scale audience (USP/DMP) exports and to the maintenance burden of numerous engine‑specific APIs.
Architecture Design
Hera follows a classic master‑slave model with separate data and control flows to ensure high availability. It consists of three layers:
Application Access Layer – supports TCP client, HTTP, and internal RPC (OSP) interfaces.
Data Service Layer – handles routing, multi‑engine support, engine resource configuration, dynamic parameter assembly, SQLLisp engine generation, adaptive execution, unified query cache, and FreeMarker‑based SQL generation.
Data Layer – abstracts underlying storage (data warehouse, ClickHouse, MySQL, Redis) behind a single API.
Core Functions
Multi‑queue scheduling with weight‑based SLA guarantees.
Multi‑engine query support (Spark, Presto, ClickHouse, Hive, MySQL, Redis).
Support for various task types: ETL, adhoc, file export, data import.
File export pipeline that writes results to HDFS/Alluxio and allows TCP download, reducing crowd export time from >30 min to ≤3 min.
Resource isolation for core vs. non‑core workloads.
Dynamic engine‑parameter assembly and adaptive engine execution that falls back to another engine on failure.
SQL construction based on dimensional modeling (single‑table, star, snowflake).
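The adaptive-execution behavior above (run on a preferred engine, fall back to another on failure) can be sketched as follows. The `Engine` record, the engine names, and the `execute` method are illustrative stand-ins, not Hera's actual interfaces:

```java
import java.util.List;
import java.util.function.Function;

public class AdaptiveExecutor {
    // Illustrative engine abstraction: a name plus a query function that may throw.
    public record Engine(String name, Function<String, String> query) {}

    // Try each engine in priority order; on failure, fall back to the next candidate.
    public static String execute(List<Engine> engines, String sql) {
        RuntimeException last = null;
        for (Engine e : engines) {
            try {
                return e.query().apply(sql);
            } catch (RuntimeException ex) {
                last = ex; // this engine failed; try the next one
            }
        }
        throw new IllegalStateException("all engines failed", last);
    }

    public static void main(String[] args) {
        Engine presto = new Engine("presto", sql -> { throw new RuntimeException("presto down"); });
        Engine spark  = new Engine("spark",  sql -> "spark-result");
        System.out.println(execute(List.of(presto, spark), "SELECT 1")); // falls back to Spark
    }
}
```

In practice the fallback would also need to rewrite the SQL for the new engine's dialect, which is where the SQLParser described later comes in.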
Lisp‑Based Metric DSL
Hera defines a custom Lisp syntax for expressing metric calculations, abstracting engine‑specific details. Example expressions include aggregation, conditional, type‑casting, and generic function calls. The DSL is parsed with ANTLR4, traversed by a listener, and translated into engine‑specific SQL.
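The article does not publish the grammar, so the translation idea can only be illustrated with a hand-rolled sketch (Hera itself uses an ANTLR4-generated parser walked by a listener). Here a Lisp expression such as `(sum (if (> price 100) price 0))` becomes a SQL aggregate over a CASE expression; the operator mappings are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: tokenize an s-expression and translate it to SQL.
public class LispToSql {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    LispToSql(String src) {
        for (String t : src.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+"))
            tokens.add(t);
    }

    public static String translate(String expr) { return new LispToSql(expr).parse(); }

    private String parse() {
        String t = tokens.get(pos++);
        if (!t.equals("(")) return t;          // atom: column name or literal
        String op = tokens.get(pos++);
        List<String> args = new ArrayList<>();
        while (!tokens.get(pos).equals(")")) args.add(parse());
        pos++;                                  // consume ')'
        return emit(op, args);
    }

    private String emit(String op, List<String> args) {
        switch (op) {
            case "sum": case "count": case "avg": case "max": case "min":
                return op.toUpperCase() + "(" + args.get(0) + ")";   // aggregation
            case ">": case "<": case "=": case ">=": case "<=":
                return args.get(0) + " " + op + " " + args.get(1);   // comparison
            case "if":   // conditional -> CASE WHEN
                return "CASE WHEN " + args.get(0) + " THEN " + args.get(1)
                        + " ELSE " + args.get(2) + " END";
            case "cast": // type casting
                return "CAST(" + args.get(0) + " AS " + args.get(1) + ")";
            default:     // generic function call
                return op + "(" + String.join(", ", args) + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(translate("(sum (if (> price 100) price 0))"));
        // -> SUM(CASE WHEN price > 100 THEN price ELSE 0 END)
    }
}
```

A real listener-based translator would emit per-engine variants at the `emit` step, which is how the DSL keeps metric definitions independent of any one engine's SQL dialect.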
Task Scheduling
Built on Netty, the scheduler uses a single EventLoopGroup for network I/O and a separate thread pool for business logic. It implements multi‑queue, multi‑user scheduling with weight‑based scoring:
queue_dynamic_factor = queue_size / queue_capacity * (1 - running_jobs / queue_parallelism)
job_weight = 1 - (current_time - enqueue_time) / timeout
score = job_weight + queue_dynamic_factor + queue_weight

Jobs are ordered first by queue weight, then by dynamic factor, and finally by individual job weight.
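The three formulas translate directly into code. A minimal sketch, with illustrative field names and millisecond timestamps:

```java
// Sketch of the weight-based scoring formulas; parameter names mirror the
// article's formulas, everything else is illustrative.
public class SchedulerScore {
    public static double queueDynamicFactor(int queueSize, int queueCapacity,
                                            int runningJobs, int queueParallelism) {
        return (double) queueSize / queueCapacity
                * (1.0 - (double) runningJobs / queueParallelism);
    }

    // Shrinks from 1 toward 0 as the job's wait approaches its timeout.
    public static double jobWeight(long currentTimeMillis, long enqueueTimeMillis,
                                   long timeoutMillis) {
        return 1.0 - (double) (currentTimeMillis - enqueueTimeMillis) / timeoutMillis;
    }

    public static double score(double jobWeight, double queueDynamicFactor,
                               double queueWeight) {
        return jobWeight + queueDynamicFactor + queueWeight;
    }

    public static void main(String[] args) {
        // A half-full queue running half its allowed parallelism:
        double qdf = queueDynamicFactor(50, 100, 5, 10);   // 0.5 * 0.5 = 0.25
        double jw  = jobWeight(10_000, 4_000, 60_000);     // 1 - 6000/60000 = 0.9
        System.out.println(score(jw, qdf, 1.0));           // 2.15
    }
}
```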
SQL Job Flow
Clients submit raw SQL (e.g., Presto). The SQLParser rewrites it for the target engine (Spark, Presto, ClickHouse). The Master schedules the job to Workers; if the chosen engine fails, the job is retried on another engine. Results are streamed back to the client via the Worker.
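As a sketch of the SQLParser's rewriting step, the fragment below renames two functions that genuinely differ between Presto and Spark SQL. The real rule set is not described in the article, and a production rewriter would operate on a parsed AST rather than on regexes:

```java
import java.util.Map;

// Illustrative dialect rewriting: map Presto function names to their Spark
// SQL equivalents before dispatching the job to a Spark engine.
public class DialectRewriter {
    private static final Map<String, String> PRESTO_TO_SPARK = Map.of(
            "approx_distinct", "approx_count_distinct",
            "to_unixtime", "unix_timestamp");

    public static String prestoToSpark(String sql) {
        String out = sql;
        for (var e : PRESTO_TO_SPARK.entrySet())
            // \b keeps the rule from matching inside longer identifiers
            out = out.replaceAll("\\b" + e.getKey() + "\\b", e.getValue());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prestoToSpark("SELECT approx_distinct(user_id) FROM t"));
        // -> SELECT approx_count_distinct(user_id) FROM t
    }
}
```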
Metrics Collection
Both static (node info) and dynamic (runtime) metrics are gathered. Workers report heartbeat data, including memory usage, enabling the Master to make informed scheduling decisions.
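A minimal sketch of how heartbeat metrics could drive placement, assuming an illustrative `Heartbeat` payload and a 30-second liveness window (neither is specified in the article):

```java
import java.util.Comparator;
import java.util.List;

// Pick the worker with the most free memory among those seen recently.
public class WorkerSelector {
    public record Heartbeat(String workerId, long freeMemoryBytes, long lastSeenMillis) {}

    public static Heartbeat pick(List<Heartbeat> workers, long nowMillis) {
        return workers.stream()
                .filter(w -> nowMillis - w.lastSeenMillis() < 30_000) // drop stale workers
                .max(Comparator.comparingLong(Heartbeat::freeMemoryBytes))
                .orElseThrow(() -> new IllegalStateException("no live worker"));
    }

    public static void main(String[] args) {
        long now = 100_000;
        var w1 = new Heartbeat("w1", 8L << 30, now - 5_000);
        var w2 = new Heartbeat("w2", 16L << 30, now - 60_000); // stale heartbeat
        System.out.println(pick(List.of(w1, w2), now).workerId()); // w1
    }
}
```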
Alluxio Cache Synchronization
Hive tables can be mirrored to Alluxio by replacing the HDFS location with an Alluxio path. A periodic task detects new partitions in the source table and triggers a SYN2ALLUXIO job to keep the Alluxio cache up‑to‑date.
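The periodic partition check could look like the following sketch; diffing partition sets is an assumption about the implementation, not Hera's actual code, and a real check would list partitions through the Hive metastore:

```java
import java.util.Set;
import java.util.TreeSet;

// Any partition present in the Hive source table but missing from the
// Alluxio mirror is queued for a SYN2ALLUXIO job.
public class AlluxioSyncCheck {
    public static Set<String> partitionsToSync(Set<String> sourcePartitions,
                                               Set<String> alluxioPartitions) {
        Set<String> missing = new TreeSet<>(sourcePartitions);
        missing.removeAll(alluxioPartitions);  // only new/unsynced partitions remain
        return missing;
    }

    public static void main(String[] args) {
        Set<String> source = Set.of("dt=20210701", "dt=20210702", "dt=20210703");
        Set<String> mirror = Set.of("dt=20210701", "dt=20210702");
        System.out.println(partitionsToSync(source, mirror)); // [dt=20210703]
    }
}
```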
CREATE TABLE `hdfs.ads_tags_table`(
`oaid_md5` string,
`mid` string,
`user_id` bigint,
...
) PARTITIONED BY (`dt` string) LOCATION 'hdfs://xxxx/hdfs.db/ads_tags_table';

CREATE TABLE `alluxio.ads_tags_table`(
`oaid_md5` string,
`mid` string,
`user_id` bigint,
...
) PARTITIONED BY (`dt` string) LOCATION 'alluxio://zk@IP1:2181,IP2:2181/alluxio.db/ads_tags_table';

Crowd Computing Example
A Spark ETL job that reads from the HDFS table can be automatically rewritten to use the Alluxio table, achieving a 10‑30% speedup.
INSERT INTO hive_advert.cd0000127760_full
SELECT result_id, '20210703'
FROM (
SELECT oaid_md5 AS result_id
FROM hdfs.ads_tags_table AS ta
WHERE ta.dt = '20210702' AND ...
) AS t;

INSERT INTO hive_advert.cd0000127760_full
SELECT result_id, '20210703'
FROM (
SELECT oaid_md5 AS result_id
FROM alluxio.ads_tags_table AS ta
WHERE ta.dt = '20210702' AND ...
) AS t;

Conclusion
Hera now supports many production workloads, but challenges remain, such as reconciling inconsistent function signatures across engines and further improving multi‑engine compatibility.