How Hera Transforms Vipshop’s Data Service for Scalable E‑Commerce Analytics
This article explains how Vipshop has built the Hera data service since 2019 to provide a unified API for warehouse access. It covers the background, the architecture, core features such as multi‑engine queries, adaptive execution, a custom Lisp syntax, task scheduling, and metrics collection, and the performance gains for both B‑to‑B and B‑to‑C workloads.
Data services are a key component of a data middle platform and act as the unified entry point for data warehouse access. They expose the warehouse through an API as if it were a single database, handling both data inflow and outflow to meet diverse access needs.
Vipshop’s e‑commerce platform began building its own data service, Hera, in 2019. Starting from scratch, it now serves over 30 business lines with both B‑to‑B and B‑to‑C data services.
Background
Before a unified data service, warehouse access suffered from low efficiency and inconsistent metrics. Notable problems included:
Advertising audience products (USP, DMP) had to stream massive data volumes out of HiveServer, which caused severe latency when resources were constrained.
Each table needed a separate interface for each query engine (Presto, ClickHouse), causing an explosion of interfaces and heavy maintenance overhead.
Common metrics (sales, orders, PV, UV) were duplicated across data products with inconsistent definitions, making data quality verification difficult.
The new data service addresses these issues by abstracting storage and compute engines, offering a single API, layered data storage, engine‑agnostic SQL generation, adaptive execution, unified caching, and data registration with authorization.
Architecture Design
Hera follows a classic master‑slave model with separate data and control paths for high availability. It consists of three layers:
Application Access Layer – supports TCP client, HTTP, and internal RPC (OSP) interfaces.
Data Service Layer – handles routing, multi‑engine support, resource configuration, dynamic engine parameter assembly, SQLLisp engine generation, adaptive SQL execution, unified query cache, and FreeMarker‑based SQL generation.
Data Layer – provides a unified API for data stored in the warehouse, ClickHouse, MySQL, Redis, and other stores.
Main Functions
Multi‑Queue Scheduling Strategy
Tasks are assigned to queues based on user, task type, and weight, ensuring SLA for different workloads.
Multi‑Engine Query
Supports Spark, Presto, ClickHouse, Hive, MySQL, Redis, selecting the best engine per scenario.
Multiple Task Types
Handles ETL, adhoc, file export, and data import, enabling combinations like Spark‑adhoc and Presto‑adhoc.
File Export
Facilitates large‑scale data export for downstream analysis such as coupon distribution.
Resource Isolation
Separates core and non‑core workloads at both worker and engine resource levels.
Dynamic Engine Parameter Assembly
Automatically assembles and adjusts engine parameters per task, engine type, account, or business scenario.
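One way to picture this assembly is as a layered merge, where more specific layers override more general ones. The sketch below is only an illustration under that assumption: the layer order, the table names (ENGINE_DEFAULTS, TASK_TYPE_OVERRIDES, ACCOUNT_OVERRIDES), the account name, and the chosen values are hypothetical and not taken from Hera.

```python
from typing import Dict, Optional, Tuple

# Hypothetical configuration tables; keys and values are examples only.
ENGINE_DEFAULTS: Dict[str, Dict[str, str]] = {
    "spark": {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"},
    "presto": {"query_max_run_time": "5m"},
}
TASK_TYPE_OVERRIDES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("spark", "etl"): {"spark.executor.memory": "8g"},
}
ACCOUNT_OVERRIDES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("spark", "audience_team"): {"spark.sql.shuffle.partitions": "400"},
}


def assemble_params(
    engine: str,
    task_type: str,
    account: str,
    task_params: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Merge parameter layers from least to most specific; later layers win."""
    params = dict(ENGINE_DEFAULTS.get(engine, {}))
    params.update(TASK_TYPE_OVERRIDES.get((engine, task_type), {}))
    params.update(ACCOUNT_OVERRIDES.get((engine, account), {}))
    params.update(task_params or {})
    return params
```

For example, assemble_params("spark", "etl", "audience_team") would combine the Spark defaults with the ETL memory override and the account-level shuffle setting.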
Adaptive Engine Execution
If the chosen engine fails, the system switches to another engine to maintain SLA.
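The fallback behaviour can be sketched as a loop over an ordered list of engines. This is a minimal illustration, not Hera's implementation: EngineError, run_with_fallback, and the runner callables are hypothetical names, and the sketch assumes the SQL has already been made valid for every engine in the list.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class EngineError(Exception):
    """Raised by a runner when an engine cannot complete the query."""


def run_with_fallback(
    sql: str,
    engines: Sequence[str],
    runners: Dict[str, Callable[[str], list]],
) -> Tuple[str, list]:
    """Try each engine in order and return (engine, rows) from the first success."""
    failures: List[Tuple[str, str]] = []
    for engine in engines:
        try:
            return engine, runners[engine](sql)
        except EngineError as exc:
            # Remember why this engine failed, then move on to the next one.
            failures.append((engine, str(exc)))
    raise EngineError(f"all engines failed: {failures}")
```

In a real service the engine order would presumably depend on the task type and current load, and each fallback would need SQL rewritten for the next engine's dialect.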
SQL Construction
Supports single‑table, star, and snowflake schemas for dimensional modeling.
Single‑table: one fact table (e.g., DWS or ADS summary).
Star: one fact table + N dimension tables.
Snowflake: fact table + N dimension tables + M indirect dimension tables.
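The three shapes differ only in how many joins are attached to the fact table, so a single query description can cover all of them. The following sketch assumes a hypothetical QuerySpec structure and uses LEFT JOIN for dimension tables; neither detail comes from the article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Join:
    table: str  # dimension table name
    alias: str  # alias used in the query
    on: str     # join condition


@dataclass
class QuerySpec:
    fact_table: str
    fact_alias: str
    select: List[str]
    joins: List[Join] = field(default_factory=list)  # empty for the single-table case
    group_by: List[str] = field(default_factory=list)


def build_sql(spec: QuerySpec) -> str:
    """Render a single-table, star, or snowflake query from one spec."""
    lines = [
        "SELECT " + ", ".join(spec.select),
        f"FROM {spec.fact_table} {spec.fact_alias}",
    ]
    # Star: every join condition references the fact alias.
    # Snowflake: later joins may reference an earlier dimension alias.
    for j in spec.joins:
        lines.append(f"LEFT JOIN {j.table} {j.alias} ON {j.on}")
    if spec.group_by:
        lines.append("GROUP BY " + ", ".join(spec.group_by))
    return "\n".join(lines)
```

A snowflake query is expressed by listing the indirect dimension after the dimension it hangs off, so that its join condition can refer to that dimension's alias.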
Custom Lisp Syntax for Metric Formulas
Lisp provides a unified way to describe metric calculations, abstracting engine‑specific syntax. Examples include aggregation expressions like (count x [y,z...]), conditional expressions, type casting, and generic function calls.
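To illustrate how a Lisp-style formula can hide engine differences, here is a small sketch that parses an S-expression and renders it as SQL. The forms it accepts (infix operators, if, approx-distinct) are hypothetical and are not Hera's actual grammar; the per-engine function names (approx_distinct in Presto, uniq in ClickHouse, approx_count_distinct in Spark) are the engines' own approximate distinct-count functions.

```python
import re
from typing import List, Union

Expr = Union[str, List["Expr"]]

# Engine-specific spelling of an approximate distinct count.
APPROX_DISTINCT = {
    "presto": "approx_distinct",
    "clickhouse": "uniq",
    "spark": "approx_count_distinct",
}
INFIX = {"+", "-", "*", "/", ">", "<", "="}


def parse(text: str) -> Expr:
    """Parse one S-expression into nested lists of string tokens."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read() -> Expr:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items: List[Expr] = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        return tok

    return read()


def to_sql(expr: Expr, engine: str) -> str:
    """Render a parsed expression as SQL for the given engine."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    sql_args = [to_sql(a, engine) for a in args]
    if op in INFIX:
        return "(" + f" {op} ".join(sql_args) + ")"
    if op == "if":
        cond, then, other = sql_args
        return f"CASE WHEN {cond} THEN {then} ELSE {other} END"
    if op == "approx-distinct":
        return f"{APPROX_DISTINCT[engine]}({sql_args[0]})"
    # Generic function call: (f a b) becomes f(a, b).
    return f"{op}({', '.join(sql_args)})"
```

With this sketch, the formula (/ (sum sales) (approx-distinct user_id)) is written once and rendered with a different distinct-count function for each engine.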
Task Scheduling
The scheduler is built on Netty for zero‑copy data transfer and uses a dedicated thread pool for business logic. It supports multi‑queue and multi‑user scheduling with weight‑based scoring that considers queue size, parallelism, and task timeout.
Scoring formula: score = jobWeight + queueDynamicFactor + queueWeight.
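The article gives only the sum, not how each term is computed. The sketch below fills the gap with a made-up queueDynamicFactor based on queue fill and used parallelism (task timeout is left out), and it assumes the task with the highest score runs first. All names and the factor formula are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class QueueState:
    weight: float         # static queue weight (queueWeight)
    waiting: int          # tasks currently waiting in the queue
    capacity: int         # maximum queue length
    running: int          # tasks currently running from this queue
    max_parallelism: int  # allowed concurrent tasks for this queue


def queue_dynamic_factor(q: QueueState) -> float:
    # Hypothetical formula: fuller queues get a boost, queues that already
    # use much of their parallelism get a penalty.
    return q.waiting / q.capacity - q.running / q.max_parallelism


def score(job_weight: float, q: QueueState) -> float:
    # score = jobWeight + queueDynamicFactor + queueWeight
    return job_weight + queue_dynamic_factor(q) + q.weight


def pick_next(candidates: Sequence[Tuple[str, float, QueueState]]) -> str:
    """Return the job id with the highest score (assumed to be scheduled first)."""
    return max(candidates, key=lambda c: score(c[1], c[2]))[0]
```

Each candidate is a (job id, job weight, queue state) tuple, so two jobs with the same weight can still be ordered by the state of the queues they sit in.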
SQL Job Flow
Clients submit raw SQL (e.g., Presto). The SQLParser rewrites it for the target engine(s). The Master schedules the job to Workers; if the primary engine fails, other engines are tried. Results are sent directly to the client via zero‑copy.
Metrics Collection
Collects static metrics (master/worker/client info) and dynamic metrics (runtime memory usage, queue snapshots) via heartbeats.
Usage Statistics
Current daily calls: more than 9 million for B‑to‑C services and more than 1.5 million for B‑to‑B services (engine‑side).
ETL tasks finish in ~3 minutes.
About 90% of adhoc queries (Spark, Presto, ClickHouse) complete within 2 seconds; for ClickHouse, 99% complete within 1 second.
Performance Issues Solved
Hera improves SLA for audience (group) calculations, data migration, and data product reliability. Key improvements:
Co‑locating compute and storage reduces network traffic.
Mitigates HDFS hotspot and tail latency.
Supports online‑offline hybrid audience tasks.
Alluxio Cache Table Synchronization
Hive tables on HDFS are mirrored to Alluxio by replacing the location path. A periodic task detects new HDFS partitions and issues a SYN2ALLUXIO job to add matching partitions to the Alluxio table, after which Alluxio automatically syncs data.
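The partition-detection step can be sketched as a set difference followed by one ADD PARTITION statement per missing partition. This is only an illustration: the function names, table name, partition format, and Alluxio location are hypothetical, and the article does not describe what the SYN2ALLUXIO job actually executes.

```python
from typing import Iterable, List


def missing_partitions(hdfs_parts: Iterable[str], alluxio_parts: Iterable[str]) -> List[str]:
    """Partitions present on the HDFS table but not yet on the Alluxio table."""
    return sorted(set(hdfs_parts) - set(alluxio_parts))


def add_partition_ddl(table: str, partition: str, base_location: str) -> str:
    """Build Hive DDL that registers one partition, e.g. 'dt=20240101'."""
    key, value = partition.split("=", 1)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({key}='{value}') "
        f"LOCATION '{base_location}/{partition}'"
    )


def sync_statements(
    table: str,
    base_location: str,
    hdfs_parts: Iterable[str],
    alluxio_parts: Iterable[str],
) -> List[str]:
    """DDL statements a periodic sync task could submit for new partitions."""
    return [
        add_partition_ddl(table, p, base_location)
        for p in missing_partitions(hdfs_parts, alluxio_parts)
    ]
```

Because the statements use IF NOT EXISTS, rerunning the same sync after a partial failure would not create duplicate partitions.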
Audience Calculation Task
A typical audience task is a Spark SQL statement that inserts audience IDs selected from a Hive table. When the underlying table is cached in Alluxio, the SQL is rewritten to read from the Alluxio table, which yields a 10% to 30% speedup.
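A simple way to picture the rewrite is a lookup that swaps cached table names for their Alluxio counterparts. The sketch below is an assumption-heavy illustration: the table names and the cached-table mapping are hypothetical, and a regex substitution stands in for whatever SQL parser Hera really uses (it would also touch matching names inside string literals).

```python
import re
from typing import Dict


def rewrite_for_alluxio(sql: str, cached_tables: Dict[str, str]) -> str:
    """Replace db.table names that have an Alluxio-backed copy."""

    def swap(match: "re.Match[str]") -> str:
        name = match.group(0)
        # Unknown names (including alias.column references) are left unchanged.
        return cached_tables.get(name.lower(), name)

    # Match identifiers of the form db.table.
    return re.sub(r"\b[A-Za-z_]\w*\.[A-Za-z_]\w*\b", swap, sql)


# Hypothetical audience task and cache mapping.
AUDIENCE_SQL = (
    "INSERT INTO TABLE ads.audience_result "
    "SELECT user_id FROM dw.user_orders WHERE dt = '20240101'"
)
CACHED = {"dw.user_orders": "dw_alluxio.user_orders"}
REWRITTEN = rewrite_for_alluxio(AUDIENCE_SQL, CACHED)
```

Only the source table is redirected; the target table of the INSERT stays on HDFS because it has no entry in the mapping.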
Conclusion
Hera now supports many production services but still has open challenges, such as handling engine‑specific function differences (e.g., Presto vs. ClickHouse) and further improving HA/DR deployment on Kubernetes.
