How Hera Transforms Vipshop’s Data Service for Scalable E‑Commerce Analytics

This article explains how Vipshop has built the Hera data service since 2019 to provide a unified API for warehouse access. It covers the background, the architecture, the core features (multi-engine queries, adaptive execution, custom Lisp syntax, task scheduling, metrics collection), and the performance gains for both B-to-B and B-to-C workloads.

ITFLY8 Architecture Home

Aug 3, 2021

Data services are a key component of a data middle platform: they act as the unified entry point for data warehouse access. They expose the warehouse as a single database behind an API and handle data inflow and outflow to meet diverse access needs.

Vipshop’s e‑commerce platform began building its own data service, Hera, in 2019. Starting from scratch, it now serves over 30 business lines with both B‑to‑B and B‑to‑C data services.

Background

Before a unified data service, warehouse access suffered from low efficiency and inconsistent metrics. Notable problems included:

Advertising audience systems (USP, DMP) had to stream massive data volumes out of HiveServer, which caused severe latency under resource constraints.

Each table needed a separate interface for different query engines (Presto, ClickHouse), causing interface explosion and maintenance overhead.

Common metrics (sales, orders, PV, UV) were duplicated across data products with inconsistent definitions, making data quality verification difficult.

The new data service addresses these issues by abstracting storage and compute engines, offering a single API, layered data storage, engine‑agnostic SQL generation, adaptive execution, unified caching, and data registration with authorization.

Architecture Design

Hera follows a classic master‑slave model with separate data and control paths for high availability. It consists of three layers:

Application Access Layer – supports TCP client, HTTP, and internal RPC (OSP) interfaces.

Data Service Layer – handles routing, multi‑engine support, resource configuration, dynamic engine parameter assembly, SQLLisp engine generation, adaptive SQL execution, unified query cache, and FreeMarker‑based SQL generation.

Data Layer – provides unified API for data stored in warehouses, ClickHouse, MySQL, Redis, etc.

Main Functions

Multi‑Queue Scheduling Strategy

Tasks are assigned to queues based on user, task type, and weight, ensuring SLA for different workloads.

Multi‑Engine Query

Supports Spark, Presto, ClickHouse, Hive, MySQL, Redis, selecting the best engine per scenario.

Multiple Task Types

Handles ETL, adhoc, file export, and data import, enabling combinations like Spark‑adhoc and Presto‑adhoc.

File Export

Facilitates large‑scale data export for downstream analysis such as coupon distribution.

Resource Isolation

Separates core and non‑core workloads at both worker and engine resource levels.

Dynamic Engine Parameter Assembly

Automatically assembles and adjusts engine parameters per task, engine type, account, or business scenario.
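
One way to picture this kind of parameter assembly is as a layered merge, where more specific layers override more general ones. The sketch below is an illustration only: the layer tables, the account name `ads_team`, and the function `assemble_params` are hypothetical and are not taken from Hera.

```python
from typing import Dict, Optional

# Hypothetical parameter layers; keys and values are illustrative only.
ENGINE_DEFAULTS = {
    "spark": {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"},
    "presto": {"query_max_run_time": "10m"},
}
TASK_TYPE_OVERRIDES = {
    ("spark", "etl"): {"spark.executor.memory": "8g"},
}
ACCOUNT_OVERRIDES = {
    "ads_team": {"spark.sql.shuffle.partitions": "400"},
}


def assemble_params(engine: str, task_type: str, account: str,
                    task_params: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge layers from least to most specific; later layers win."""
    params: Dict[str, str] = {}
    params.update(ENGINE_DEFAULTS.get(engine, {}))
    params.update(TASK_TYPE_OVERRIDES.get((engine, task_type), {}))
    params.update(ACCOUNT_OVERRIDES.get(account, {}))
    params.update(task_params or {})
    return params
```

With these tables, `assemble_params("spark", "etl", "ads_team")` takes the executor memory from the task-type layer and the shuffle partitions from the account layer.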

Adaptive Engine Execution

If the chosen engine fails, the system switches to another engine to maintain SLA.
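
The fallback behaviour can be sketched as a loop over engines in priority order. This is a minimal illustration under stated assumptions: `EngineError`, `run_with_fallback`, and the runner callables are hypothetical names, and the sketch ignores the SQL rewriting a real switch between engines would need.

```python
from typing import Callable, Dict, List, Tuple


class EngineError(Exception):
    """Raised by a (hypothetical) engine runner when a query fails."""


def run_with_fallback(sql: str, engines: List[str],
                      runners: Dict[str, Callable[[str], list]]) -> Tuple[str, list]:
    """Try engines in priority order; return (engine, rows) from the first success."""
    errors = []
    for engine in engines:
        try:
            return engine, runners[engine](sql)
        except EngineError as exc:
            # Remember why this engine failed, then move on to the next one.
            errors.append(f"{engine}: {exc}")
    raise EngineError("all engines failed: " + "; ".join(errors))
```

A caller would pass the preferred engine first, for example `["presto", "spark"]`, so the slower but more robust engine is only used when the fast one fails.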

SQL Construction

Supports single‑table, star, and snowflake schemas for dimensional modeling.

Single‑table: one fact table (e.g., DWS or ADS summary).

Star: one fact table + N dimension tables.

Snowflake: fact table + N dimension tables + M indirect dimension tables.
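
The three schema shapes differ only in which joins are attached to the fact table, which the following sketch tries to show. It uses plain string assembly rather than the FreeMarker templates mentioned earlier, and the `Model`, `Join`, and `build_sql` names, as well as the table and column names, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Join:
    table: str   # dimension table name
    alias: str   # alias used in the generated SQL
    on: str      # join condition; may reference the fact or another dimension


@dataclass
class Model:
    fact: str
    fact_alias: str = "f"
    joins: List[Join] = field(default_factory=list)


def build_sql(model: Model, dimensions: List[str], metrics: List[str]) -> str:
    """Generate an aggregate query over a single-table, star, or snowflake model."""
    lines = [
        "SELECT " + ", ".join(dimensions + metrics),
        f"FROM {model.fact} {model.fact_alias}",
    ]
    for j in model.joins:
        lines.append(f"LEFT JOIN {j.table} {j.alias} ON {j.on}")
    if dimensions:
        lines.append("GROUP BY " + ", ".join(dimensions))
    return "\n".join(lines)
```

A single-table model has no joins, a star model joins every dimension to the fact alias, and a snowflake model adds joins whose condition references another dimension's alias instead of the fact table.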

Custom Lisp Syntax for Metric Formulas

Lisp provides a unified way to describe metric calculations, abstracting engine‑specific syntax. Examples include aggregation expressions like (count x [y,z...]), conditional expressions, type casting, and generic function calls.
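
To illustrate how a Lisp-style formula can hide engine differences, here is a small sketch that parses an s-expression and renders it as SQL for a chosen engine. The grammar and the operator names (`if`, `approx_distinct`) are assumptions made for this example and are not Hera's actual syntax. The function mapping reflects real engine functions: `approx_distinct` in Presto, `uniq` in ClickHouse, and `approx_count_distinct` in Spark SQL.

```python
from typing import List, Union

Expr = Union[str, List["Expr"]]

# The same logical operation has a different function name per engine.
APPROX_DISTINCT = {
    "presto": "approx_distinct",
    "clickhouse": "uniq",
    "spark": "approx_count_distinct",
}


def parse(text: str) -> Expr:
    """Parse one s-expression into nested lists of string atoms."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Expr:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing ")"
            return items
        return tok

    return read()


def to_sql(expr: Expr, engine: str) -> str:
    """Render a parsed expression as SQL text for the given engine."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    sql_args = [to_sql(a, engine) for a in args]
    if op in ("+", "-", "*", "/"):
        return "(" + f" {op} ".join(sql_args) + ")"
    if op in (">", "<", "="):
        return f"{sql_args[0]} {op} {sql_args[1]}"
    if op == "if":
        cond, then, other = sql_args
        return f"CASE WHEN {cond} THEN {then} ELSE {other} END"
    if op == "approx_distinct":
        return f"{APPROX_DISTINCT[engine]}({sql_args[0]})"
    # Fallback: treat the operator as a generic function call.
    return f"{op}({', '.join(sql_args)})"
```

For example, `to_sql(parse("(/ (sum amount) (approx_distinct user_id))"), "clickhouse")` yields `(sum(amount) / uniq(user_id))`, while the same formula rendered for Presto uses `approx_distinct`. The parser does no error handling, so malformed input raises an `IndexError`.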

Task Scheduling

The scheduler is built on Netty for zero-copy data transfer and runs business logic on a dedicated thread pool. It supports multi-queue and multi-user scheduling with weight-based scoring that considers queue size, parallelism, and task timeout.

Scoring formula: score = jobWeight + queueDynamicFactor + queueWeight.
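
The article gives the formula but not how `queueDynamicFactor` is derived, so the sketch below fills that part with an assumption: a fuller backlog raises a queue's priority and saturated parallelism lowers it. Task timeout is left out for brevity. The `Queue` fields and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Queue:
    name: str
    weight: float         # queueWeight in the formula
    pending: int          # tasks waiting in the queue
    capacity: int         # maximum queue size
    running: int          # tasks currently executing
    max_parallelism: int  # parallelism limit for the queue


def queue_dynamic_factor(q: Queue) -> float:
    """ASSUMPTION: backlog ratio minus busy ratio; not the article's definition."""
    backlog = q.pending / q.capacity if q.capacity else 0.0
    busy = q.running / q.max_parallelism if q.max_parallelism else 1.0
    return backlog - busy


def score(job_weight: float, q: Queue) -> float:
    """score = jobWeight + queueDynamicFactor + queueWeight"""
    return job_weight + queue_dynamic_factor(q) + q.weight


def pick_next(candidates: List[Tuple[float, Queue]]) -> Tuple[float, Queue]:
    """Return the (job_weight, queue) pair with the highest score."""
    return max(candidates, key=lambda c: score(c[0], c[1]))
```

Because the static queue weight usually dominates, the dynamic factor in this sketch mainly breaks ties between queues of similar priority.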

SQL Job Flow

Clients submit raw SQL written for one engine (e.g., Presto). The SQLParser rewrites it for the target engine(s). The Master schedules the job to Workers; if the primary engine fails, other engines are tried. Results are sent directly to the client via zero-copy.

Metrics Collection

Collects static metrics (master/worker/client info) and dynamic metrics (runtime memory usage, queue snapshots) via heartbeats.
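
As a rough picture of what such a heartbeat might carry, the sketch below models static and dynamic metrics as dataclasses serialized to JSON. The field names and the JSON encoding are assumptions for illustration; the article does not describe Hera's wire format.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class StaticMetrics:
    node_id: str
    role: str   # "master", "worker", or "client"
    host: str


@dataclass
class Heartbeat:
    static: StaticMetrics
    memory_used_mb: float            # dynamic: runtime memory usage
    queue_snapshot: Dict[str, int]   # dynamic: queue name -> pending tasks
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # asdict() converts the nested StaticMetrics dataclass as well.
        return json.dumps(asdict(self), sort_keys=True)
```

A worker would build a fresh `Heartbeat` on each interval, so the dynamic fields always reflect the current runtime state while the static part stays constant.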

Usage Statistics

Current daily calls: more than 9 million for to-C services and more than 1.5 million for to-B services (engine side).

ETL tasks finish in ~3 minutes.

About 90% of adhoc queries (Spark, Presto, ClickHouse) complete within 2 seconds; for ClickHouse, 99% complete within 1 second.

Performance Issues Solved

Hera improves SLA for audience (group) calculations, data migration, and data product reliability. Key improvements:

Co-locates compute and storage, reducing network traffic.

Mitigates HDFS hotspots and tail latency.

Supports online-offline hybrid audience tasks.

Alluxio Cache Table Synchronization

Hive tables on HDFS are mirrored to Alluxio by replacing the location path. A periodic task detects new HDFS partitions and issues a SYN2ALLUXIO job to add matching partitions to the Alluxio table, after which Alluxio automatically syncs data.
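
The partition-detection step can be sketched as a set difference followed by generated Hive DDL. Everything named here is hypothetical, including the function names, the table name, and the URI prefixes; the sketch only shows the idea of swapping the location prefix and adding the missing partitions.

```python
from typing import Dict, Iterable, List


def to_alluxio_location(hdfs_location: str, hdfs_prefix: str, alluxio_prefix: str) -> str:
    """Swap the HDFS prefix of a partition location for the Alluxio prefix."""
    if not hdfs_location.startswith(hdfs_prefix):
        raise ValueError(f"unexpected location: {hdfs_location}")
    return alluxio_prefix + hdfs_location[len(hdfs_prefix):]


def sync_statements(alluxio_table: str,
                    hdfs_partitions: Dict[str, str],
                    alluxio_partitions: Iterable[str],
                    hdfs_prefix: str,
                    alluxio_prefix: str) -> List[str]:
    """Build ADD PARTITION statements for partitions missing from the Alluxio table.

    hdfs_partitions maps a partition spec (e.g. "dt='2021-08-01'") to its HDFS path.
    """
    known = set(alluxio_partitions)
    statements = []
    for spec in sorted(hdfs_partitions):
        if spec in known:
            continue  # already mirrored
        location = to_alluxio_location(hdfs_partitions[spec], hdfs_prefix, alluxio_prefix)
        statements.append(
            f"ALTER TABLE {alluxio_table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'"
        )
    return statements
```

A periodic job would list the partitions of both tables, call `sync_statements`, and submit the resulting DDL; `IF NOT EXISTS` keeps the job safe to rerun.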

Audience Calculation Task

Example Spark SQL inserts audience IDs from a Hive table. When the underlying table is cached in Alluxio, the SQL is rewritten to use the Alluxio table, achieving a 10%‑30% speedup.
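
The rewrite step can be approximated with a table-name substitution. The sketch below is a simplification built on a regular expression rather than a SQL parser, so it would also touch matching text inside string literals. The function `rewrite_tables` and all table names are hypothetical.

```python
import re
from typing import Dict


def rewrite_tables(sql: str, cached: Dict[str, str]) -> str:
    """Replace each cached Hive table name with its Alluxio-backed counterpart."""
    if not cached:
        return sql
    # Longest names first so a short name never shadows a longer one.
    names = sorted(cached, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.])(" + "|".join(re.escape(n) for n in names) + r")(?![\w.])"
    )
    return pattern.sub(lambda m: cached[m.group(1)], sql)
```

Only tables present in the mapping are rewritten, so the insert target and any uncached source tables keep reading from and writing to HDFS.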

Conclusion

Hera now supports many production services but still has open challenges, such as handling engine‑specific function differences (e.g., Presto vs. ClickHouse) and further improving HA/DR deployment on Kubernetes.

