How We Built a Scalable Dump Index Architecture for 60M Users and 1.3B Products

Facing the challenges of searching across 60 million users and 1.3 billion products, Weidian’s engineering team designed a dump‑based indexing pipeline—Ergate—that consolidates, transforms, version‑controls, and monitors data from MySQL to HBase, enabling fast, flexible, and reliable search across massive datasets.

Weidian Tech Team

Feb 24, 2017

Introduction

Over the past year, the Weidian tech team changed dramatically: projects were built from scratch, and technical solutions moved from concept to production. With 60 million users and 1.3 billion products as the underlying data, providing effective indexing and retrieval became critical.

1. Why a dump index is needed

Many scenarios require filtering massive data with complex conditions and sorting results. Weidian stores raw data in MySQL with sharding, but direct MySQL queries suffer from:

Data spread across many databases and tables, requiring multiple queries.

Complex MySQL queries cannot meet performance needs.

Lack of fuzzy matching capabilities.

Difficulties implementing personalized ranking.

Therefore, many products turn to search engines, whose index data is built from MySQL dumps.

2. What the dump must do

The dump process does more than sync data; it handles association, transformation, filtering, version control, reconciliation, and compensation.

Associated data

The dump outputs a single index table containing all data required for a dimension or business, e.g., product base data, extensions, categories, and possibly shop information. These pieces reside in separate MySQL tables and must be joined via keys—the core of the dump workflow.
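The association step can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the four dictionaries stand in for separate MySQL tables, and the field names (item_id, shop_id, category_id) are hypothetical, not Weidian's actual schema.

```python
from typing import Any, Dict, Optional

# Hypothetical rows standing in for separate MySQL tables, keyed by primary key.
product_base = {101: {"item_id": 101, "title": "Tea cup", "shop_id": 7, "category_id": 3}}
product_ext = {101: {"item_id": 101, "sold": 42}}
categories = {3: {"category_id": 3, "category_name": "Kitchen"}}
shops = {7: {"shop_id": 7, "shop_name": "Daily Goods"}}


def build_index_row(item_id: int) -> Optional[Dict[str, Any]]:
    """Join base, extension, category, and shop data into one flat index row."""
    base = product_base.get(item_id)
    if base is None:
        return None  # nothing to index without the base record
    row = dict(base)
    # Each lookup follows a key found in the base record; missing pieces are skipped.
    row.update(product_ext.get(item_id, {}))
    row.update(categories.get(base["category_id"], {}))
    row.update(shops.get(base["shop_id"], {}))
    return row


print(build_index_row(101))
```

The result is a single wide row per product, which is the shape a search index expects.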

Transformation & filtering

During dumping, data may be transformed (e.g., converting price from yuan to cents) and filtered (e.g., removing prohibited products).
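A small Python sketch of these two operations follows. The field names (price_yuan, price_cents, status) and the "prohibited" marker are assumptions chosen for the example; the source does not describe how prohibited products are flagged.

```python
from decimal import Decimal
from typing import Any, Dict, Iterable, List


def yuan_to_cents(price_yuan: str) -> int:
    """Convert a decimal yuan string to integer cents without float rounding."""
    return int((Decimal(price_yuan) * 100).to_integral_value())


def transform_and_filter(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop prohibited products and replace price_yuan with price_cents."""
    out = []
    for row in rows:
        if row.get("status") == "prohibited":  # hypothetical filter rule
            continue
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["price_cents"] = yuan_to_cents(new_row.pop("price_yuan"))
        out.append(new_row)
    return out


rows = [
    {"item_id": 1, "price_yuan": "19.99", "status": "ok"},
    {"item_id": 2, "price_yuan": "5", "status": "prohibited"},
]
print(transform_and_filter(rows))
```

Using Decimal rather than float avoids off-by-one-cent errors on values such as 19.99.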

Version control

To keep index data consistent with source data, each message carries a version number; newer messages overwrite older ones, ensuring the final index matches the database.
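The following Python sketch shows one way a version check can make out-of-order delivery harmless. It is an in-memory illustration only: the message layout (item_id, version) and the apply_message helper are hypothetical, and the article does not say how versions are generated.

```python
from typing import Any, Dict


def apply_message(index: Dict[int, Dict[str, Any]], message: Dict[str, Any]) -> bool:
    """Write the message into the index only if it is newer than what is stored."""
    key = message["item_id"]
    current = index.get(key)
    if current is not None and current["version"] >= message["version"]:
        return False  # stale or duplicate message: ignore it
    index[key] = message
    return True


index: Dict[int, Dict[str, Any]] = {}
# Version 2 arrives before version 1, for example after a retry.
apply_message(index, {"item_id": 101, "version": 2, "title": "Tea cup (new)"})
apply_message(index, {"item_id": 101, "version": 1, "title": "Tea cup (old)"})
print(index[101]["title"])
```

Whatever order the messages arrive in, the index ends up holding the highest version seen.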

Reconciliation & compensation

A robust system includes monitoring that reconciles index data with source data and triggers compensation and alerts when inconsistencies are detected.
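A reconciliation pass might look like the Python sketch below. It assumes both sides fit in dictionaries keyed by record ID, which is a simplification for illustration; the reconcile and compensate helpers and the report categories are hypothetical names, not part of the system described here.

```python
from typing import Any, Dict, List


def reconcile(source: Dict[int, Any], index: Dict[int, Any]) -> Dict[str, List[int]]:
    """Compare index data with source data and report the differences."""
    return {
        "missing": sorted(k for k in source if k not in index),
        "stale": sorted(k for k in source if k in index and index[k] != source[k]),
        "orphaned": sorted(k for k in index if k not in source),
    }


def compensate(source: Dict[int, Any], index: Dict[int, Any],
               report: Dict[str, List[int]]) -> bool:
    """Repair the index from the source; return True if an alert should fire."""
    for k in report["missing"] + report["stale"]:
        index[k] = source[k]
    for k in report["orphaned"]:
        del index[k]
    return any(report.values())


source = {1: "a", 2: "b", 3: "c"}
index = {1: "a", 2: "old", 4: "d"}
report = reconcile(source, index)
alert = compensate(source, index, report)
print(report, alert, index == source)
```

In a real pipeline the comparison would more likely run on samples or checksums than on full copies of both datasets.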

Initial dump architecture diagram:

3. Platformization and Standardization

After the first version, several issues emerged:

Each new dump request required provisioning a new machine, wasting resources.

The dump development process was cumbersome and error‑prone.

Separate pipelines for full and incremental indexes prevented code reuse.

Version control worked at the level of whole records, which affected the entire workflow.

To address these, the Dump Center platform “Ergate” (工蚁) was created, aiming for standardized, accurate, and efficient data movement.

Ergate architecture diagram:

Ergate’s new features:

Fast development: dump jobs can be created via a form in minutes.

Resource saving: new dump services are added by submitting an MR‑job, without new machines.

Data accuracy: standardized workflow ensures correct results when configured properly.

Easy extensibility: the modular design supports input sources such as MySQL, HDFS, VSS, and HBase. Output is currently limited to vsearch index files, with Elasticsearch support planned.
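One common way to get this kind of modularity is a registry of pluggable readers and writers, sketched below in Python. This is not Ergate's actual design: the registries, decorators, job dictionary, and stand-in reader and writer are all hypothetical and exist only to show the idea.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

# Hypothetical registries mapping a name from the job form to an implementation.
SOURCES: Dict[str, Callable[[], Iterable[Row]]] = {}
SINKS: Dict[str, Callable[[Iterable[Row]], List[Row]]] = {}


def source(name: str):
    def decorator(fn):
        SOURCES[name] = fn
        return fn
    return decorator


def sink(name: str):
    def decorator(fn):
        SINKS[name] = fn
        return fn
    return decorator


@source("mysql")
def read_mysql() -> Iterable[Row]:
    # Stand-in for a real reader; returns fixed rows for the example.
    return [{"item_id": 101, "title": "Tea cup"}]


@sink("vsearch")
def write_vsearch(rows: Iterable[Row]) -> List[Row]:
    # Stand-in for writing index files; just materializes the rows.
    return list(rows)


def run_job(job: Dict[str, str]) -> List[Row]:
    """Look up the configured reader and writer by name and connect them."""
    rows = SOURCES[job["input"]]()
    return SINKS[job["output"]](rows)


print(run_job({"input": "mysql", "output": "vsearch"}))
```

Under this arrangement, adding a new output such as Elasticsearch means registering one more sink, with no change to run_job.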

Because HBase natively stores a timestamp with every cell (that is, per field), Ergate handles version control without developers needing to manage versions manually.
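The effect of per-field timestamps can be simulated in plain Python. The sketch below does not use the HBase client API; the put_fields and read_row helpers and the nested-dictionary table are assumptions that mimic the "latest timestamp wins per cell" behavior.

```python
from typing import Any, Dict, Tuple

# table[row_key][field] = (timestamp, value); an in-memory stand-in for HBase cells.
Table = Dict[str, Dict[str, Tuple[int, Any]]]


def put_fields(table: Table, row_key: str, fields: Dict[str, Any], ts: int) -> None:
    """Write each field with its timestamp, keeping the newest value per field."""
    row = table.setdefault(row_key, {})
    for name, value in fields.items():
        cell = row.get(name)
        if cell is None or ts >= cell[0]:
            row[name] = (ts, value)


def read_row(table: Table, row_key: str) -> Dict[str, Any]:
    """Return the latest value of every field in the row."""
    return {name: value for name, (_, value) in table.get(row_key, {}).items()}


table: Table = {}
# A price update at ts=200 arrives before an older full record at ts=100.
put_fields(table, "101", {"price_cents": 1999}, ts=200)
put_fields(table, "101", {"price_cents": 1500, "title": "Tea cup"}, ts=100)
print(read_row(table, "101"))
```

Because each field is versioned independently, the late-arriving older record still contributes its title without rolling back the newer price, which a single record-level version cannot do.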

4. Future Outlook

Ergate is positioned as a universal dump solution platform. The first version is live, and future directions include:

Business: promote the platform across all Weidian teams as a unified solution for both batch (full) and streaming (incremental) data.

Technical: use Spark to accelerate full‑index computation and JStorm to improve real‑time performance and throughput of incremental indexing.

