How We Built a Scalable Dump Index Architecture for 60M Users and 1.3B Products
To support search across 60 million users and 1.3 billion products, Weidian's engineering team designed Ergate, a dump-based indexing pipeline that consolidates, transforms, version-controls, and monitors data on its way from MySQL to HBase, enabling fast, flexible, and reliable search across massive datasets.
Introduction
The Weidian tech team changed dramatically over the past year: projects were built from scratch, and technical solutions went from concept to production. With 60 million users and 1.3 billion products as the underlying data, providing effective indexing and retrieval became critical.
1. Why a dump index is needed
Many scenarios require filtering massive data with complex conditions and sorting results. Weidian stores raw data in MySQL with sharding, but direct MySQL queries suffer from:
Data spread across many databases and tables, requiring multiple queries.
Complex MySQL queries cannot meet performance needs.
Lack of fuzzy matching capabilities.
Difficulties implementing personalized ranking.
Many business features therefore turn to a search engine, whose index data is built from dumps of the MySQL data.
2. What the dump must do
The dump process does more than sync data; it handles association, transformation, filtering, version control, reconciliation, and compensation.
Associated data
The dump outputs a single index table containing all the data required for one dimension or business, e.g., product base data, extensions, categories, and possibly shop information. These pieces reside in separate MySQL tables and must be joined by key. This join is the core of the dump workflow.
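The association step can be illustrated with a minimal sketch. It uses in-memory dictionaries as stand-ins for the separate MySQL tables; the table names, field names, and the `build_index_doc` helper are all hypothetical and are not taken from Ergate itself.

```python
from typing import Optional

# Hypothetical rows, as they might come from separate MySQL tables.
product_base = {101: {"title": "Canvas tote", "shop_id": 7, "category_id": 3}}
product_ext = {101: {"stock": 12}}
categories = {3: {"category_name": "Bags"}}
shops = {7: {"shop_name": "Lily's Store"}}


def build_index_doc(product_id: int) -> Optional[dict]:
    """Assemble one flat index document by following keys across tables."""
    base = product_base.get(product_id)
    if base is None:
        return None  # no base row, so there is nothing to index
    doc = {"product_id": product_id, **base}
    # Optional associations: a missing row simply contributes no fields.
    doc.update(product_ext.get(product_id, {}))
    doc.update(categories.get(base["category_id"], {}))
    doc.update(shops.get(base["shop_id"], {}))
    return doc
```

Calling `build_index_doc(101)` yields one flat record that carries the product, extension, category, and shop fields together, which is the shape a search index typically expects.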
Transformation & filtering
During dumping, data may be transformed (e.g., converting price from yuan to cents) and filtered (e.g., removing prohibited products).
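A small sketch of such a step follows. The field names (`price_yuan`, `price_cents`, `status`) and the `prohibited` status value are assumptions made for illustration. `Decimal` is used so that the yuan-to-cents conversion does not pick up floating-point rounding errors.

```python
from decimal import Decimal
from typing import Optional

PROHIBITED_STATUS = "prohibited"  # hypothetical status value


def yuan_to_cents(price_yuan: str) -> int:
    """Convert a decimal yuan string to integer cents without float error."""
    return int(Decimal(price_yuan) * 100)


def transform(record: dict) -> Optional[dict]:
    """Return the index-ready record, or None if it must be filtered out."""
    if record.get("status") == PROHIBITED_STATUS:
        return None  # filtered: prohibited products never reach the index
    out = {k: v for k, v in record.items() if k != "price_yuan"}
    out["price_cents"] = yuan_to_cents(record["price_yuan"])
    return out
```

Returning `None` for filtered records keeps the transform and filter decisions in one place, so the caller only has to skip empty results.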
Version control
To keep index data consistent with the source data, each update carries a version number. A newer message overwrites an older one, so the final index always matches the database.
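The overwrite rule can be sketched as a version check on write. The message layout (`id`, `version`, `data`) and the in-memory `index` dictionary are assumptions for illustration, not Weidian's actual message format.

```python
index = {}  # record id -> {"version": int, "data": dict}


def apply_message(msg: dict) -> bool:
    """Apply msg only if its version is newer than the stored one."""
    current = index.get(msg["id"])
    if current is not None and msg["version"] <= current["version"]:
        return False  # stale or duplicate message: ignore it
    index[msg["id"]] = {"version": msg["version"], "data": msg["data"]}
    return True
```

With this check, messages that arrive out of order cannot roll the index back to an older state: a version 1 message that arrives after version 2 is simply dropped.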
Reconciliation & compensation
A robust system includes monitoring that reconciles index data with source data and triggers compensation and alerts when inconsistencies are detected.
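As a rough sketch of what reconciliation and compensation might look like, the code below compares two in-memory snapshots keyed by record id. The `reconcile` and `compensate` helpers and the report layout are hypothetical. A production check would sample or page through both stores and raise an alert rather than compare everything in memory.

```python
def reconcile(source: dict, index: dict) -> dict:
    """Compare source and index snapshots and report every mismatch."""
    return {
        "missing": sorted(k for k in source if k not in index),
        "stale": sorted(k for k in source if k in index and index[k] != source[k]),
        "orphaned": sorted(k for k in index if k not in source),
    }


def compensate(source: dict, index: dict, report: dict) -> int:
    """Repair the index from the source; return the number of fixes made."""
    for key in report["missing"] + report["stale"]:
        index[key] = source[key]  # re-dump the record from the source
    for key in report["orphaned"]:
        del index[key]  # the source no longer has this record
    return sum(len(keys) for keys in report.values())
```

A non-empty report is also the natural trigger for an alert, since it means the normal dump path has missed or mishandled some updates.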
Initial dump architecture diagram:
3. Platformization and Standardization
After the first version, several issues emerged:
Each new dump request required provisioning a new machine, wasting resources.
The dump development process was cumbersome and error‑prone.
Separate pipelines for full and incremental indexes prevented code reuse.
Version control was applied at the level of whole records, which affected the entire workflow.
To address these, the Dump Center platform “Ergate” (工蚁) was created, aiming for standardized, accurate, and efficient data movement.
Ergate architecture diagram:
Ergate’s new features:
Fast development: dump jobs can be created via a form in minutes.
Resource saving: a new dump service is added by submitting a MapReduce (MR) job, without provisioning new machines.
Data accuracy: standardized workflow ensures correct results when configured properly.
Easy extensibility: the modular design supports input sources such as MySQL, HDFS, VSS, and HBase. Output is currently limited to vsearch index files, with Elasticsearch support planned.
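One way to picture such a modular design is a registry of named readers and writers, so that a job is described only by which source and which sink it uses. This is a sketch under that assumption, not Ergate's actual design. The reader and writer below are stubs that do not touch MySQL or vsearch, and every name in the code is hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Source = Callable[[], Iterable[dict]]
Sink = Callable[[Iterable[dict]], int]

SOURCES: Dict[str, Source] = {}
SINKS: Dict[str, Sink] = {}
written: List[dict] = []  # stands in for an index file


def register(registry: dict, name: str):
    """Decorator that adds a reader or writer to a registry under a name."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap


@register(SOURCES, "mysql")
def read_mysql_stub() -> Iterable[dict]:
    # Stub: a real reader would query the sharded tables.
    return [{"id": 1}, {"id": 2}]


@register(SINKS, "vsearch")
def write_vsearch_stub(docs: Iterable[dict]) -> int:
    # Stub: a real writer would emit vsearch index files.
    batch = list(docs)
    written.extend(batch)
    return len(batch)


def run_job(source_name: str, sink_name: str) -> int:
    """Run one dump job by looking up its reader and writer by name."""
    return SINKS[sink_name](SOURCES[source_name]())
```

Under this layout, supporting a new output such as Elasticsearch would mean registering one more sink, with no change to `run_job` or to existing jobs.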
Because HBase natively stores a timestamp with every cell, which gives per-field versioning, Ergate solves the version-control issues without developers needing to manage versions manually.
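The effect of per-field timestamps can be simulated in plain Python. The sketch below does not use an HBase client; it only mimics the rule that, for each field, the value with the newest timestamp wins. The `put_fields` and `read_row` helpers are hypothetical, and ties are resolved in favour of the later write as a simplification.

```python
from typing import Dict, Tuple

Row = Dict[str, Tuple[int, object]]  # field name -> (timestamp, value)


def put_fields(row: Row, fields: dict, ts: int) -> None:
    """Write fields at timestamp ts; each field keeps its newest value."""
    for name, value in fields.items():
        current = row.get(name)
        if current is None or ts >= current[0]:
            row[name] = (ts, value)


def read_row(row: Row) -> dict:
    """Return the current view of the row without the timestamps."""
    return {name: value for name, (_, value) in row.items()}
```

Because each field is versioned on its own, a late-arriving update to one table's fields cannot overwrite newer values that another table contributed to the same index row.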
4. Future Outlook
Ergate is positioned as a universal dump solution platform. The first version is live, and future directions include:
Business: promote the platform across all Weidian teams as a unified solution for both batch (full) and streaming (incremental) data.
Technical: use Spark to accelerate full‑index computation and JStorm to improve real‑time performance and throughput of incremental indexing.
Weidian Tech Team
The Weidian Technology Platform is an open hub for consolidating technical knowledge. Guided by a spirit of sharing, we publish diverse tech insights and experiences to grow and look ahead together.