Evolution of JD VOP Message Warehouse: From V1.0 to V3.0+ with Database Sharding and Performance Optimization
This article details the architectural evolution of JD's VOP message warehouse, describing the challenges of massive data volumes, the transition from V1.0 to V3.0+ through database sharding, MongoDB adoption, traffic governance, stability improvements, and cost reduction strategies, while presenting performance metrics and future outlook.
Introduction
VOP, JD's enterprise API platform, aims to digitize procurement and improve cost efficiency. The message warehouse is a core component handling over 200 internal message sources and 80+ external APIs across product, order, logistics, and after-sale scenarios.
Client Call Scenario
Clients pull product change messages via the API, synchronize their local catalogs, and delete processed messages in a periodic loop.
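The pull → apply → delete cycle above can be sketched as follows. The client class and its method names are illustrative stand-ins, not the real VOP SDK; an in-memory client stands in for the remote API:

```python
class InMemoryClient:
    """Stand-in for a real API client: holds pending messages and a local catalog."""
    def __init__(self, pending):
        self.pending = list(pending)
        self.catalog = {}

    def pull_messages(self, limit):
        return self.pending[:limit]

    def apply_to_local_catalog(self, msg):
        self.catalog[msg["sku"]] = msg["state"]

    def delete_messages(self, ids):
        wanted = set(ids)
        self.pending = [m for m in self.pending if m["id"] not in wanted]


def sync_once(client, batch_size=100):
    """One pull → apply → delete cycle; real clients run this in a periodic loop."""
    msgs = client.pull_messages(limit=batch_size)
    for m in msgs:
        client.apply_to_local_catalog(m)   # update the local product catalog
    if msgs:
        client.delete_messages([m["id"] for m in msgs])  # ack by deletion
    return len(msgs)
```

Deleting only after successful local application gives at-least-once semantics: a crash mid-cycle means the batch is simply pulled again.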
Message Warehouse V1.0
The early architecture faced database bottlenecks under high read/write concurrency, leading to latency, limited TPS, and capacity constraints (billions of rows, >10 GB).
Read-write splitting reduced the load but introduced master-slave replication delay, limited slave capacity, and still struggled under high TPS.
Message Warehouse V2.0
Adopted database sharding and partitioning to overcome the limits of read-write splitting. Routing is driven by DUCC configuration and clientId hashes; reads consult the old and new databases sequentially, and ID thresholds determine write/delete targets.
Sharding eliminated the single-master bottleneck, enabled effectively unlimited horizontal scaling, and mitigated master-slave latency by using multi-master clusters.
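The routing described above can be sketched roughly as below. `ID_THRESHOLD`, `SHARD_COUNT`, and the CRC32 hash are invented placeholders for the actual DUCC-driven configuration; plain dicts stand in for database handles:

```python
from zlib import crc32   # stable across runs, unlike the built-in hash() for strings

ID_THRESHOLD = 1_000_000_000   # assumed cutover id: rows at/above it live in the new cluster
SHARD_COUNT = 8                # assumed shard count for the new cluster


def write_target(message_id, client_id):
    """Writes/deletes route by ID threshold (old vs new) plus a clientId hash for the shard."""
    if message_id < ID_THRESHOLD:
        return ("old", 0)
    return ("new", crc32(client_id.encode()) % SHARD_COUNT)


def read_message(message_id, client_id, old_db, new_shards):
    """Reads consult the old database first, then the matching new shard."""
    row = old_db.get(message_id)
    if row is not None:
        return row
    shard = crc32(client_id.encode()) % SHARD_COUNT
    return new_shards[shard].get(message_id)
```

Hashing on clientId keeps each client's messages on one shard, so a client's pull query never has to fan out across the cluster.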
Identified Pain Points
Massive data growth and uneven storage duration (2‑3 days → 7 days) causing traffic spikes.
Field expansion leading to large JSON payloads and schema‑change difficulty.
High‑availability and scalability constraints of the monolithic design.
High operational cost due to diverse client environments and lack of audit data.
Goal
Build a reusable, scalable enterprise message center with high availability, low cost, high throughput, and seamless migration without data loss.
Solution Analysis
Two storage options were evaluated: MySQL + Elasticsearch vs. MongoDB.
Storage cost: MongoDB's compression, and the absence of a redundant second copy, reduce total data size by more than 50% compared with MySQL + ES.
Development & ops cost: MongoDB eliminates data-sync and DDL risks, and scaling is dynamic and painless, whereas MySQL requires careful hash-consistent migrations.
Performance: In a 4C8G benchmark, MySQL and MongoDB showed similar write throughput; MongoDB achieved ~30k QPS on reads versus ~6k QPS for MySQL and ~800 QPS for ES.
Message Warehouse V3.0
Adopted a MongoDB sharded cluster as the primary store, complemented by Elasticsearch for audit queries. The architecture is divided into four stages:
Message Reception (vop-worker): Ingests ~100 internal sources, filters, cleans, and packages messages.
Message Transit (JMQ cluster): Prioritizes messages into four levels to protect high-priority traffic.
Message Write (vop-msg-store): Dual-writes to MongoDB (handling >5 × 10⁸ rows per day at 10k+ TPS) and to ES for auditability.
Message Visualization (vop-support-platform): Provides dashboards for operational insight and capacity planning.
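The four-level prioritization in the transit stage might look like the sketch below; the level semantics and topic names are assumptions, not the real JMQ topology:

```python
from enum import IntEnum


class Priority(IntEnum):
    """Four assumed priority levels; P0 is the most latency-sensitive."""
    P0 = 0   # e.g., order/payment-critical events
    P1 = 1
    P2 = 2
    P3 = 3   # bulk, low-urgency updates


def topic_for(priority):
    """One queue/topic per level keeps a bulk backlog from delaying P0 traffic
    (topic names are hypothetical)."""
    return f"vop_msg_p{int(priority)}"
```

Separate per-level topics mean consumers for high-priority traffic can be provisioned and throttled independently of bulk traffic.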
MongoDB sharding eliminates single points of failure via multiple mongos routers and replicated config servers.
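A minimal sketch of the dual-write step, using plain lists in place of real MongoDB/ES client handles. The design point it illustrates: MongoDB is the source of truth, and a failure on the audit path never fails the primary write:

```python
def store_message(msg, mongo_coll, es_index, audit_errors):
    """Dual-write: primary write to MongoDB, best-effort audit copy to ES."""
    mongo_coll.append(msg)          # primary path: a failure here fails the whole write
    try:
        es_index.append(msg)        # audit path: failures are recorded, not raised
    except Exception as exc:
        audit_errors.append(exc)    # a real system would log and retry asynchronously
```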
Results of V3.0+
Supports 5 × 10⁸ daily writes at 20k TPS, with 10k QPS reads.
TP99 latency improved from 100 ms to 40 ms.
Data retention extended from 7 days to 45 days.
Operational cost remained flat while performance scaled.
Visualization dramatically increased ops efficiency.
Traffic Governance
Reduced peak load by trimming billions of redundant messages, applying client-side caching, using a loading cache to short-circuit inactive customers, and adding deduplication filters with time-window control.
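A deduplication filter with time-window control, as mentioned above, might look like this sketch; the window length and the choice of key are assumptions:

```python
import time


class TimeWindowDeduper:
    """Drops repeat messages for the same key seen within `window_s` seconds."""

    def __init__(self, window_s=60):
        self.window_s = window_s
        self.last_seen = {}   # key -> timestamp of last *accepted* message

    def accept(self, key, now=None):
        """Return True if the message should pass, False if it is a duplicate."""
        now = time.monotonic() if now is None else now
        prev = self.last_seen.get(key)
        if prev is not None and now - prev < self.window_s:
            return False          # duplicate inside the window: drop it
        self.last_seen[key] = now
        return True
```

Keying by, say, SKU plus change type would collapse bursts of identical change notifications into one message per window.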
System Stability
Optimized code paths (set-based lookups, asynchronous processing, batch writes), introduced proactive degradation queues for hot-shard tenants, and tuned JMQ consumer threads.
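Two of the code-path optimizations named above, set-based lookups and batch writes, can be illustrated with this minimal sketch (function names and shapes are illustrative):

```python
def filter_active(messages, active_client_ids):
    """Set-based lookup: one O(1) in-memory membership check per message,
    instead of a point query against the database for each one."""
    active = set(active_client_ids)
    return [m for m in messages if m["client_id"] in active]


def batches(items, size):
    """Group rows so each database round-trip writes `size` rows instead of one."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```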
Cost Reduction
Implemented serverless auto-scaling based on message-receive thresholds and CPU usage, cutting resource cost by 52% during off-peak periods.
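A simplified sketch of such a scaling decision; the thresholds, hysteresis factors, and replica bounds are invented for illustration, not the production values behind the 52% saving:

```python
SCALE_UP_MSG_RATE = 10_000   # assumed messages/sec threshold
SCALE_UP_CPU = 0.70          # assumed CPU utilization threshold
MIN_REPLICAS, MAX_REPLICAS = 2, 32


def desired_replicas(current, msg_rate, cpu):
    """Scale up when either signal crosses its threshold; scale down only when
    both are well below (the gap provides hysteresis against flapping)."""
    if msg_rate > SCALE_UP_MSG_RATE or cpu > SCALE_UP_CPU:
        return min(current * 2, MAX_REPLICAS)
    if msg_rate < SCALE_UP_MSG_RATE * 0.3 and cpu < SCALE_UP_CPU * 0.5:
        return max(current // 2, MIN_REPLICAS)
    return current
```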
Conclusion
The message warehouse has matured through four major releases, handling major sales events with stable performance (e.g., Double 11 2022: 20k write QPS, 43k read QPS). Future work focuses on continued craftsmanship, data-driven improvements, further optimization of the message-data lifecycle, and standardizing push mechanisms for real-time bidirectional exchange.
Outlook
The team emphasizes long-term scalability, continuous performance monitoring, and innovative solutions to keep the architecture aligned with growing business demands.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.