Databases 14 min read

How JD.com Scales Billions of Product Reviews with MySQL, HBase, and Elasticsearch

This article explains how JD.com designs a multi‑layer data storage architecture—using MySQL for core metadata, HBase and MongoDB for comment text, Solr/Elasticsearch for indexing, and Redis for caching—to handle billions of daily review requests with high performance and availability.

dbaplus Community
dbaplus Community
dbaplus Community
How JD.com Scales Billions of Product Reviews with MySQL, HBase, and Elasticsearch

Overall Architecture

JD.com processes tens of billions of product reviews and generates billions of service calls per day. The storage subsystem is split into four layers, each tuned for a specific workload:

Basic data storage (MySQL)

Text storage (MongoDB → HBase)

Data indexing (Solr for the frontend, Elasticsearch for the backend)

Data caching (Redis)

Basic Data Storage (MySQL)

MySQL holds only the non‑textual metadata of a review: status, user ID, timestamps, image references, tags, likes, etc. To cope with the volume, tables are sharded by user_id. Image and tag tables are further split by product_id within the same database. Small auxiliary tables that receive low traffic remain in a single unsharded database.

Text Storage (MongoDB → HBase)

Full review bodies are stored in a NoSQL layer. The team first evaluated Cassandra (write‑heavy) and MongoDB (balanced read/write, distributed file storage). MongoDB was adopted for several years because of its flexibility. As data volume grew, HBase was introduced for its linear scalability and strong reliability. The review identifier becomes the HBase RowKey, enabling direct Get operations. Optional redundant columns can store additional fields to avoid a second lookup in MySQL.

Data Indexing

Reviews must be searchable by both user and product dimensions. Because MySQL shards by user_id, product‑centric queries cannot be satisfied directly. JD.com therefore adds two Lucene‑based search services:

Frontend search cluster – Solr (SolrCloud) shards data by product_id and provides faceted statistics (e.g., count of 1‑star to 5‑star reviews) in milliseconds, eliminating costly GROUP BY in MySQL.

Backend search cluster – Elasticsearch handles keyword highlighting, multi‑field fuzzy queries, and full‑field retrieval.

Index design distinguishes stored fields (e.g., the primary key review_id) from indexed‑only fields . Stored fields are kept in the index document so the original value can be returned without a database round‑trip; indexed‑only fields occupy less space and are used solely for search. For large text fields, only indexing (no storing) is recommended.

Typical faceted query (Solr syntax) used by the frontend:

q=*:*&fq=product_id:12345&facet=true&facet.field=rating&facet.mincount=1

This single request returns the count of each rating value for the specified product.

Data Caching (Redis)

Redis is deployed in multiple clusters, each dedicated to a specific business use case and replicated across data centers with master‑slave topology. Common caching patterns:

When a new comment is posted, prepend it to the cached first‑page list so the front‑page reflects the update instantly.

Persist comment counts in Redis; updates are applied asynchronously, allowing the front‑end to serve counts without hitting MySQL.

Cache the first page of comment listings; asynchronous jobs refresh the cache when underlying data changes.

Pre‑warm cache for flash‑sale items and compress keys/text to reduce memory footprint.

Disaster Recovery & High Availability

All layers run in multi‑datacenter master‑slave configurations:

MySQL : each shard is replicated to a secondary site. Read‑write separation directs reads to the local slave, while writes go to the master. Automatic failover promotes a slave when the master is unavailable.

HBase : deployed on JD.com’s public cluster with active‑standby across two sites. If the primary cluster fails, writes fall back to Redis and MongoDB; a background job later synchronises the data back to HBase.

Search clusters : at least three replica nodes are spread across data centers. The primary node handles index updates; any replica can be promoted to primary on failure.

Redis : master handles writes, slaves serve reads. Failure of the master triggers a promotion of a nearby slave and re‑establishment of the replication topology.

This design ensures continuous service despite node‑ or site‑level outages.

Key Takeaways

The architecture evolves pragmatically to address concrete pain points: massive data volume, low‑latency reads, and high availability. No single storage technology is universally optimal; the choice is driven by current workload characteristics and must be revisited as business requirements change.

Illustrative Diagrams

Architecture diagram
Architecture diagram
Multi‑datacenter deployment
Multi‑datacenter deployment
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ElasticsearchredismysqlHBaseComment SystemJD.com
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.