How Youzan Built a Scalable E‑Commerce Search Engine with Elasticsearch and Hadoop

This article details Youzan's practical architecture, indexing pipeline, advanced search module, and performance‑tuning techniques for building a commercial e‑commerce search engine using Elasticsearch, Hadoop, Kafka, and custom backend integrations.

21CTO
21CTO
21CTO
How Youzan Built a Scalable E‑Commerce Search Engine with Elasticsearch and Hadoop

In this article the Youzan technical team shares the practical architecture and solutions they used to build a commercial e‑commerce search engine based on Elasticsearch and Hadoop.

1. Technical Architecture

Youzan’s search engine runs on a distributed real‑time Elasticsearch cluster built on the mature Lucene index library, offering multi‑tenant, high‑availability, horizontal scalability, automatic fault‑tolerance and auto‑scaling. The team also implemented seamless integration with MySQL and Hadoop and developed an advanced search module that provides a flexible relevance‑calculation framework.

2. Index Construction

Real‑time indexing is achieved through a queue‑oriented pipeline: data are first written to a DB or file, then synchronized to a Kafka topic (using tools such as MySQL‑replication, MyPipe or Alibaba Canal). Elasticsearch subscribes to the relevant topics to build indexes instantly. For file sources, Flume streams data into Kafka.

Full‑volume indexing is required in three typical scenarios: occasional data loss from real‑time updates, schema or analyzer changes, and cold‑start after a long‑running system. The team uses a Hadoop‑ES solution that leverages Hadoop’s distributed capabilities to create full indexes transparently.

Example Hive SQL that creates an external Elasticsearch index:

drop table search.goods_index;
CREATE EXTERNAL TABLE search.goods_index (
    is_virtual int,
    created_time string,
    update_time string,
    title string,
    tag_ids array<int>
) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES (
    'es.batch.size.bytes'='1mb',
    'es.batch.size.entries'='0',
    'es.batch.write.refresh'='false',
    'es.batch.write.retry.count'='3',
    'es.mapping.id'='id',
    'es.write.operation'='index',
    'es.nodes'='192.168.1.10:9200',
    'es.resource'='goods/goods');

The system maps Elasticsearch as a Hive external table, making index updates as simple as inserting rows into a Hive table.

Full‑volume indexing directly from databases or file systems is discouraged because it puts heavy load on production systems and lacks true horizontal scalability.

Kafka is chosen as the messaging backbone because of its high throughput and log‑compaction feature, which retains only the latest message per key, effectively providing a snapshot of the database and enabling real‑time incremental updates.

3. Advanced Search (AS) Module

The Advanced Search module extends Elasticsearch with proxy, plugin‑based relevance calculation, and a rich library of query‑rewriting, ranking and filtering algorithms. Business‑specific plugins implement custom rerank logic while reusing shared algorithm libraries, achieving strong extensibility and reuse.

AS also offers a query‑level cache and a buffering queue to prevent cascade failures during traffic spikes.

4. Elasticsearch Performance Optimizations

4.1 Application‑Level Queue to Prevent Snowball Effect

During peak traffic (e.g., flash sales), Elasticsearch thread pools can become saturated. The AS layer introduces per‑application queues (default length 50) that monitor average response time; if it exceeds 200 ms the queue pauses for 1 s, logs overflow, and triggers alerts after repeated slow responses.

4.2 Automatic Degradation

If a query becomes too slow, the system switches to a simplified “degraded” query that avoids costly relevance scoring, allowing the service to remain available without scaling.

4.3 Effective Use of Filtered Queries

Filters operate on bitsets in memory, which are orders of magnitude faster than full inverted‑index scoring. The article recommends using filtered queries whenever possible, reserving full queries only for scoring‑required cases.

query:
  filtered:
    query:
      title: "apple"
    filter:
      tag: "mac"
      region: "beijing"

4.4 Additional Optimizations

Disable automatic shard rebalancing in production and schedule manual rebalancing during off‑peak hours.

Increase the index refresh interval when real‑time freshness is not critical.

Remove the default _all field if it is not needed, cutting index size by roughly 50 %.

Physically separate slow‑query indexes from fast ones.

5. Summary

The article outlines Youzan’s search engine architecture, its Hadoop‑based full‑volume indexing pipeline, the extensible Advanced Search module, and several practical performance‑tuning techniques, providing general guidance for building a commercial e‑commerce search service.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

indexingsearch engineElasticsearch
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.