Scaling Search for 11.11: Distributed Engine, Smart Routing & Auto‑Scaling
This article explains how a major e‑commerce platform built a horizontally scalable distributed search engine, designed efficient sharding and routing strategies, and implemented automated deployment, rapid scaling, and real‑time monitoring to handle the massive traffic of the 11.11 shopping festival.
“11.11” is the annual e‑commerce mega‑sale, demanding massive traffic handling and rapid response. The search team at the platform built a distributed search engine to meet two core requirements: scalability and fast response.
Distributed Search Engine
The engine is built on Lucene/Solr and a custom SOA framework (Hedwig). It provides request distribution, result merging, and a management console covering multi‑index management, cluster management, full‑index switching, and real‑time updates.
Instead of using off‑the‑shelf SolrCloud or Elasticsearch, the team chose a custom solution for three reasons:
Elasticsearch and SolrCloud treat the engine as a black box, whereas the platform needs complex query logic, custom plugins, and deep performance tuning.
Integrating those products with internal release, monitoring, and SOA systems would be time‑consuming.
By selectively incorporating Lucene/Solr components, the team reduces maintenance risk and gains deeper understanding for future customization.
Distributed search solves index growth and latency by splitting a single index into multiple shards, each small enough to be deployed on separate servers, enabling fast query responses.
The architecture consists of two main components: the Shard Searcher (similar to a single‑node engine) and the Broker (which splits a global query into shard‑level queries, dispatches them, and merges the results). Broker resource usage depends heavily on the number of shards accessed and the number of results each shard returns; techniques such as limiting the shards touched per query or using a deep‑paging mechanism can relieve memory and CPU pressure.
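The Broker's scatter‑gather step can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `broker_search` function, the `(score, doc_id)` hit shape, and the shard callables are all assumptions made for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]  # (score, doc_id); hypothetical hit shape
ShardSearcher = Callable[[str, int], List[Hit]]  # (query, top_n) -> hits


def broker_search(query: str, shards: Dict[str, ShardSearcher],
                  offset: int = 0, limit: int = 10) -> List[Hit]:
    """Fan a query out to every given shard and merge the per-shard top hits."""
    if not shards:
        return []
    # Every shard must return its own top (offset + limit) hits, so the Broker
    # holds up to len(shards) * (offset + limit) hits while merging. This is
    # why both the shard count and deep pages drive memory and CPU usage.
    per_shard = offset + limit
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(search, query, per_shard)
                   for search in shards.values()]
        partials = [f.result() for f in futures]
    merged = heapq.nlargest(per_shard,
                            (hit for part in partials for hit in part))
    return merged[offset:offset + limit]
```

Passing a smaller `shards` mapping, as chosen by a routing layer, is what reduces the fan‑out cost.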
Deployment options for Broker and Shard Searcher include:
Broker and Shard Searcher co‑located on every node (A).
Broker as a separate cluster, isolated from Shard Searchers (B).
Broker embedded in the client application alongside the search logic (C).
The team initially used option A and later switched to C to reduce network calls and leverage client‑side routing logic.
Efficient Routing
Effective routing requires understanding business scenarios. For example, books are long‑tail items with many SKUs but low traffic, while fast‑moving consumer goods have few SKUs but high traffic. Simple modulo‑based sharding ignores these differences and scatters every category across all shards, so each query would have to touch every shard; it is therefore unsuitable.
Routing strategy:
Shard by top‑level category.
If a shard becomes too large, further split by second‑level category.
If shards become too small, merge based on relevance.
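The three rules above can be sketched roughly as follows. The `plan_shards` function, the size thresholds, and the input layout are assumptions for illustration. In particular, the source merges small shards "based on relevance" without defining it, so the greedy in‑order merge below is only a stand‑in for that step.

```python
from typing import Dict, List, Tuple

CategoryKey = Tuple[str, ...]  # ("books",) or ("books", "fiction")


def plan_shards(sizes_mb: Dict[str, Dict[str, float]],
                min_mb: float = 200.0,
                max_mb: float = 500.0) -> List[List[CategoryKey]]:
    """Group categories into shards; each shard is a list of category keys.

    sizes_mb maps top-level category -> second-level category -> index size.
    """
    shards: List[List[CategoryKey]] = []
    small: List[Tuple[CategoryKey, float]] = []
    for top, children in sizes_mb.items():
        total = sum(children.values())
        if total > max_mb:
            # Rule 2: too large, so split by second-level category.
            for child, size in children.items():
                if size < min_mb:
                    small.append(((top, child), size))
                else:
                    shards.append([(top, child)])
        elif total < min_mb:
            # Rule 3 candidate: too small to stand alone.
            small.append(((top,), total))
        else:
            # Rule 1: one shard per top-level category.
            shards.append([(top,)])
    # Stand-in for the relevance-based merge: pack undersized pieces
    # greedily, in input order, without exceeding max_mb.
    current: List[CategoryKey] = []
    current_mb = 0.0
    for key, size in small:
        if current and current_mb + size > max_mb:
            shards.append(current)
            current, current_mb = [], 0.0
        current.append(key)
        current_mb += size
    if current:
        shards.append(current)
    return shards
```

A second‑level category that is itself larger than `max_mb` stays in one shard here; a real planner would need a further split rule for that case.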
Resulting shard sizes are typically 200–500 MB, with per‑shard query latency under 10 ms.
Two routing scenarios:
Category navigation: route directly using category ID, caching the routing table in Broker memory.
Search‑term queries: hot terms (top 20% of traffic) get pre‑computed routing tables; cold terms are routed on first search and then cached.
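A minimal sketch of the search‑term case, assuming a precomputed table for hot terms and a bounded cache for cold ones. The `TermRouter` class, the `discover` callback (which stands for a first search that reports which shards returned hits), and the LRU eviction policy are all hypothetical; the source only says that cold terms are cached after the first search.

```python
from collections import OrderedDict
from typing import Callable, Dict, FrozenSet, Iterable

ShardSet = FrozenSet[str]


class TermRouter:
    """Route search terms to shards: hot terms from a table, cold terms cached."""

    def __init__(self, hot_table: Dict[str, Iterable[str]],
                 discover: Callable[[str], Iterable[str]],
                 cache_size: int = 10000) -> None:
        self._hot = {term: frozenset(s) for term, s in hot_table.items()}
        self._discover = discover  # hypothetical: runs the first, unrouted search
        self._cold: "OrderedDict[str, ShardSet]" = OrderedDict()
        self._cache_size = cache_size

    def route(self, term: str) -> ShardSet:
        if term in self._hot:
            return self._hot[term]           # precomputed offline
        if term in self._cold:
            self._cold.move_to_end(term)     # mark as recently used
            return self._cold[term]
        shards = frozenset(self._discover(term))
        self._cold[term] = shards
        if len(self._cold) > self._cache_size:
            self._cold.popitem(last=False)   # evict least recently used
        return shards
```

Cached routes go stale when the index changes, so a real router would also need an invalidation or expiry rule, which the source does not describe.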
A 360‑Routing rule was added to prioritize shards that contributed to the top five result pages (covering >80% of traffic), further reducing the number of shards accessed per query.
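One way to read this rule: for a given query, record which shards supplied the documents on the first five result pages, and route later requests for that query to those shards only. The sketch below assumes that reading; the function name, the page size of 10, and the `(doc_id, shard_id)` input shape are not from the source.

```python
from typing import List, Set, Tuple


def shards_for_top_pages(ranked_hits: List[Tuple[str, str]],
                         page_size: int = 10,
                         pages: int = 5) -> Set[str]:
    """Return the shards that contributed to the first `pages` result pages.

    ranked_hits holds (doc_id, shard_id) pairs in final, merged rank order.
    """
    cutoff = page_size * pages
    return {shard_id for _, shard_id in ranked_hits[:cutoff]}
```

Requests beyond the fifth page would still need the full shard set, so a Broker using this rule has to fall back to normal routing for deeper pages.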
Automatic Deployment and Fast Scaling
Sharding and routing introduce uneven shard sizes and traffic, requiring sophisticated deployment planning:
Determine replica count based on traffic and high‑availability needs.
Distribute replicas so that no node holds multiple replicas of the same shard and traffic load matches node capacity.
Balance total index size across nodes.
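The first step, sizing replicas from traffic, might look like the sketch below. It assumes each replica can serve a fixed query rate (`qps_per_replica`) and that high availability sets a floor (`min_replicas`); both parameters and the function name are illustrative, not the platform's actual formula.

```python
import math
from typing import Dict


def replica_counts(shard_qps: Dict[str, float],
                   qps_per_replica: float,
                   min_replicas: int = 2) -> Dict[str, int]:
    """Replicas per shard: enough for its traffic, never below the HA floor."""
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    return {
        shard: max(min_replicas, math.ceil(qps / qps_per_replica))
        for shard, qps in shard_qps.items()
    }
```

With a uniform per‑replica capacity, the counts for busy shards scale with each shard's share of total traffic, while low‑traffic shards such as long‑tail categories sit at the availability floor.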
[Figure: replica calculation based on traffic proportion]
[Figure: system layout after deployment]
A packing algorithm automates the generation of deployment plans and handles node addition, removal, and replacement. During the 11.11 event, the tool enables rapid scaling: the Index Server computes a new plan and writes it to ZooKeeper; each Shard Searcher watches for changes, updates its local configuration, and fetches the updated index data from HDFS before resuming service.
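The source does not specify the packing algorithm. A simple greedy heuristic that satisfies two of the stated constraints (no node holds two replicas of the same shard, and total index size stays balanced) is sketched below; it ignores traffic‑versus‑capacity matching, and all names are hypothetical.

```python
from typing import Dict, List


def pack_replicas(shard_size_mb: Dict[str, float],
                  replicas: Dict[str, int],
                  nodes: List[str]) -> Dict[str, List[str]]:
    """Assign shard replicas to nodes, largest shards first."""
    plan: Dict[str, List[str]] = {node: [] for node in nodes}
    load: Dict[str, float] = {node: 0.0 for node in nodes}
    # Placing big shards first leaves small ones to even out the totals.
    for shard in sorted(shard_size_mb, key=lambda s: (-shard_size_mb[s], s)):
        count = replicas[shard]
        if count > len(nodes):
            raise ValueError(f"{shard}: {count} replicas need {count} nodes")
        # Pick the `count` least-loaded nodes. They are distinct, so no node
        # receives two replicas of the same shard.
        targets = sorted(nodes, key=lambda n: (load[n], n))[:count]
        for node in targets:
            plan[node].append(shard)
            load[node] += shard_size_mb[shard]
    return plan
```

Recomputing the plan with a changed `nodes` list covers node addition, removal, or replacement, although this naive version may move more replicas than necessary; a production planner would likely try to minimize data movement.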
Real‑time Monitoring System
During the promotion, minute‑level monitoring tracks both index preparation (data counts, processing exceptions, result consistency) and search service health (result drift alerts, stale product visibility). Automated alerts trigger index updates to demote sold‑out items.
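As an illustration of a result‑drift check, the sketch below compares per‑query result counts from the current minute against a baseline and flags queries whose count moved by more than a tolerance. The probe‑query approach, the 20% tolerance, and the function name are assumptions; the source does not describe how drift is measured.

```python
from typing import Dict, List


def drift_alerts(baseline: Dict[str, int], current: Dict[str, int],
                 tolerance: float = 0.2) -> List[str]:
    """Return the probe queries whose result count drifted beyond tolerance."""
    alerts: List[str] = []
    for query, expected in baseline.items():
        observed = current.get(query, 0)  # a missing query counts as 0 results
        if expected == 0:
            drifted = observed != 0
        else:
            drifted = abs(observed - expected) / expected > tolerance
        if drifted:
            alerts.append(query)
    return sorted(alerts)
```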
Conclusion
Each year the 11.11 event serves as a stress test; by building a distributed search engine, designing business‑aware routing, automating deployment and scaling, and establishing a comprehensive validation and monitoring pipeline, the platform confidently handles massive traffic spikes and ensures a smooth shopping experience.
21CTO
