
Architecture and Techniques of an E‑commerce Search Engine

The article explains the overall architecture of an e‑commerce search engine, covering indexing, static scoring, retrieval, title and store deduplication, query analysis and rewriting, and related big‑data and AI techniques used to improve relevance and diversity of search results.

Architect

Building on a previous engineering overview, the e‑commerce search engine consists of three main components: a Hadoop cluster for large‑scale and real‑time indexing, an Elasticsearch cluster providing distributed search, and an advanced search cluster that adds commercial features.

The indexing pipeline turns raw product data into an inverted index in four steps: (1) compute each document's static score (Tscore), a query-independent quality signal analogous to PageRank; (2) pre-calculate pairwise product similarity; (3) shard the data based on that similarity; and (4) build the Elasticsearch index.

At retrieval time, the engine runs six steps: (1) query analysis, (2) query rewrite, (3) Elasticsearch search, (4) combined ranking, (5) post-ranking, and (6) returning results. Combined ranking merges the dynamic, query-dependent relevance (Dscore) with the static importance (Tscore) using the formula Score = Dscore * Tscore.
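The combined-ranking step can be sketched in a few lines of Python. This is only an illustration: the Hit class, its field names, and the sample numbers are hypothetical, and it assumes both scores are non-negative so that multiplying them preserves their ordering intent.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    dscore: float  # dynamic relevance returned by the search backend (assumed >= 0)
    tscore: float  # precomputed static score (assumed >= 0)


def combined_rank(hits):
    """Sort hits by Score = Dscore * Tscore, highest first."""
    return sorted(hits, key=lambda h: h.dscore * h.tscore, reverse=True)


# Hypothetical sample data.
hits = [Hit("a", 2.0, 0.5), Hit("b", 1.5, 0.9), Hit("c", 3.0, 0.2)]
ranked = [h.doc_id for h in combined_rank(hits)]
```

Because the score is a product, a highly relevant document with a very low static score can still drop below a moderately relevant but well-rated one, which is the trade-off the formula encodes.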

Static product scoring uses three factors (order count, positive rating, and shipping speed), combined as Tscore = a * f(order) + b * g(rating) + c * h(speed). A log transformation and z-score normalization are applied so that the metrics are on a comparable scale before they are weighted, and the weights a, b, and c are tuned either by expert judgment or through A/B testing.
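A minimal sketch of this computation follows, using only the Python standard library. Several details are assumptions rather than the article's method: the log transform is applied to order counts only, every metric is oriented so that larger is better, and the default weights are placeholders.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values equal: no signal
    return [(v - mu) / sigma for v in values]


def static_scores(orders, ratings, speeds, a=0.5, b=0.3, c=0.2):
    """Tscore = a*f(order) + b*g(rating) + c*h(speed) for each product.

    The weights a, b, c are hypothetical defaults, not tuned values.
    """
    f = zscores([math.log1p(o) for o in orders])  # log damps heavy-tailed counts
    g = zscores(ratings)
    h = zscores(speeds)  # assumed: larger value means faster shipping
    return [a * fi + b * gi + c * hi for fi, gi, hi in zip(f, g, h)]


# Hypothetical metrics for three products.
scores = static_scores([10, 1000, 100], [0.90, 0.95, 0.80], [1.0, 2.0, 3.0])
```

Note that z-scores are centered on zero, so a Tscore computed this way can be negative. It would need to be shifted or rescaled to a positive range before being multiplied with Dscore.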

Title deduplication represents each title as a bag-of-words vector and compares titles by cosine similarity (equivalently, a distance of 1 - cosine). Two scalable approaches are presented: (1) Spark matrix operations, loading the vectors with rddRows = sc.parallelize([...]) and calling mat.columnSimilarities(), which computes cosine similarity between the columns of the matrix; and (2) a linear MapReduce method that builds an inverted index, generates candidate pairs from titles that share a term, and scores each pair as the number of shared terms divided by the square root of the product of the title lengths, e.g. 2/(len(doc1)*len(doc2))^0.5 = 0.7. Titles whose similarity exceeds a threshold are treated as duplicates, and the higher-scoring product becomes the primary document.
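The inverted-index approach can be illustrated on a single machine with the sketch below. It is not the MapReduce job itself. It assumes whitespace tokenization (Chinese titles would need a word segmenter) and binary term vectors, and the sample titles are made up.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def tokenize(title):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return set(title.lower().split())


def similar_pairs(titles, threshold=0.7):
    """Return {(id_a, id_b): cosine} for title pairs above the threshold."""
    docs = {doc_id: tokenize(text) for doc_id, text in titles.items()}

    # Inverted index: term -> ids of the titles that contain it.
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].append(doc_id)

    # Only titles sharing a posting list become candidate pairs;
    # each shared term adds 1 to the pair's dot product.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    result = {}
    for (a, b), n in shared.items():
        sim = n / sqrt(len(docs[a]) * len(docs[b]))  # cosine for binary vectors
        if sim > threshold:
            result[(a, b)] = sim
    return result


# Hypothetical titles.
titles = {
    "p1": "red cotton shirt",
    "p2": "red cotton shirt men",
    "p3": "blue denim jeans",
}
pairs = similar_pairs(titles)
```

Pairs that share no term never appear as candidates, which is what keeps the work far below a full all-pairs comparison on sparse title data.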

Store deduplication has a different goal: instead of removing near-identical items, it prevents a single store from dominating the results. A bucket-based strategy partitions the search results into multiple buckets (e.g., four buckets for 20 results per page) so that all of a store's products fall into a single bucket, ensuring balanced exposure across stores.
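One way to read the bucket strategy is sketched below. The details are assumptions: stores are assigned to buckets by a stable hash of the store id, and each bucket contributes an equal quota of the page (5 of 20 results with four buckets). The function names and the tuple layout are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_of(store_id, num_buckets):
    """Stable bucket assignment, so a store always maps to the same bucket."""
    return zlib.crc32(store_id.encode("utf-8")) % num_buckets


def bucketed_page(hits, page_size=20, num_buckets=4):
    """Build one result page from hits given as (doc_id, store_id, score)."""
    quota = page_size // num_buckets  # results each bucket may contribute
    buckets = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        b = bucket_of(hit[1], num_buckets)
        if len(buckets[b]) < quota:
            buckets[b].append(hit)
    page = [hit for bucket in buckets.values() for hit in bucket]
    return sorted(page, key=lambda h: h[2], reverse=True)


# Hypothetical data: one large store with the 30 best-scoring products,
# plus 30 small stores with one product each.
hits = [("a%d" % i, "big_store", 100 - i) for i in range(30)]
hits += [("b%d" % i, "store_%d" % i, 50 - i) for i in range(30)]
page = bucketed_page(hits)
big_store_count = sum(1 for h in page if h[1] == "big_store")
```

Under these assumptions the large store can fill at most its own bucket's quota, even though its products outscore everything else.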

Query analysis and rewriting include core-word detection, synonym expansion, and brand recognition. Synonym expansion builds a weighted graph from user session logs (e.g., "苹果手机" (Apple phone) ↔ "iphone" with weight 0.8). Elasticsearch query boosting is then used to weight the original query above its synonyms, here as per-clause boost values inside a bool query, as shown in the JSON example:

{
  "query": {
    "bool": {
      "should": [
        { "match": { "content": { "query": "苹果手机", "boost": 10 } } },
        { "match": { "content": { "query": "iphone", "boost": 8 } } },
        { "match": { "content": { "query": "iphone6", "boost": 5 } } }
      ]
    }
  }
}
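A request body of this shape could be generated from the synonym graph roughly as follows. The sketch makes several assumptions: the graph is held as a nested dict, each synonym's boost is the base boost scaled by its edge weight, and the should clauses are wrapped in a bool query, which is where Elasticsearch expects them. The function name is hypothetical, and the 0.5 weight for "iphone6" is invented for illustration.

```python
def build_synonym_query(query, synonyms, field="content", base_boost=10.0):
    """Build a bool/should search body: original query plus weighted synonyms."""
    should = [{"match": {field: {"query": query, "boost": base_boost}}}]
    weighted = sorted(synonyms.get(query, {}).items(),
                      key=lambda kv: kv[1], reverse=True)
    for term, weight in weighted:
        # Assumed rule: synonym boost = base boost * edge weight.
        should.append(
            {"match": {field: {"query": term, "boost": base_boost * weight}}}
        )
    return {"query": {"bool": {"should": should}}}


# Hypothetical synonym graph: query -> {synonym: weight}.
graph = {"苹果手机": {"iphone": 0.8, "iphone6": 0.5}}
body = build_synonym_query("苹果手机", graph)
```

A query with no entry in the graph falls back to a single match clause on the original text.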

Additional advanced techniques such as category taxonomy construction (using machine learning) and personalization via user profiling are mentioned as ongoing work.

In summary, the article demonstrates how a combination of big‑data processing, AI‑driven relevance scoring, deduplication strategies, and query rewriting can create an effective, scalable e‑commerce search solution.
