From Keyword Matching to Semantic Understanding: Building an Intelligent E‑Commerce Search Engine
The article analyzes the semantic gap in e‑commerce search, compares traditional keyword matching with vector‑based retrieval, and provides a step‑by‑step implementation using Elasticsearch/Easysearch pipelines, embedding models, and a hybrid search strategy to improve user intent understanding.
1. Problem Origin
While optimizing e-commerce search, we found that a query like "high cost-performance phone" returns only items whose titles literally contain the term "性价比" (cost-performance, i.e. value for money), missing many relevant products. Similarly, a query such as "gift for girlfriend" yields no results at all. The core issue is the semantic gap between the many ways users express a need and the wording of product descriptions.
2. Problem Analysis
Traditional search relies on inverted indexes and relevance scoring (TF-IDF, BM25). It works for exact queries such as "iPhone 15 Pro 256G" but fails for intent-driven queries like "affordable headphones for students" or "smartphone for seniors". User-behavior analysis shows three patterns: repeated attempts after an empty result, frequent use of vague adjectives (e.g., "好用的", "easy to use", and "性价比高的", "good value for money"), and a shift toward natural-language queries.
Technically, keyword matching is strict, lacks semantic understanding, and cannot handle context; synonym expansion is only a temporary fix.
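To make the limitation concrete, here is a minimal sketch of Okapi BM25 scoring over a toy, pre-tokenized corpus. The documents, tokens, and the `bm25_scores` helper are hypothetical illustrations, not the engine's actual scorer. It shows that a query sharing no terms with a document scores exactly zero, however relevant the document is.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical): each document is already split into tokens.
docs = [
    ["iphone", "15", "pro", "256g"],
    ["budget", "wireless", "earbuds"],
    ["large", "button", "phone", "loud", "speaker"],
]
exact = bm25_scores(["iphone", "15", "pro"], docs)
intent = bm25_scores(["smartphone", "for", "seniors"], docs)
print(exact)   # only the first document gets a positive score
print(intent)  # every score is 0.0: no query term occurs in any document
```

The third document is arguably a good answer to "smartphone for seniors", but without a shared token BM25 cannot see it.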
3. Solution Exploration
After research, a vector‑based semantic search was chosen. Three architectural options were compared:
(1) Standalone vector database (Milvus/Pinecone): purpose-built, but costly to operate and maintain and hard to integrate with ES.
(2) Dedicated vector search service: requires data migration and risks vendor lock-in.
(3) ES 8.x built-in vector search or Easysearch's new vector feature: lowest migration cost, mature algorithms, and easy to combine with keyword search.
The third option was selected for its compatibility with the existing Easysearch stack.
4. Practical Implementation
4.0 Choose External Embedding Model
Three providers were tested: OpenAI (strong on English, average on Chinese), Baidu Wenxin (strong on Chinese but expensive), and Alibaba Cloud text-embedding-v4, which was selected for its strong Chinese performance, stable API, and controllable cost.
The architecture uses two pipelines: an ingest pipeline generates vectors at write time, and a search pipeline vectorizes the query at search time. Business code remains unchanged.
4.1 Create Ingest Pipeline
PUT _ingest/pipeline/product-embedding-aliyun
{
  "description": "Product text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "url": "https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",
        "vendor": "openai",
        "api_key": "sk-XXXXXXXXXXXXXXXXX",
        "text_field": "product_description",
        "vector_field": "product_vector",
        "model_id": "text-embedding-v4",
        "dims": 256,
        "batch_size": 10
      }
    }
  ]
}
Explanation: text_field points to the product description, vector_field stores the resulting 256-dimensional vector, and a batch_size of 10 improves throughput by embedding several documents per request.
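The batching idea can be sketched outside the pipeline as follows. This is only an illustration under assumptions: the helper names are hypothetical, the request body follows the OpenAI-style embeddings format (`model`, `input`, `dimensions`) that the compatible-mode URL suggests, and nothing here is claimed to be what the `text_embedding` processor sends internally. No request is actually issued.

```python
import json

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embedding_request_bodies(texts, model="text-embedding-v4", dims=256, batch_size=10):
    """Build one OpenAI-style embeddings request body per batch (nothing is sent)."""
    return [
        # "dimensions" is the OpenAI-format field name; assumed, not confirmed by the article.
        {"model": model, "input": chunk, "dimensions": dims}
        for chunk in batches(texts, batch_size)
    ]

# 23 hypothetical product descriptions become three requests instead of 23.
descriptions = [f"product description {i}" for i in range(23)]
bodies = embedding_request_bodies(descriptions)
print(len(bodies), [len(b["input"]) for b in bodies])  # 3 [10, 10, 3]
print(json.dumps(bodies[-1], ensure_ascii=False))
```

Grouping texts this way trades a little latency per document for far fewer round trips to the embedding API during bulk indexing.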
4.2 Create Index and Mapping
PUT /ecommerce-products
{
  "mappings": {
    "properties": {
      "product_id": {"type": "keyword"},
      "product_name": {"type": "text", "analyzer": "ik_max_word"},
      "product_description": {"type": "text", "analyzer": "ik_max_word"},
      "category": {"type": "keyword"},
      "price": {"type": "double"},
      "brand": {"type": "keyword"},
      "product_vector": {
        "type": "knn_dense_float_vector",
        "knn": {
          "dims": 256,
          "model": "lsh",
          "similarity": "cosine",
          "L": 99,
          "k": 1
        }
      }
    }
  }
}
Key points: product_vector is a dense float vector field whose dims (256) matches the ingest pipeline; the LSH algorithm with L=99 and k=1 gave the best performance in internal tests.
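For intuition about the LSH parameters, here is a minimal random-hyperplane LSH sketch for cosine similarity. It assumes the conventional reading of the parameters (L hash tables, each combining k hyperplane sign bits); it is not Easysearch's implementation, and all function names are hypothetical.

```python
import random

def make_tables(dims, L, k, seed=0):
    """Create L hash tables, each defined by k random hyperplanes."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(k)]
        for _ in range(L)
    ]

def lsh_hashes(vector, tables):
    """Return one hash per table: the sign pattern of the vector against k hyperplanes."""
    hashes = []
    for planes in tables:
        bits = tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0.0 for plane in planes
        )
        hashes.append(bits)
    return hashes

def shared_hashes(a, b, tables):
    """Count the tables in which both vectors fall into the same bucket."""
    return sum(x == y for x, y in zip(lsh_hashes(a, tables), lsh_hashes(b, tables)))

tables = make_tables(dims=256, L=99, k=1)
rng = random.Random(42)
base = [rng.gauss(0.0, 1.0) for _ in range(256)]
near = [x + 0.05 * rng.gauss(0.0, 1.0) for x in base]  # small perturbation of base
opposite = [-x for x in base]                          # cosine similarity of -1

print(shared_hashes(base, base, tables))      # identical vectors agree in all 99 tables
print(shared_hashes(base, near, tables))      # nearby vectors agree in most tables
print(shared_hashes(base, opposite, tables))  # opposite vectors agree in none
```

Vectors separated by a small angle land in the same bucket in most tables, so counting shared buckets is a cheap way to shortlist candidates before exact cosine scoring. Under this reading, a small k keeps buckets coarse while a large L gives many independent chances to match.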
4.3 Bulk Import Data
POST /_bulk?pipeline=product-embedding-aliyun&refresh=wait_for
{ "index": {"_index":"ecommerce-products","_id":"1"} }
{ "product_name":"Apple iPhone 15 Pro","product_description":"Apple iPhone 15 Pro 256GB ...","category":"手机","price":8999,"brand":"Apple" }
{ "index": {"_index":"ecommerce-products","_id":"2"} }
{ "product_name":"华为MateBook X Pro","product_description":"华为MateBook X Pro 2024款 ...","category":"笔记本","price":7999,"brand":"华为" }
{ "index": {"_index":"ecommerce-products","_id":"3"} }
{ "product_name":"小米米家电动牙刷","product_description":"小米米家电动牙刷 ...","category":"个护健康","price":199,"brand":"小米" }
4.4 Configure Search Pipeline
PUT /_search/pipeline/semantic_search_aliyun
{
  "request_processors": [
    {
      "semantic_query_enricher": {
        "tag": "product_semantic_search",
        "description": "Query vectorization processor for product semantic search",
        "url": "https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",
        "vendor": "openai",
        "api_key": "sk-XXXXXXXXXXXXXX",
        "default_model_id": "text-embedding-v4",
        "vector_field_model_id": {"product_vector": "text-embedding-v4"}
      }
    }
  ]
}
Then set it as the index's default search pipeline:
PUT /ecommerce-products/_settings
{
  "index.search.default_pipeline": "semantic_search_aliyun"
}
4.5 Execute Semantic Search
GET /ecommerce-products/_search
{
  "_source": ["product_name", "product_description", "price", "brand"],
  "query": {
    "semantic": {
      "product_vector": {
        "query_text": "商务人士适合的笔记本",
        "candidates": 20,
        "query_strategy": "LSH_COSINE"
      }
    }
  },
  "size": 10
}
The query text means "laptops suitable for business professionals". It returns items whose descriptions contain semantically related terms such as "商务办公" (business and office use) or "轻薄便携" (thin, light, and portable), even if the exact phrase "商务人士" (business professionals) is absent.
Pure semantic search can over-generalize; for precise product names like "iPhone 15 Pro", keyword matching remains more accurate. Combining the two calls for a separated hybrid approach: run the keyword query and the semantic query independently, then merge the results at the application layer. Mixing them directly in a single bool query leads to a null_pointer_exception error.
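One way to merge the two result lists at the application layer is Reciprocal Rank Fusion (RRF). The article does not say which merge method was used, so the sketch below is an assumption: the `rrf_merge` helper and the document IDs are hypothetical, and k=60 is simply the constant commonly used with RRF.

```python
def rrf_merge(result_lists, k=60, top_n=10):
    """Merge ranked ID lists with Reciprocal Rank Fusion: score = sum of 1 / (k + rank)."""
    scores = {}
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for a stable order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]

keyword_hits = ["1", "7", "4"]        # hypothetical IDs returned by a keyword (match) query
semantic_hits = ["2", "1", "9", "4"]  # hypothetical IDs returned by the semantic query
merged = rrf_merge([keyword_hits, semantic_hits])
print(merged)  # ['1', '4', '2', '7', '9']
```

RRF uses only rank positions, which sidesteps the fact that BM25 scores and vector similarity scores live on different scales. Documents found by both queries ("1" and "4" here) rise to the top.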
5. Conclusion
Generating vectors in an ingest pipeline is an elegant design: documents are embedded once at write time, so only the query needs to be vectorized at search time, and business code stays unchanged. Semantic search gives the engine a much better grasp of user intent, improving user experience and business value, while a hybrid strategy compensates for its weakness on exact product names.
Mingyi World Elasticsearch
The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).
