How to Build a Multimodal Product Search Engine with Embedding and Vector Retrieval on Elasticsearch Serverless

This article explores the evolution of e‑commerce search toward multimodal and cross‑modal capabilities, outlines a generic architecture that combines text and image processing via embedding and vector retrieval, and demonstrates how to implement the solution using Alibaba Cloud's AI Search Open Platform and Elasticsearch Serverless with detailed guidance on models, similarity metrics, quantization, and performance optimization.

Overview

With the rapid development of artificial intelligence, user expectations for search have moved beyond simple keyword matching to multimodal and cross‑modal queries that can understand images and complex natural‑language descriptions. Traditional text‑only search struggles with visual elements and nuanced product attributes.

Multimodal Product Search Solution

The solution consists of three layers: data processing, query & recall, and fusion & ranking.

1. Data Processing Layer

Text Metadata Processing : Extract structured fields such as title, description, category, and tags from the product database, tokenize the text, and build an inverted index in a traditional text engine.

Image Data Processing : Use a multimodal large model to generate a descriptive caption for each image (e.g., "green shorts with a cartoon dinosaur pattern"). The caption is then tokenized and indexed like text.

Embedding (Vectorization) : Convert both text and image data into high‑dimensional vectors using embedding models. The resulting vectors are stored in a vector engine.
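As a rough sketch of this layer, assuming Elasticsearch 8.x and its Python client: the index name products, the 1024‑dimension embedding, and the embed() helper below are illustrative assumptions, not part of the original design.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Lexical fields go into the inverted index; the caption's embedding
# goes into an indexed dense_vector field for kNN retrieval.
es.indices.create(
    index="products",  # hypothetical index name
    mappings={
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},
            "category": {"type": "keyword"},
            "tags": {"type": "keyword"},
            "image_caption": {"type": "text"},  # caption from the multimodal LLM
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,  # assumed embedding model dimension
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

def embed(text: str) -> list[float]:
    """Placeholder for a call to the deployed embedding model."""
    raise NotImplementedError

caption = "green shorts with a cartoon dinosaur pattern"
es.index(
    index="products",
    document={
        "title": "Kids' dinosaur shorts",
        "description": "Cotton shorts with a cartoon dinosaur print",
        "category": "kidswear",
        "image_caption": caption,
        "embedding": embed(caption),  # vector stored next to the text fields
    },
)
```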

2. Query & Recall Layer

Text Query : The user's keyword query is matched lexically against the inverted index in the text engine and is also transformed into a query vector via embedding for semantic matching in the vector engine.

Image Query : The uploaded image is encoded into a query vector and used for nearest‑neighbor search in the vector engine.
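A minimal image-query sketch against the same hypothetical products index, where embed_image() stands in for the multimodal embedding model:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def embed_image(image_bytes: bytes) -> list[float]:
    """Placeholder: encode the uploaded image with the multimodal model."""
    raise NotImplementedError

with open("query.jpg", "rb") as f:
    query_vector = embed_image(f.read())

# Approximate nearest-neighbor search over the dense_vector field.
resp = es.search(
    index="products",
    knn={
        "field": "embedding",
        "query_vector": query_vector,
        "k": 10,                # neighbors to return
        "num_candidates": 100,  # per-shard candidate pool (recall vs. latency)
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```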

3. Fusion & Ranking Layer

Rerank & Score : Results from the text and vector engines are merged and re‑ranked using a Rerank module that combines textual relevance scores and vector similarity scores.

Result Return : The final top‑N results are presented to the user.
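The rerank module itself is not specified here; one simple stand-in is a weighted combination of min-max-normalized lexical and vector scores, sketched below:

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize to [0, 1] so lexical and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(text_scores: dict[str, float],
         vector_scores: dict[str, float],
         alpha: float = 0.5) -> list[tuple[str, float]]:
    """Blend the two recall paths; alpha weights the lexical path."""
    t, v = normalize(text_scores), normalize(vector_scores)
    docs = set(t) | set(v)
    fused = {d: alpha * t.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```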

Key Technologies

Embedding (Vectorization)

Embedding maps unstructured data (text, images) into numeric vectors that can be compared mathematically. Three model families are commonly used:

Dense Model (e.g., Word2Vec, S‑BERT, LLM‑based models): Produces dense vectors where most dimensions are non‑zero, capturing deep semantic relationships.

Sparse Model (e.g., BM25, SPLADE): Generates sparse vectors with few non‑zero entries, emphasizing exact term matching.

Hybrid Model : Combines dense and sparse vectors to achieve both semantic generalization and precise keyword matching, often yielding the best retrieval performance.
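To make the distinction concrete, here is how the two vector shapes typically look in code (values are made up):

```python
# Dense: fixed length, nearly all dimensions non-zero (semantic features).
dense_vector = [0.12, -0.48, 0.33, 0.07]  # real models use 768-1024+ dims

# Sparse: term -> weight map; only meaningful terms get non-zero entries.
sparse_vector = {"shorts": 2.1, "dinosaur": 1.7, "green": 0.9}

# A hybrid scheme indexes both and fuses their scores at query time.
```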

Vector Retrieval

Vector retrieval finds the K‑nearest neighbors (KNN) of a query vector in a high‑dimensional space. Common similarity measures include:

Euclidean Distance (L2) : Direct geometric distance between vectors; often converted to a similarity score as 1 / (1 + d²), where d is the L2 distance.

Dot Product : Equivalent to cosine similarity when vectors are normalized; larger values indicate higher similarity.

Cosine Similarity : Measures the angle between vectors, ranging from -1 to 1.
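The three measures in code, including the 1 / (1 + d²) normalization mentioned above (a numpy sketch):

```python
import numpy as np

def l2_score(q: np.ndarray, v: np.ndarray) -> float:
    """Euclidean distance mapped into (0, 1]: score = 1 / (1 + d^2)."""
    d = np.linalg.norm(q - v)
    return 1.0 / (1.0 + d * d)

def dot_product(q: np.ndarray, v: np.ndarray) -> float:
    """Unbounded; equals cosine similarity when q and v are unit-normalized."""
    return float(np.dot(q, v))

def cosine_similarity(q: np.ndarray, v: np.ndarray) -> float:
    """Angle-based similarity in [-1, 1]."""
    return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
```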

Elasticsearch Vector Support

Elasticsearch now provides native vector field types and APIs:

dense_vector : Stores dense float vectors.

sparse_vector : Stores high‑dimensional sparse vectors efficiently.

semantic_text : An abstract type that automatically maps text to the appropriate vector representation via a configured inference model.

Inference API : Calls external AI models (e.g., embedding models) during indexing or query time.

Ingest Pipeline : Uses processors like text_embedding or inference to convert fields to vectors on the fly.

KNN Search : Native approximate nearest‑neighbor search on dense_vector fields.

Hybrid Search : Executes both traditional match queries and KNN vector queries in a single request.

Reciprocal Rank Fusion (RRF) : Merges rankings from different recall sources without relying on raw scores, improving robustness.
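RRF scores each document by summing the reciprocal of its rank across the result lists, score(d) = Σ 1 / (k + rank_i(d)), with k commonly set to 60. A minimal sketch:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion over several ranked lists of document IDs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: lexical recall and vector recall disagree; RRF rewards documents
# that rank well in both lists, without comparing raw scores.
fused = rrf([["d1", "d2", "d3"], ["d3", "d1", "d4"]])
```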

Vector Quantization for Performance

To reduce memory consumption for large‑scale vector retrieval, quantization techniques are applied:

Scalar Quantization (SQ) : Converts 32‑bit float vectors to 8‑bit or 4‑bit integers, shrinking vector memory by 4× (int8) or 8× (int4).

BBQ (Better Binary Quantization) : An advanced quantization method that can reduce memory usage by up to 95% with a modest impact on recall, enabling billion‑scale vector search.

Example: A dataset of 10 billion 1024‑dimensional float32 vectors (≈37 TiB of raw vector data) can be reduced to ~1.8 TB after applying BBQ and HNSW indexing, cutting the required compute nodes from 170 to 9.
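A back-of-the-envelope check of those figures (BBQ stores roughly 1 bit per dimension; the gap up to ~1.8 TB is HNSW graph and index overhead):

```python
NUM_VECTORS = 10_000_000_000   # 10 billion vectors
DIMS = 1024

raw_bytes = NUM_VECTORS * DIMS * 4   # float32 = 4 bytes per dimension
bbq_bytes = NUM_VECTORS * DIMS / 8   # ~1 bit per dimension after BBQ

TIB = 1024 ** 4
print(f"raw: {raw_bytes / TIB:.1f} TiB")   # ~37.3 TiB
print(f"bbq: {bbq_bytes / TIB:.2f} TiB")   # ~1.16 TiB (+ graph overhead -> ~1.8 TB)
```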

Best Practice on Alibaba Cloud

Overall Architecture

Data resides in an RDS instance. An offline data service extracts product records, then the AI Search Open Platform's multimodal vector service (using models such as M2‑Encoder or Qwen2‑VL) converts text and images into unified vectors. The processed data is written to Elasticsearch Serverless, which stores both textual and vector indexes. Online queries from a front‑end are routed through the AI Search platform for vectorization, then sent to Elasticsearch Serverless for multi‑path recall, ranking, and final result delivery.
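The online path might look like the following sketch. vectorize_query() is a hypothetical wrapper around the platform's multimodal embedding endpoint (the real SDK call and endpoint name differ), and the hybrid search reuses the products index assumed earlier:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # a Serverless endpoint in practice

def vectorize_query(text: str) -> list[float]:
    """Hypothetical wrapper around the AI Search Open Platform's multimodal
    embedding service (e.g., backed by M2-Encoder or Qwen2-VL)."""
    raise NotImplementedError

def search_products(user_query: str):
    qv = vectorize_query(user_query)
    # Multi-path recall in one request: lexical match + kNN vector search.
    return es.search(
        index="products",
        query={"match": {"title": user_query}},
        knn={"field": "embedding", "query_vector": qv,
             "k": 10, "num_candidates": 100},
    )
```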

Alibaba Cloud AI Search Open Platform

The platform offers a one‑stop, enterprise‑grade AI search solution with layered architecture:

Data Source & Management : Supports OSS, MySQL, Hudi, Iceberg, MaxCompute, etc.

Core Services & Ecosystem : Provides document parsing, multimodal parsing, embedding, rerank, LLM inference, and LLM agents as modular micro‑services.

Open Frameworks : Seamlessly integrates with LangChain, LlamaIndex, and vector databases like Milvus, Havenask, and Elasticsearch.

Application Development & Deployment : Enables rapid development of AI search applications and supports deployment via Function Compute (FC), PAI, and other methods.

Elasticsearch Serverless

Serverless Elasticsearch abstracts cluster management, providing automatic version upgrades, resource scaling, and built‑in monitoring. Key benefits include:

Zero Operations : No need to manage nodes, shards, or capacity planning.

Pay‑as‑You‑Go : Billing by compute units (CU) per second.

Auto‑Scaling : Resources expand or shrink based on real‑time load.

Seamless AI Model Integration : Built‑in inference API can call Alibaba Cloud AI models or custom external models without additional code.

Vector Optimizations : Default exclusion of vector fields from _source, one‑click activation of int8 or BBQ quantization, and automatic pre‑warming of HNSW and quantized index files to reduce cold‑start latency.
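On Serverless these optimizations are a configuration switch, but the underlying mapping knobs look roughly like this on Elasticsearch 8.x (a sketch; int8_hnsw enables scalar quantization, and BBQ-based index_options depend on version availability):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

es.indices.create(
    index="products_quantized",  # hypothetical index
    mappings={
        "_source": {"excludes": ["embedding"]},  # keep raw floats out of _source
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
                # int8_hnsw = scalar-quantized HNSW; newer releases also
                # expose BBQ-based index_options.
                "index_options": {"type": "int8_hnsw"},
            },
        },
    },
)
```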

Conclusion

The presented architecture demonstrates how to combine modern embedding techniques, efficient vector retrieval, and Alibaba Cloud's AI Search Open Platform with Elasticsearch Serverless to build a high‑performance, low‑cost multimodal product search system. By leveraging dense, sparse, and hybrid models, applying quantization, and using advanced ranking strategies such as RRF, developers can achieve accurate semantic search while handling massive data volumes.
