How to Build a Multimodal Product Search System with Embedding and Vector Retrieval
This article presents an end-to-end solution for multimodal product search: the evolution from keyword to image-based queries, the core embedding and vector-retrieval techniques, practical Elasticsearch Serverless integration, quantization methods for cost control, and a complete demo workflow for building a high-performance, low-cost search platform.
Multimodal Product Retrieval Overview
Modern e‑commerce search increasingly requires multimodal and cross‑modal queries that can understand both images and complex textual descriptions. Traditional keyword‑only search cannot capture visual attributes or nuanced product features.
Solution Architecture
The solution consists of three layers: Data Processing, Query & Recall, and Fusion & Ranking.
Data Processing Layer
Product Metadata Processing (Structured Text)
Data Sources: product titles, descriptions, categories, tags, etc.
Processing Flow: tokenization and indexing into a traditional text engine.
Image Data Processing (Unstructured Data)
Content Understanding: a multimodal model generates a descriptive caption for each image (e.g., "green shorts with a cartoon dinosaur").
Embedding: a CNN-based encoder (e.g., ResNet) transforms the image into a high-dimensional vector stored in a vector engine.
The resulting product representation consists of a text document in the text engine and a vector in the vector engine.
Query & Recall Layer
Text Query
Keyword Match: direct matching in the text engine.
Semantic Match: the query text is embedded and matched against stored vectors.
Image Query
The uploaded image is encoded into a query vector.
Vector similarity search retrieves visually similar products.
Fusion & Ranking Layer
Rerank & Score: combines text relevance scores and vector similarity scores.
Result Return: the top-N results are presented to the user.
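One common way to implement this fusion step is min-max normalization of each recall path's raw scores followed by a weighted sum. The sketch below is a minimal illustration of that idea, not the exact method used here; the `alpha` weight and the toy scores are assumptions.

```python
def minmax(scores):
    """Normalize a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(text_hits, vector_hits, alpha=0.6):
    """Weighted-sum fusion of two recall paths.

    text_hits / vector_hits: {doc_id: raw_score} from each path.
    alpha weights the text path; (1 - alpha) weights the vector path.
    A document missing from one path contributes 0 for that path.
    """
    ids = list(text_hits.keys() | vector_hits.keys())
    t = dict(zip(ids, minmax([text_hits.get(i, 0.0) for i in ids])))
    v = dict(zip(ids, minmax([vector_hits.get(i, 0.0) for i in ids])))
    fused = {i: alpha * t[i] + (1 - alpha) * v[i] for i in ids}
    return sorted(fused, key=fused.get, reverse=True)
```

Because raw BM25 scores and vector similarities live on different scales, normalizing before mixing keeps one path from dominating the other.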
Embedding Techniques
Text Embedding
Dense Models (e.g., Word2Vec, S-BERT, LLM-based encoders): produce dense vectors that capture deep semantic similarity.
Sparse Models (e.g., BM25, SPLADE): generate sparse vectors that emphasize exact term matching.
Hybrid Models: combine dense and sparse representations so that semantic recall and lexical precision reinforce each other.
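The contrast between the two representations can be made concrete: a dense vector is a fixed-length array scored by cosine similarity, while a sparse vector is a term-to-weight map scored by a dot product over shared terms. The toy vectors and the 50/50 blend below are illustrative assumptions, not output of any real model.

```python
import math

def dense_sim(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

def sparse_sim(q, d):
    """Dot product of two {term: weight} sparse vectors; only shared terms count."""
    return sum(w * d[t] for t, w in q.items() if t in d)

# Toy query/document pair in both representations.
dense_q, dense_d = [0.2, 0.7, 0.1], [0.25, 0.65, 0.05]   # dense-model style
sparse_q = {"green": 1.2, "shorts": 0.9}                 # BM25/SPLADE style
sparse_d = {"green": 1.0, "shorts": 1.1, "dinosaur": 0.8}

# A hybrid model blends the two signals.
hybrid = 0.5 * dense_sim(dense_q, dense_d) + 0.5 * sparse_sim(sparse_q, sparse_d)
```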
Image Embedding
Deep convolutional neural networks (e.g., ResNet) extract visual features, mapping images from raw pixel space to comparatively low-dimensional vectors that encode visual semantics.
Vector Retrieval
Vector retrieval finds the K‑nearest neighbors (KNN) of a query vector in a high‑dimensional space.
Euclidean Distance (L2 norm): smaller distance means higher similarity. Normalized score: 1 / (1 + L2_norm^2).
Dot Product: equivalent to cosine similarity when the vectors are normalized to unit length.
Cosine Similarity: measures the angle between vectors, ranging from -1 to 1.
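The three metrics, including the normalized L2 score above, can be written out in a few lines (a plain-Python sketch; production engines compute these over quantized indexes, not raw lists):

```python
import math

def l2_score(q, d):
    """Normalized L2 score: 1 / (1 + ||q - d||^2); higher means more similar."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, d))
    return 1.0 / (1.0 + sq_dist)

def dot_product(q, d):
    return sum(a * b for a, b in zip(q, d))

def cosine(q, d):
    """Angle-based similarity in [-1, 1]; equals dot product on unit vectors."""
    return dot_product(q, d) / (
        math.sqrt(dot_product(q, q)) * math.sqrt(dot_product(d, d))
    )
```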
Elasticsearch Vector Support
Field Types: dense_vector for dense vectors, sparse_vector for sparse vectors, semantic_text for automatically mapped vectors.
Inference API: calls external AI models (e.g., embedding models) during indexing or querying.
Ingest Pipeline: text_embedding or inference processors convert text fields to vectors on the fly.
KNN Search: native approximate nearest-neighbor API on dense_vector fields.
Hybrid Search: combines match queries with KNN vector search; scores are fused using Reciprocal Rank Fusion (RRF) for robust ranking.
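RRF itself is rank-based: each document scores the sum of 1 / (k + rank) over every result list it appears in, with k defaulting to 60 in Elasticsearch. A minimal reimplementation for intuition (Elasticsearch does this server-side):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    rankings: list of ranked doc-id lists, best first; ranks are 1-based.
    A document absent from a list contributes nothing for that list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF sidesteps the score-scale mismatch between BM25 and vector similarity entirely, which is why it is a robust default for hybrid ranking.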
Performance Optimization – Quantization
Scalar Quantization (SQ): maps float32 values to int8 (1 byte) or int4 (0.5 bytes) per dimension, shrinking vectors to 1/4 or 1/8 of their original size.
Better Binary Quantization (BBQ): compresses further, reducing memory usage by up to 95% while maintaining acceptable recall.
Example: a dataset of 10 billion 1024-dimensional float32 vectors (~37 TB) can be reduced to ~1.8 TB with BBQ + HNSW indexing, cutting the required compute nodes from 170 to 9.
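The int8 case can be sketched with a simple symmetric scheme: map [-m, m] (m = the largest absolute value) linearly onto [-127, 127] and keep the scale for approximate reconstruction. This is a minimal illustration only; real engines such as Elasticsearch fit quantization parameters per segment and compensate for error at query time.

```python
def quantize_int8(vec):
    """Symmetric scalar quantization of one float vector to int8 codes.

    Stores 1 byte per dimension instead of 4 -- the 1/4 size reduction of
    SQ-int8. (For scale: 10 billion 1024-d float32 vectors take
    10e9 * 1024 * 4 bytes ~= 41 TB, i.e. ~37 TiB, before any quantization.)
    """
    m = max(abs(v) for v in vec) or 1.0
    scale = m / 127.0
    codes = [round(v / scale) for v in vec]   # ints in [-127, 127]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]
```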
Best Practice with AI Search Open Platform & Elasticsearch Serverless
Overall Architecture
Data Source: product data stored in RDS (ID, text description, image URL).
AI Search Open Platform:
Offline data service extracts data from RDS.
Multimodal vector service calls built‑in AI models (e.g., M2‑Encoder, Qwen2‑VL) to generate unified multimodal vectors.
Processed text and vectors are written to Elasticsearch Serverless.
Elasticsearch Serverless stores both text and vector indexes and handles online queries.
Online Query Flow:
User submits a text or image query.
The query is vectorized by the AI Search platform.
Vectorized query is sent to Elasticsearch Serverless for multi‑path recall (text + vector).
Top‑N results are returned to the user.
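The multi-path recall step in this flow could be expressed as a single Elasticsearch search request that pairs a lexical match with a kNN vector query and lets RRF fuse them server-side (retriever syntax, Elasticsearch 8.14+). Index and field names below (`title`, `image_vector`) are hypothetical, and the query vector would come from the platform's vectorization step.

```python
def build_hybrid_request(query_text, query_vector, size=10):
    """Build an Elasticsearch search body fusing lexical and vector recall via RRF."""
    return {
        "size": size,
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Path 1: traditional full-text match on the product title.
                    {"standard": {"query": {"match": {"title": query_text}}}},
                    # Path 2: approximate kNN over the image embedding field.
                    {"knn": {
                        "field": "image_vector",
                        "query_vector": query_vector,
                        "k": 50,
                        "num_candidates": 200,
                    }},
                ]
            }
        },
    }
```

The body would be sent as-is via the `_search` endpoint (or the official client's `search` method) against the Serverless index.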
AI Search Open Platform Features
Supports various data sources (OSS, MySQL, Hudi, Iceberg, MaxCompute).
Provides micro‑services for document parsing, multimodal parsing, vectorization, reranking, LLM inference, and agent capabilities.
Integrates with LangChain, LlamaIndex, and vector databases such as Milvus, Havenask, and Elasticsearch.
Elasticsearch Serverless Advantages
Zero Operations: a fully managed proxy hides cluster details; version upgrades are automatic.
Cost-Effective: pay-per-use compute-unit (CU) billing with second-level granularity.
High Elasticity: automatic scaling and resource adjustment based on load.
Seamless AI Model Integration: built-in models via the Inference API, plus support for custom external models.
Vector Optimizations: intelligent vector field filtering, default quantization (int8 or BBQ), and automatic pre-warming of HNSW and quantized indexes.
Key Takeaways
The architecture combines precise text matching with semantic and visual similarity via vector search, leverages modern embedding models, and uses Elasticsearch’s native vector capabilities. Quantization techniques (SQ and BBQ) make large‑scale vector retrieval memory‑efficient, while AI Search Open Platform and Elasticsearch Serverless provide a fully managed, scalable, and cost‑effective production environment for multimodal product search.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
