How Multimodal Product Search Works: Embedding, Vector Retrieval, and Elasticsearch Serverless
This article explores the evolution from keyword to multimodal search, detailing a generic solution architecture, core embedding and vector retrieval technologies, and practical implementation using Alibaba Cloud AI Search Open Platform and Elasticsearch Serverless.
Multimodal Product Search Architecture
The system processes structured product metadata (titles, descriptions, categories, tags) and unstructured images. Text data is tokenized and indexed in a traditional text engine. Images are passed through a multimodal AI model (e.g., M2‑Encoder, Qwen2‑VL) to generate descriptive captions, which are then tokenized and indexed alongside the metadata.
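To make the captioning-and-indexing step concrete, here is a minimal sketch. It assumes the official Python Elasticsearch client; caption_image() is a hypothetical stand-in for a multimodal captioner such as M2‑Encoder or Qwen2‑VL, and the endpoint, index, and field names are illustrative.

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials.
es = Elasticsearch("https://<your-endpoint>:9243", api_key="<api-key>")

def caption_image(image_path: str) -> str:
    # Hypothetical stand-in for a multimodal captioner (e.g., M2-Encoder,
    # Qwen2-VL); a real system would invoke the model here.
    return "red waterproof hiking jacket with hood, outdoor scene"

product = {
    "id": "sku-1001",
    "title": "Women's waterproof hiking jacket",
    "category": "outdoor/apparel",
    "tags": ["waterproof", "hiking", "jacket"],
}

# The generated caption is indexed alongside the structured metadata, so a
# plain keyword search can also match visual content.
product["image_caption"] = caption_image("images/sku-1001.jpg")
es.index(index="products", id=product["id"], document=product)
```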
Embedding Layer
Both text and image data are transformed into high‑dimensional vectors. Dense models (Word2Vec, S‑BERT, LLM‑based) produce dense vectors that capture semantic similarity, while sparse models (BM25, SPLADE) generate high‑dimensional sparse vectors for exact term matching. A hybrid approach can combine dense and sparse representations for optimal recall and precision.
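As a rough illustration of the two representations, the sketch below uses an S‑BERT model from the sentence-transformers library for dense vectors, and a toy term-frequency dictionary as a stand-in for BM25/SPLADE-style sparse weights.

```python
from collections import Counter
from sentence_transformers import SentenceTransformer

texts = ["waterproof hiking jacket", "red running shoes"]

# Dense: an S-BERT model maps each text to a 384-dimensional semantic vector;
# nearby vectors mean semantically similar texts, even with no shared words.
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
dense_vecs = dense_model.encode(texts)  # numpy array of shape (2, 384)

# Sparse (toy stand-in for BM25/SPLADE weights): term -> weight pairs that
# are non-zero only for terms actually present, preserving exact matching.
sparse_vecs = [dict(Counter(text.split())) for text in texts]

print(dense_vecs.shape)  # (2, 384)
print(sparse_vecs[0])    # {'waterproof': 1, 'hiking': 1, 'jacket': 1}
```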
Query & Retrieval Layer
Text query: matched against the text engine (keyword match) and also encoded into a query vector for similarity search in the vector engine.
Image query: the uploaded image is encoded into a vector and used for nearest‑neighbor search.
Results from both paths are merged and re‑ranked with Reciprocal Rank Fusion (RRF) before the top‑N items are returned; a minimal RRF sketch follows.
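RRF scores each document by summing 1 / (k + rank) over every result list it appears in, where k is a smoothing constant (commonly 60). The sketch below implements that formula; the document IDs are illustrative.

```python
def rrf_merge(result_lists, k=60, top_n=10):
    # RRF: score(d) = sum over result lists of 1 / (k + rank_of_d_in_list).
    # k (commonly 60) damps the influence of top ranks in any single list.
    scores = {}
    for results in result_lists:  # each list is ranked best-first
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Illustrative ranked ID lists from the keyword path and the vector path.
keyword_hits = ["p3", "p1", "p7", "p2"]
vector_hits = ["p1", "p4", "p3", "p9"]
print(rrf_merge([keyword_hits, vector_hits]))  # items ranked well in both paths rise
```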
Key Technologies
Embedding (Vectorization)
Embedding converts unstructured data into fixed‑length numerical vectors. Dense vectors capture deep semantics; sparse vectors preserve exact keyword matches. Hybrid models generate both representations for each document.
Vector Retrieval
Similarity is measured with Euclidean distance, dot product, or cosine similarity. Elasticsearch supports dense_vector, sparse_vector, and semantic_text field types, native K‑Nearest Neighbor (KNN) search, hybrid (text + vector) queries, and quantization.
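A minimal sketch of a dense_vector mapping and a native KNN query with the Python client follows; the endpoint, index name, and credentials are placeholders, and the 384 dimensions match the S‑BERT example above.

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("https://<your-endpoint>:9243", api_key="<api-key>")  # placeholders

# Index a dense_vector field with cosine similarity for approximate KNN.
es.indices.create(
    index="products-vec",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "title_vec": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Encode the query and run native KNN: return the k nearest of the
# num_candidates considered per shard (higher = better recall, more work).
query_vec = SentenceTransformer("all-MiniLM-L6-v2").encode("waterproof jacket")
resp = es.search(
    index="products-vec",
    knn={"field": "title_vec", "query_vector": query_vec.tolist(),
         "k": 10, "num_candidates": 100},
)
```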
Quantization for Performance
Scalar Quantization (SQ) maps float32 vectors to int8 or int4, reducing memory usage by 75 % (int8) or 87.5 % (int4). Better Binary Quantization (BBQ) can cut memory by up to 95 % with modest recall loss, enabling billion‑scale vector search.
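In Elasticsearch, quantization is a per‑field choice in the mapping. The index_options types below come from recent open‑source releases (int8_hnsw, int4_hnsw, bbq_hnsw); availability may vary by Serverless version.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://<your-endpoint>:9243", api_key="<api-key>")  # placeholders

# Quantized HNSW index for a dense_vector field. Swap index_options.type:
#   "int8_hnsw" -> 1 byte/dim   (75 % smaller than float32)
#   "int4_hnsw" -> 0.5 byte/dim (87.5 % smaller)
#   "bbq_hnsw"  -> ~1 bit/dim   (up to ~95 % smaller, modest recall loss)
es.indices.create(
    index="products-quantized",
    mappings={
        "properties": {
            "image_vec": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            }
        }
    },
)
```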
Implementation on Alibaba Cloud
AI Search Open Platform: extracts product data from RDS, runs multimodal models to generate unified vectors, and provides pipelines to ingest vectors into Elasticsearch.
Elasticsearch Serverless: a fully managed, auto‑scaling Elasticsearch service that supports dense/sparse vectors, native KNN, hybrid search, and quantization (int8, BBQ). It offers zero‑ops scaling, built‑in monitoring, and an Inference API for real‑time model integration.
Data Flow
1. Extract product records (ID, text, image) from RDS.
2. Use AI Search Open Platform to generate multimodal embeddings.
3. Ingest text fields and vector fields into Elasticsearch Serverless indices.
4. At query time, route user text or image to the same embedding service, obtain a query vector, and perform simultaneous text match and vector KNN search.
5. Merge results with RRF and return the top‑N hits (the query‑time steps are sketched below).
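A minimal sketch of the query‑time path: embed the query with the same model used at ingest time, then issue one request that runs the keyword match and the KNN search and fuses them server‑side. The retriever/rrf syntax follows recent open‑source Elasticsearch (8.14+) and may differ slightly on a given Serverless version; endpoint and names are placeholders.

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("https://<your-endpoint>:9243", api_key="<api-key>")  # placeholders

# Embed the user's text query with the same model used at ingest time.
query_text = "waterproof hiking jacket"
query_vec = SentenceTransformer("all-MiniLM-L6-v2").encode(query_text)

# One request: keyword match + vector KNN, fused server-side with RRF.
resp = es.search(
    index="products-vec",
    retriever={
        "rrf": {
            "retrievers": [
                {"standard": {"query": {"match": {"title": query_text}}}},
                {"knn": {"field": "title_vec", "query_vector": query_vec.tolist(),
                         "k": 10, "num_candidates": 100}},
            ]
        }
    },
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```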
Elasticsearch Vector Features
dense_vector – stores dense embeddings.
sparse_vector – stores high‑dimensional sparse embeddings.
semantic_text – automatically maps text to vectors via configured inference models (see the sketch after this list).
Native KNN API for dense_vector fields.
Hybrid search combining match queries with KNN.
RRF fusion for robust ranking across heterogeneous result sets.
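semantic_text wires a field to an inference endpoint so that chunking and embedding happen inside Elasticsearch at both index and query time. A minimal sketch under that assumption follows; "my-embedder" is a hypothetical endpoint ID (created as in the Inference API sketch near the end of this article), and the syntax comes from recent open‑source releases.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://<your-endpoint>:9243", api_key="<api-key>")  # placeholders

# semantic_text delegates chunking and embedding to the inference endpoint,
# so documents are indexed as plain text with no client-side vector step.
es.indices.create(
    index="products-semantic",
    mappings={
        "properties": {
            "description": {"type": "semantic_text", "inference_id": "my-embedder"}
        }
    },
)
es.index(index="products-semantic",
         document={"description": "Lightweight waterproof jacket for hiking."})

# The semantic query embeds the query string with the same endpoint.
resp = es.search(
    index="products-semantic",
    query={"semantic": {"field": "description", "query": "rain gear for trekking"}},
)
```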
Performance Optimizations
Scalar Quantization: reduces float32 (4 bytes) to int8 (1 byte) or int4 (0.5 bytes) per dimension.
BBQ: compresses further, to roughly one bit per dimension, achieving up to 95 % memory reduction; recall loss can be mitigated by increasing num_candidates in KNN search (a back‑of‑envelope calculation follows this list).
HNSW graph indexing combined with quantization enables large‑scale retrieval with modest hardware.
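The savings are easy to sanity‑check. The numbers below assume 100 million 768‑dimensional vectors and count vector data only, ignoring HNSW graph and metadata overhead, which is why quoted figures say "up to" 95 % rather than the raw 32x of 1‑bit encoding.

```python
# Memory per encoding for 100 million 768-dim vectors (vector data only).
n, dims = 100_000_000, 768
encodings = {
    "float32": 4.0,    # bytes per dimension
    "int8":    1.0,    # 75% reduction
    "int4":    0.5,    # 87.5% reduction
    "bbq":     1 / 8,  # ~1 bit per dimension
}
for name, bytes_per_dim in encodings.items():
    gb = n * dims * bytes_per_dim / 1e9
    print(f"{name:7s} ~{gb:7.1f} GB")
# float32 ~307.2 GB | int8 ~76.8 GB | int4 ~38.4 GB | bbq ~9.6 GB
```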
Serverless Advantages
Zero operational overhead – no cluster management, automatic version upgrades, and built‑in monitoring.
Pay‑as‑you‑go CU‑based billing with second‑level granularity.
Automatic scaling and resource auto‑tuning based on traffic.
Seamless integration of built‑in AI models via the Inference API; custom models can be added through simple API configuration (a sketch follows this list).
Vector‑specific optimizations: source field exclusion for vectors, one‑click int8/BBQ quantization, adaptive pre‑warming of HNSW indexes.
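As an illustration of the Inference API, the REST sketch below creates a text‑embedding endpoint and calls it directly. The alibabacloud-ai-search service name and its settings follow the open‑source Elasticsearch documentation; the host, workspace, and model ID are placeholders, and exact fields may differ by version and deployment, so treat this as a template.

```python
import requests

ES = "https://<your-endpoint>:9243"  # placeholder
HEADERS = {"Authorization": "ApiKey <api-key>", "Content-Type": "application/json"}

# Create an inference endpoint backed by an external embedding service.
requests.put(
    f"{ES}/_inference/text_embedding/my-embedder",
    headers=HEADERS,
    json={
        "service": "alibabacloud-ai-search",
        "service_settings": {
            "api_key": "<platform-api-key>",
            "service_id": "ops-text-embedding-001",  # example model ID
            "host": "<workspace-host>",
            "workspace": "default",
        },
    },
)

# Call the endpoint directly to embed a query string in real time.
resp = requests.post(
    f"{ES}/_inference/text_embedding/my-embedder",
    headers=HEADERS,
    json={"input": "waterproof hiking jacket"},
)
print(resp.json())
```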
This architecture enables a cost‑effective, high‑performance multimodal product search system that leverages modern embedding techniques, efficient vector retrieval, and serverless cloud infrastructure.