How to Build a Multimodal Product Search System with Embedding and Vector Retrieval
This article presents an end-to-end solution for multimodal product search: the evolution from keyword to image-based queries, the core embedding and vector-retrieval techniques, practical Elasticsearch Serverless integration, quantization methods for cost control, and a complete demo workflow for building a high-performance, low-cost search platform.
Multimodal Product Retrieval Overview
Modern e‑commerce search increasingly requires multimodal and cross‑modal queries that can understand both images and complex textual descriptions. Traditional keyword‑only search cannot capture visual attributes or nuanced product features.
Solution Architecture
The solution consists of three layers: Data Processing, Query & Recall, and Fusion & Ranking.
Data Processing Layer
Product Metadata Processing (Structured Text)
Data Sources: product titles, descriptions, categories, tags, etc.
Processing Flow: tokenization and indexing into a traditional text engine.
Image Data Processing (Unstructured Data)
Content Understanding: a multimodal model generates a descriptive caption for each image (e.g., "green shorts with a cartoon dinosaur").
Embedding: a CNN-based encoder (e.g., ResNet) transforms the image into a high-dimensional vector stored in a vector engine.
The resulting product representation consists of a text document in the text engine and a vector in the vector engine.
Query & Recall Layer
Text Query
Keyword Match: direct matching in the text engine.
Semantic Match: the query text is embedded and matched against stored vectors.
Image Query
The uploaded image is encoded into a query vector.
Vector similarity search retrieves visually similar products.
Fusion & Ranking Layer
Rerank & Score: combines text relevance scores and vector similarity scores.
Result Return: the top-N results are presented to the user.
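One common way to implement this fusion step is min-max normalization of each recall path's raw scores followed by a weighted sum. The sketch below is a minimal illustration of that idea, not the exact method used here; the `alpha` weight and the toy scores are assumptions.

```python
def minmax(scores):
    """Normalize a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(text_hits, vector_hits, alpha=0.6):
    """Weighted-sum fusion of two recall paths.

    text_hits / vector_hits: {doc_id: raw_score} from each path.
    alpha weights the text path; (1 - alpha) weights the vector path.
    A document missing from one path contributes 0 for that path.
    """
    ids = list(text_hits.keys() | vector_hits.keys())
    t = dict(zip(ids, minmax([text_hits.get(i, 0.0) for i in ids])))
    v = dict(zip(ids, minmax([vector_hits.get(i, 0.0) for i in ids])))
    fused = {i: alpha * t[i] + (1 - alpha) * v[i] for i in ids}
    return sorted(fused, key=fused.get, reverse=True)
```

Because raw BM25 scores and vector similarities live on different scales, normalizing before mixing keeps one path from dominating the other.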
Embedding Techniques
Text Embedding
Dense Models (e.g., Word2Vec, S-BERT, LLM-based encoders): produce dense vectors that capture deep semantic similarity.
Sparse Models (e.g., BM25, SPLADE): generate sparse vectors that emphasize exact term matching.
Hybrid Models: combine dense and sparse representations so that semantic recall and lexical precision reinforce each other.
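The contrast between the two representations can be made concrete: a dense vector is a fixed-length array scored by cosine similarity, while a sparse vector is a term-to-weight map scored by a dot product over shared terms. The toy vectors and the 50/50 blend below are illustrative assumptions, not output of any real model.

```python
import math

def dense_sim(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

def sparse_sim(q, d):
    """Dot product of two {term: weight} sparse vectors; only shared terms count."""
    return sum(w * d[t] for t, w in q.items() if t in d)

# Toy query/document pair in both representations.
dense_q, dense_d = [0.2, 0.7, 0.1], [0.25, 0.65, 0.05]   # dense-model style
sparse_q = {"green": 1.2, "shorts": 0.9}                 # BM25/SPLADE style
sparse_d = {"green": 1.0, "shorts": 1.1, "dinosaur": 0.8}

# A hybrid model blends the two signals.
hybrid = 0.5 * dense_sim(dense_q, dense_d) + 0.5 * sparse_sim(sparse_q, sparse_d)
```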
Image Embedding
Deep convolutional neural networks (e.g., ResNet) extract visual features, mapping images from raw pixel space to comparatively low-dimensional vectors that encode visual semantics.
Vector Retrieval
Vector retrieval finds the K‑nearest neighbors (KNN) of a query vector in a high‑dimensional space.
Euclidean Distance (L2 norm): smaller distance means higher similarity. Normalized score: 1 / (1 + L2_norm^2).
Dot Product: equivalent to cosine similarity when the vectors are normalized to unit length.
Cosine Similarity: measures the angle between vectors, ranging from -1 to 1.
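The three metrics, including the normalized L2 score above, can be written out in a few lines (a plain-Python sketch; production engines compute these over quantized indexes, not raw lists):

```python
import math

def l2_score(q, d):
    """Normalized L2 score: 1 / (1 + ||q - d||^2); higher means more similar."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, d))
    return 1.0 / (1.0 + sq_dist)

def dot_product(q, d):
    return sum(a * b for a, b in zip(q, d))

def cosine(q, d):
    """Angle-based similarity in [-1, 1]; equals dot product on unit vectors."""
    return dot_product(q, d) / (
        math.sqrt(dot_product(q, q)) * math.sqrt(dot_product(d, d))
    )
```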
Elasticsearch Vector Support
Field Types: dense_vector for dense vectors, sparse_vector for sparse vectors, semantic_text for automatically mapped vectors.
Inference API: calls external AI models (e.g., embedding models) during indexing or querying.
Ingest Pipeline: text_embedding or inference processors convert text fields to vectors on the fly.
KNN Search: native approximate nearest-neighbor API on dense_vector fields.
Hybrid Search: combines match queries with KNN vector search; scores are fused using Reciprocal Rank Fusion (RRF) for robust ranking.
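RRF itself is rank-based: each document scores the sum of 1 / (k + rank) over every result list it appears in, with k defaulting to 60 in Elasticsearch. A minimal reimplementation for intuition (Elasticsearch does this server-side):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    rankings: list of ranked doc-id lists, best first; ranks are 1-based.
    A document absent from a list contributes nothing for that list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF sidesteps the score-scale mismatch between BM25 and vector similarity entirely, which is why it is a robust default for hybrid ranking.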
Performance Optimization – Quantization
Scalar Quantization (SQ): maps float32 values to int8 (1 byte) or int4 (0.5 bytes) per dimension, shrinking vectors to 1/4 or 1/8 of their original size.
Better Binary Quantization (BBQ): compresses further, reducing memory usage by up to 95% while maintaining acceptable recall.
Example: a dataset of 10 billion 1024-dimensional float32 vectors (~37 TB) can be reduced to ~1.8 TB with BBQ + HNSW indexing, cutting the required compute nodes from 170 to 9.
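The int8 case can be sketched with a simple symmetric scheme: map [-m, m] (m = the largest absolute value) linearly onto [-127, 127] and keep the scale for approximate reconstruction. This is a minimal illustration only; real engines such as Elasticsearch fit quantization parameters per segment and compensate for error at query time.

```python
def quantize_int8(vec):
    """Symmetric scalar quantization of one float vector to int8 codes.

    Stores 1 byte per dimension instead of 4 -- the 1/4 size reduction of
    SQ-int8. (For scale: 10 billion 1024-d float32 vectors take
    10e9 * 1024 * 4 bytes ~= 41 TB, i.e. ~37 TiB, before any quantization.)
    """
    m = max(abs(v) for v in vec) or 1.0
    scale = m / 127.0
    codes = [round(v / scale) for v in vec]   # ints in [-127, 127]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]
```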
Best Practice with AI Search Open Platform & Elasticsearch Serverless
Overall Architecture
Data Source: product data stored in RDS (ID, text description, image URL).
AI Search Open Platform:
Offline data service extracts data from RDS.
Multimodal vector service calls built‑in AI models (e.g., M2‑Encoder, Qwen2‑VL) to generate unified multimodal vectors.
Processed text and vectors are written to Elasticsearch Serverless.
Elasticsearch Serverless stores both text and vector indexes and handles online queries.
Online Query Flow:
User submits a text or image query.
The query is vectorized by the AI Search platform.
Vectorized query is sent to Elasticsearch Serverless for multi‑path recall (text + vector).
Top‑N results are returned to the user.
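The multi-path recall step in this flow could be expressed as a single Elasticsearch search request that pairs a lexical match with a kNN vector query and lets RRF fuse them server-side (retriever syntax, Elasticsearch 8.14+). Index and field names below (`title`, `image_vector`) are hypothetical, and the query vector would come from the platform's vectorization step.

```python
def build_hybrid_request(query_text, query_vector, size=10):
    """Build an Elasticsearch search body fusing lexical and vector recall via RRF."""
    return {
        "size": size,
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Path 1: traditional full-text match on the product title.
                    {"standard": {"query": {"match": {"title": query_text}}}},
                    # Path 2: approximate kNN over the image embedding field.
                    {"knn": {
                        "field": "image_vector",
                        "query_vector": query_vector,
                        "k": 50,
                        "num_candidates": 200,
                    }},
                ]
            }
        },
    }
```

The body would be sent as-is via the `_search` endpoint (or the official client's `search` method) against the Serverless index.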
AI Search Open Platform Features
Supports various data sources (OSS, MySQL, Hudi, Iceberg, MaxCompute).
Provides micro‑services for document parsing, multimodal parsing, vectorization, reranking, LLM inference, and agent capabilities.
Integrates with LangChain, LlamaIndex, and vector databases such as Milvus, Havenask, and Elasticsearch.
Elasticsearch Serverless Advantages
Zero Operations: a fully managed proxy hides cluster details; version upgrades are automatic.
Cost-Effective: pay-per-use compute-unit (CU) billing with second-level granularity.
High Elasticity: automatic scaling and resource adjustment based on load.
Seamless AI Model Integration: built-in models via the Inference API, plus support for custom external models.
Vector Optimizations: intelligent vector field filtering, default quantization (int8 or BBQ), and automatic pre-warming of HNSW and quantized indexes.
Key Takeaways
The architecture combines precise text matching with semantic and visual similarity via vector search, leverages modern embedding models, and uses Elasticsearch’s native vector capabilities. Quantization techniques (SQ and BBQ) make large‑scale vector retrieval memory‑efficient, while AI Search Open Platform and Elasticsearch Serverless provide a fully managed, scalable, and cost‑effective production environment for multimodal product search.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
