From Text to Images: Building Multi‑Modal Product Search with Elasticsearch Serverless
The article traces the evolution of e‑commerce search from simple keyword matching to multi‑modal retrieval and describes a generic architecture that fuses text and image embeddings. It then covers the core techniques (dense, sparse, and hybrid embedding models, vector similarity metrics, and quantization methods such as SQ and BBQ) and shows how Elasticsearch Serverless provides a fully managed, cost‑effective platform for the end‑to‑end solution.
Search Scenario Evolution
Traditional e‑commerce search relies on keyword matching, which cannot satisfy user needs such as searching by image (e.g., a unique hair‑dryer design) or retrieving visual attributes like color and pattern that are missing from textual titles.
General Multi‑Modal Product Retrieval Architecture
The solution consists of three layers:
Data Processing Layer
Text metadata processing : ingest structured fields (title, description, category, tags) from the product database, tokenize them, and build an inverted index in a text engine.
Image processing : use a multimodal large model to generate a descriptive caption for each image (e.g., "a green children’s short‑sleeve top and shorts set with a cartoon dinosaur"), then treat the caption as additional text metadata.
Embedding step : convert both text and image data into high‑dimensional vectors using an embedding model and store the vectors in a vector engine.
Query & Retrieval Layer
Text query : match keywords in the text engine and also embed the query text to retrieve similar vectors.
Image query : embed the uploaded image into a query vector and perform nearest‑neighbor search in the vector engine.
Fusion & Ranking Layer
Combine results from the text engine and vector engine with a Rerank module that weighs textual relevance scores and vector similarity scores.
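Use Reciprocal Rank Fusion (RRF) to merge the rankings. RRF looks only at each item's rank position in every result list, so it is unaffected by the different score scales of the text engine and the vector engine.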
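As an illustration of rank-based fusion, here is a minimal RRF sketch in Python. The document IDs and the two input rankings are made up, and the constant k = 60 is a commonly used default rather than a value taken from the article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs using rank positions only."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw engine scores are never used.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the text engine and the vector engine.
text_hits = ["sku-3", "sku-1", "sku-7"]
vector_hits = ["sku-1", "sku-9", "sku-3"]

fused = reciprocal_rank_fusion([text_hits, vector_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Items that appear near the top of both lists (here sku-1 and sku-3) rise above items that only one engine returned, regardless of how each engine scaled its scores.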
Key Technology 1: Embedding (Vectorization)
Embedding maps unstructured data (text, image) to machine‑readable vectors.
Dense models (e.g., Word2vec, S‑BERT, LLM‑based encoders) produce dense vectors where most dimensions are non‑zero, capturing semantic similarity ("king" vs "queen").
Sparse models (e.g., BM25, SPLADE) generate sparse vectors with only a few non‑zero entries, preserving exact term matching.
Hybrid models combine dense and sparse vectors to get both semantic matching and exact term matching; the article reports that they score best on benchmark tests.
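The difference between the two vector styles can be sketched with toy data. All vectors and weights below are invented for illustration, and the weighted sum at the end is just one simple way to combine the two signals; it is an assumption, not the method of any particular hybrid model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def sparse_dot(a, b):
    """Dot product of sparse vectors stored as {term: weight} dicts.

    Only terms present in both vectors contribute, which is why sparse
    representations preserve exact term matching.
    """
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def hybrid_score(dense_score, sparse_score, alpha=0.5):
    """Hypothetical linear blend of a dense and a sparse score."""
    return alpha * dense_score + (1 - alpha) * sparse_score


# Dense: every dimension carries some signal (made-up 4-d vectors).
dense_king = [0.8, 0.6, 0.1, 0.3]
dense_queen = [0.7, 0.7, 0.2, 0.3]
dense_sim = cosine(dense_king, dense_queen)

# Sparse: only a few terms have non-zero weight (made-up weights).
sparse_query = {"hair": 1.2, "dryer": 1.5}
sparse_doc_a = {"hair": 0.9, "dryer": 1.1, "ionic": 0.4}
sparse_doc_b = {"blow": 1.0, "styler": 0.8}
score_a = sparse_dot(sparse_query, sparse_doc_a)
score_b = sparse_dot(sparse_query, sparse_doc_b)

print(round(dense_sim, 3), round(score_a, 2), score_b)
```

The second sparse document shares no terms with the query and scores zero, even though a dense model might still consider it semantically close. That gap is what hybrid approaches aim to close.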
Key Technology 2: Vector Retrieval
Vector retrieval finds the K nearest neighbors of a query vector in a high‑dimensional space.
Distance metrics : Euclidean distance (L2 norm) and cosine similarity (dot product after normalization). The article shows the conversion formula for Euclidean distance to a normalized score: 1 / (1 + L2_norm^2).
Elasticsearch support : field types dense_vector (stores dense vectors) and sparse_vector (stores sparse vectors). The semantic_text type can automatically map text to the appropriate vector type via an inference model.
Hybrid search : Elasticsearch can execute match (text) and KNN (vector) queries in a single request, then fuse the scores.
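To make the metrics concrete, the following sketch runs an exact (brute-force) K-nearest-neighbor search over a tiny in-memory catalog. It uses the 1 / (1 + L2_norm^2) conversion quoted above; the catalog contents and function names are hypothetical, and a production vector engine would use an approximate index such as HNSW instead of scanning every vector.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_score(a, b):
    """Convert L2 distance to a score in (0, 1]: 1 / (1 + distance^2)."""
    return 1.0 / (1.0 + l2_distance(a, b) ** 2)


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalizing each to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn(query, vectors, k, score_fn):
    """Exact top-k search: score every stored vector and keep the best k."""
    scored = [(doc_id, score_fn(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical 2-d product vectors.
catalog = {
    "sku-1": [1.0, 0.0],
    "sku-2": [0.0, 1.0],
    "sku-3": [2.0, 0.0],
}
query = [1.0, 0.0]

top_l2 = knn(query, catalog, 2, l2_score)
top_cos = knn(query, catalog, 2, cosine_similarity)
print(top_l2)
print(top_cos)
```

Note that sku-3 points in the same direction as the query but is twice as long: cosine similarity treats it as a perfect match, while the L2-based score penalizes the difference in magnitude.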
Performance Optimization: Quantization
Vector quantization reduces memory consumption for large‑scale vector stores.
Scalar Quantization (SQ) : converts float32 vectors (4 bytes per dimension) to int8 (1 byte per dimension) or int4 (half a byte per dimension), cutting vector memory to 1/4 or 1/8 of the original.
Better Binary Quantization (BBQ) : a technique from Nanyang Technological University that further compresses vectors, cutting memory usage by up to 95 % and making billion‑scale vector retrieval feasible. The trade‑off is a slight drop in recall, which can be mitigated by increasing the num_candidates parameter.
Impact example : a dataset of 10 billion 1024‑dimensional float32 vectors (~37 TB at 4 KB per vector) can be reduced to ~1.8 TB after applying BBQ + HNSW indexing, shrinking the required compute nodes from 170 to 9.
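The storage figures in this section can be sanity-checked with a short sketch. The quantizers below are deliberately simplified: the scalar one is a plain min-max mapping to one byte, and the binary one keeps only the sign bit. BBQ itself stores additional correction terms that are not modeled here, so treat this as an illustration of the idea rather than the actual algorithm. A 1024-dimensional float32 vector takes 4,096 bytes, so a raw size near 37 TB corresponds to roughly 10 billion vectors.

```python
def scalar_quantize(vec, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255 (one byte)."""
    scale = (hi - lo) / 255
    return [round((min(max(x, lo), hi) - lo) / scale) for x in vec]


def scalar_dequantize(codes, lo, hi):
    """Approximate reconstruction of the original floats from byte codes."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in codes]


def binary_quantize(vec):
    """Keep one bit per dimension: 1 if the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def raw_bytes(num_vectors, dims, bits_per_dim):
    """Bytes needed for the vector values alone (no index structures)."""
    return num_vectors * dims * bits_per_dim // 8


vec = [-1.0, -0.3, 0.2, 1.0]
codes = scalar_quantize(vec, lo=-1.0, hi=1.0)
restored = scalar_dequantize(codes, lo=-1.0, hi=1.0)
bits = binary_quantize(vec)

TIB = 1024 ** 4
n, dims = 10_000_000_000, 1024            # 10 billion vectors, 1024 dimensions
float32_tib = raw_bytes(n, dims, 32) / TIB  # about 37.25
int8_tib = raw_bytes(n, dims, 8) / TIB      # about 9.31
one_bit_tib = raw_bytes(n, dims, 1) / TIB   # about 1.16

print(codes, bits)
print(round(float32_tib, 2), round(int8_tib, 2), round(one_bit_tib, 2))
```

One bit per dimension is 1/32 of float32, which is in line with the "up to 95 %" saving quoted above. The difference between about 1.16 TiB of raw bits and the reported ~1.8 TB is presumably index overhead such as the HNSW graph and per-vector correction data, which this sketch does not estimate.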
Best Practice with Elasticsearch Serverless
Elasticsearch Serverless is a fully managed offering that hides clusters, nodes, and shards from the user.
Zero‑ops : automatic version upgrades, security patches, and resource scaling.
Pay‑as‑you‑go : billing by Compute Units (CU) per second, eliminating fixed‑term contracts.
Auto‑scaling : resources expand or shrink based on real‑time load; index replicas and throttling thresholds are adjusted automatically.
AI integration : the built‑in Inference API can call AI models (e.g., M2‑Encoder, Qwen2‑VL) directly from ES, and custom external models can be plugged in via API configuration.
Vector optimizations : default exclusion of vector fields from _source to save storage, one‑click enablement of int8 or BBQ quantization, and automatic pre‑warming of HNSW and quantized index files to reduce cold‑start latency.
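For readers who want to see what these settings look like when written out by hand, here is a sketch that builds an index mapping and a combined match + kNN request body as plain Python dictionaries. It assumes the service accepts standard Elasticsearch mapping and search syntax; the field names (title, image_vector), the dimension count, and the helper function are hypothetical, and the serverless console may apply some of these options for you.

```python
import json

DIMS = 1024  # hypothetical embedding size

# Mapping sketch: quantized HNSW index, vector excluded from _source.
index_mapping = {
    "mappings": {
        "_source": {"excludes": ["image_vector"]},
        "properties": {
            "title": {"type": "text"},
            "image_vector": {
                "type": "dense_vector",
                "dims": DIMS,
                "index": True,
                "similarity": "cosine",
                # "int8_hnsw" selects scalar quantization; "bbq_hnsw" selects BBQ.
                "index_options": {"type": "int8_hnsw"},
            },
        },
    }
}


def hybrid_search_body(query_text, query_vector, k=10, num_candidates=100):
    """Build one request that runs a text match and a kNN search together.

    Raising num_candidates makes each shard examine more neighbors, which
    can recover recall lost to quantization at the cost of latency.
    """
    return {
        "query": {"match": {"title": query_text}},
        "knn": {
            "field": "image_vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "size": k,
    }


body = hybrid_search_body("hair dryer", [0.0] * DIMS, k=5, num_candidates=200)
print(json.dumps(index_mapping, indent=2)[:200])
```

Both dictionaries can be serialized with json.dumps and sent with any HTTP or Elasticsearch client; no client library is assumed here.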
Demo Overview
The end‑to‑end demo shows how product data stored in an RDS instance is extracted, processed by the AI Search Open Platform to generate multimodal vectors, written into Elasticsearch Serverless, and finally queried via a front‑end application that supports both text and image inputs. The demo highlights the seamless flow from data ingestion to vector indexing, retrieval, reranking, and result presentation.
Overall, the article demonstrates that by combining modern embedding techniques, efficient vector retrieval, and the operational simplicity of Elasticsearch Serverless, developers can build powerful, cost‑effective multimodal product search systems.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
