From Text to Images: Building Multimodal Product Search with Elasticsearch Serverless
The article traces the shift from keyword‑based to multimodal e‑commerce search, outlines a generic architecture that combines text and image embeddings with vector retrieval, and shows how Elasticsearch Serverless and the Alibaba Cloud AI Search Open Platform enable a low‑cost, scalable, high‑performance product search solution.
Search Scenario Evolution
Traditional keyword search struggles with queries that require visual understanding, such as finding a product from a photo of a unique hair‑dryer or capturing visual attributes like color and pattern that are missing from textual titles.
General Multimodal Product Search Architecture
The solution consists of three layers:
Data Processing Layer
Structured Text: Extract fields such as title, description, category, and tags from the product database and index them with a traditional text engine.
Unstructured Images: Use a multimodal large model to generate descriptive text for each image, then treat the description as additional structured data.
Embedding: Convert both text and image content into high‑dimensional vectors (dense or sparse) and store them in a vector engine.
Query & Retrieval Layer
Text Query: Perform exact keyword matching in the text engine and semantic matching by converting the query to a vector.
Image Query: Encode the uploaded image into a query vector and retrieve visually similar items from the vector engine.
Fusion & Ranking Layer
Results from the text and vector engines are merged by a Rerank module that combines textual relevance scores with vector similarity scores. The final ordering can be refined with Reciprocal Rank Fusion (RRF) to balance the two sources.
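The RRF step described above can be sketched in a few lines of Python. The constant k=60 is the commonly used default rather than a value from the article, and the product IDs are made up for illustration:

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF): each engine contributes
# 1 / (k + rank) per document, and fused scores decide the final order.
def rrf(rankings, k=60):
    """Merge several ranked lists of doc IDs into one fused ordering."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_hits = ["p3", "p1", "p7"]    # order from the text engine
vector_hits = ["p1", "p9", "p3"]  # order from the vector engine
fused = rrf([text_hits, vector_hits])
```

Documents that rank well in both lists (here "p1" and "p3") rise to the top, which is exactly the balancing behaviour the fusion layer relies on.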
Key Technologies
1. Embedding (Vectorization)
Embedding maps unstructured data to machine‑readable vectors. Three model families are discussed:
Dense Models (e.g., Word2Vec, S‑BERT, LLM‑based encoders) produce dense vectors that capture deep semantic similarity.
Sparse Models (e.g., BM25, SPLADE) generate high‑dimensional sparse vectors that retain exact term matching capabilities.
Hybrid Models combine dense and sparse representations to achieve both semantic generalization and precise keyword matching.
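The two representation styles above can be made concrete with a toy example; the numbers and terms are invented purely for illustration:

```python
# A dense vector populates every dimension with a float, while a sparse
# vector stores only the non-zero term weights (as SPLADE-style models do).
dense = [0.12, -0.03, 0.88, 0.41]                  # every dimension filled
sparse = {"hair": 1.7, "dryer": 2.3, "red": 0.9}   # term -> weight, rest zero

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as term-weight dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)
```

The sparse form keeps exact-term signals: only dimensions (terms) shared by query and document contribute to the score.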
2. Vector Retrieval
Similarity is measured with distance metrics such as Euclidean distance, dot product, and cosine similarity. Elasticsearch now supports native dense_vector and sparse_vector field types, KNN search APIs, and hybrid search that runs match queries together with KNN in a single request.
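The three distance metrics named above are simple to state in code; plain Python lists stand in for real embedding vectors here:

```python
import math

def dot(a, b):
    """Dot product: large when vectors point the same way and are long."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two points in embedding space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Angle-based similarity in [-1, 1], ignoring vector magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = [0.1, 0.9]  # hypothetical query embedding
d = [0.2, 0.8]  # hypothetical document embedding
```

Cosine similarity is the usual choice for embeddings because it normalizes away vector length and compares direction only.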
3. Quantization for Performance
To reduce the memory footprint of high‑dimensional float32 vectors, scalar quantization (SQ) maps values to int8 or int4. The BBQ (Better Binary Quantization) technique, which builds on quantization research from Nanyang Technological University, compresses vectors further, cutting memory usage by up to 95% while preserving recall through oversampling (e.g., raising num_candidates during kNN search).
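The memory trade-off behind scalar quantization can be illustrated with a deliberately simplified sketch; real engines calibrate the scale far more carefully, but the idea of mapping 4-byte floats to 1-byte integers is the same:

```python
# Simplified int8 scalar quantization: scale each vector so its largest
# absolute value maps to 127, then round. This is an illustration of the
# principle, not Elasticsearch's actual calibration scheme.
def quantize_int8(vec):
    scale = max(abs(v) for v in vec) / 127.0 or 1.0  # avoid divide-by-zero
    return [round(v / scale) for v in vec], scale

def dequantize(quantized, scale):
    return [v * scale for v in quantized]

v = [0.12, -0.5, 0.33, 0.9]
q, s = quantize_int8(v)          # 1 byte per dimension instead of 4
restored = dequantize(q, s)      # approximate reconstruction
```

Each dimension shrinks from 4 bytes to 1 (a 75% saving), at the cost of a small rounding error bounded by half the scale; binary schemes like BBQ push the compression further still.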
Elasticsearch Serverless
Elasticsearch Serverless offers a fully managed, auto‑scaling search service. Its advantages include:
Zero Operations: No cluster provisioning, version upgrades, or patch management.
Pay‑as‑You‑Go: Billing is based on actual compute units (CU) consumed, measured per second.
Automatic Scaling: Resources expand or shrink transparently according to traffic.
Built‑in Vector Support: Native vector field types, automatic int8 or BBQ quantization, and pre‑warming of HNSW indexes to avoid cold‑start latency.
AI Model Integration: The Inference API can call built‑in models (e.g., M2‑Encoder, Qwen2‑VL) or custom external models directly from the indexing pipeline.
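To make the vector field support above concrete, here is a sketch of an index mapping built as a Python dict. The field names, the 768-dimension size, and the choice of the int8-quantized HNSW index option are assumptions for illustration, not the article's exact configuration:

```python
# Hedged sketch of an Elasticsearch index mapping for multimodal product
# search: a text field for keyword recall plus a quantized dense_vector
# field for image-embedding recall. "int8_hnsw" is one of the quantized
# index options Elasticsearch offers for dense_vector fields.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "category": {"type": "keyword"},
            "image_vector": {
                "type": "dense_vector",
                "dims": 768,                          # assumed model dim
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            },
        }
    }
}
```

With a mapping like this, quantization happens at index time and the float vectors never need to be held in memory for search.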
Alibaba Cloud AI Search Open Platform
The platform provides a one‑stop AI search solution with layered architecture:
Data Sources: Connect to OSS, MySQL, Hudi, Iceberg, MaxCompute, etc.
Offline Data Service: Extract product data from RDS.
Multimodal Vector Service: Apply built‑in AI models to generate unified multimodal vectors.
Online Query Service: The front end sends text or image queries, which are vectorized by the AI platform and routed to Elasticsearch Serverless for multi‑path recall.
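A multi-path recall request of this kind can be sketched as one hybrid search body that combines a keyword match with a kNN clause, as the earlier retrieval section describes. The field names and the query vector below are placeholders, not values from the article:

```python
# Hedged sketch of a hybrid search body: Elasticsearch accepts a top-level
# "knn" clause alongside a regular "query" in a single request, so lexical
# and vector recall run in one round trip. The embedding is a dummy vector.
query_vec = [0.1] * 768  # in practice, produced by the embedding service

hybrid_query = {
    "query": {"match": {"title": "red floral hair dryer"}},  # text path
    "knn": {                                                 # vector path
        "field": "image_vector",
        "query_vector": query_vec,
        "k": 10,
        "num_candidates": 100,  # oversampling to protect recall
    },
}
```

Recent Elasticsearch releases can also fuse the two result sets server-side with RRF, which matches the fusion-and-ranking layer described earlier.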
End‑to‑End Demo
The demo walks through data ingestion, vector generation, index creation in Elasticsearch Serverless, and real‑time query handling, illustrating how the components work together to deliver a fast, accurate multimodal product search experience.
Overall, the article shows how combining embedding, vector retrieval, quantization, and a serverless search backend enables developers to build powerful multimodal search systems with minimal operational overhead.
This article has been distilled and summarized from source material and republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.