From Text to Images: Building Multimodal Product Search with Elasticsearch Serverless

This article walks through a complete multimodal product search solution, explaining how embedding and vector retrieval technologies—combined with Elasticsearch Serverless and Alibaba Cloud AI Search—enable image‑based and semantic queries, detailing the architecture, key algorithms, quantization tricks, and practical deployment steps.

Evolution of Search Scenarios

Traditional e‑commerce search relies on keyword matching, but users now expect to search by images or natural‑language descriptions of complex scenes. Pure text search cannot capture visual attributes such as color, pattern, or brand details, leading to gaps that multimodal and cross‑modal search aim to fill.

General Multimodal Product Search Architecture

The solution consists of three layers:

Data Processing Layer

Text Metadata: Extract title, description, category, tags, etc., from the product database and index them in a traditional text engine.

Image Understanding: Use a multimodal large model to generate descriptive text for each image (e.g., "green short‑sleeve kids' shorts with a cartoon dinosaur print"). This description is then tokenized and indexed like regular text.

Embedding: Convert both textual and visual data into high‑dimensional numeric vectors (embeddings) and store them in a vector engine.

Query & Recall Layer

Text Query: Direct keyword matching in the text engine and semantic matching by converting the query into a vector via embedding.

Image Query: Upload an image, embed it into a vector, and perform similarity search in the vector engine.

Fusion & Ranking Layer

Results from the text engine and vector engine are merged and re‑ranked by a Rerank module using combined relevance scores.

The final Top‑N list is returned to the user.
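To make the three layers concrete, here is a minimal sketch of what a single indexed product document might look like; every field name and value is an illustrative assumption, not the actual schema from the talk.

```python
# Minimal sketch of one indexed product document; all names/values assumed.
product_doc = {
    "product_id": "sku-12345",
    # Data Processing Layer: text metadata extracted from the product database
    "title": "Kids' dinosaur shorts",
    "category": "children/apparel",
    "tags": ["short-sleeve", "green", "cartoon"],
    # Image Understanding: caption generated by a multimodal large model
    "image_caption": "green short-sleeve kids' shorts with a cartoon dinosaur print",
    # Embedding: unified multimodal vector over text + image (truncated here)
    "multimodal_vector": [0.012, -0.087, 0.134],
}
```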

Key Technologies

1. Embedding (Vectorization)

Embedding maps unstructured data (text, images) to structured numeric vectors. Three model families are commonly used:

Dense models (e.g., Word2Vec, Sentence‑BERT, LLM‑based encoders) produce dense vectors where most dimensions are non‑zero, capturing deep semantic similarity.

Sparse models (e.g., BM25, SPLADE) generate sparse vectors with only a few non‑zero entries, emphasizing exact term matching.

Hybrid models combine dense and sparse representations, delivering both semantic breadth and keyword precision; on retrieval benchmarks this combination typically outperforms either approach alone.
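As a toy illustration, not tied to any particular model, the two representation styles look like this:

```python
# Toy illustration of the two representation styles; the values are made up.
# Dense: a fixed-size vector where nearly every dimension carries signal.
dense = [0.21, -0.07, 0.93, 0.11, -0.44]

# Sparse: a vocabulary-sized space with only a few non-zero weights,
# conveniently stored as a term -> weight map (all other terms are zero).
sparse = {"dinosaur": 2.1, "shorts": 1.7, "kids": 0.9}

# A hybrid system indexes both and fuses their scores at query time, pairing
# the semantic reach of the dense vector with the exact-match precision of
# the sparse one.
```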

2. Vector Retrieval

Vector search finds the K nearest neighbors (KNN) of a query vector in a high‑dimensional space. Common similarity measures include:

Euclidean Distance (L2): Converted to a normalized score via 1 / (1 + l2_norm²), so a smaller distance yields a higher score.

Dot Product: Equivalent to cosine similarity when vectors are unit‑normalized; a larger dot product means higher similarity.

Cosine Similarity: Ranges from -1 to 1, with 1 indicating identical direction.
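A small worked example, assuming NumPy, shows how the three measures relate; the L2‑to‑score conversion mirrors the formula above.

```python
import numpy as np

q = np.array([0.3, 0.8, 0.5])   # query vector
d = np.array([0.2, 0.9, 0.4])   # document vector

l2 = np.linalg.norm(q - d)
l2_score = 1.0 / (1.0 + l2 ** 2)   # smaller distance -> higher score

dot = float(np.dot(q, d))          # unbounded; equals cosine * |q| * |d|

cosine = dot / (np.linalg.norm(q) * np.linalg.norm(d))  # in [-1, 1]

# For unit-normalized vectors, dot product and cosine similarity coincide:
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)
assert abs(float(np.dot(qn, dn)) - cosine) < 1e-9
```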

Elasticsearch provides native KNN APIs for dense_vector fields and supports hybrid search that combines match queries with KNN. To fuse scores from heterogeneous sources, Elasticsearch implements Reciprocal Rank Fusion (RRF), which aggregates rankings rather than raw scores.
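As a hedged sketch (the index name, field names, and query vector are assumptions), a hybrid text‑plus‑KNN query fused with RRF might look like this with the Python client, using the retriever syntax available in recent Elasticsearch releases:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")

query_vector = [0.012, -0.087, 0.134]  # embedding of the user query (truncated)

resp = es.search(
    index="products",
    size=10,
    retriever={
        "rrf": {  # Reciprocal Rank Fusion over the two recall paths
            "retrievers": [
                # Lexical recall against the text engine
                {"standard": {"query": {"match": {"title": "kids dinosaur shorts"}}}},
                # Semantic recall against the vector engine
                {
                    "knn": {
                        "field": "multimodal_vector",
                        "query_vector": query_vector,
                        "k": 10,
                        "num_candidates": 100,
                    }
                },
            ]
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```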

3. Performance Optimization via Quantization

High‑dimensional float32 vectors (e.g., 1024‑dim) consume large amounts of memory. Quantization reduces storage while preserving recall:

Scalar Quantization (SQ): Maps each float to an int8 (1 byte) or int4 (0.5 byte) by linearly scaling values within the segment's min‑max range, cutting memory to roughly 1/4 or 1/8 of the float32 footprint.

Better Binary Quantization (BBQ): Compresses each dimension down to roughly one bit, achieving up to 95% memory reduction at the cost of a controllable recall drop. Increasing num_candidates in KNN queries can mitigate the loss.
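A hedged sketch of how a dense_vector field can opt into each scheme through its mapping; the index_options type values follow recent Elasticsearch releases (int8_hnsw has been the default for indexed dense vectors, bbq_hnsw is newer), and the field names are assumptions.

```python
# Hedged sketch: dense_vector mappings opting into each quantization scheme.
mappings = {
    "properties": {
        "vec_int8": {
            "type": "dense_vector",
            "dims": 1024,
            "index_options": {"type": "int8_hnsw"},  # scalar quantization, ~1/4 memory
        },
        "vec_bbq": {
            "type": "dense_vector",
            "dims": 1024,
            "index_options": {"type": "bbq_hnsw"},   # Better Binary Quantization
        },
    }
}
# es.indices.create(index="products-quantized", mappings=mappings)
```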

For a 10‑billion‑vector, 1024‑dim float32 dataset (≈37 TB), applying BBQ and HNSW indexing reduces total memory to ~1.8 TB and cuts the required compute nodes from 170 to 9, dramatically lowering cost.
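Back‑of‑the‑envelope arithmetic, assuming one bit per dimension for BBQ and a rough per‑vector HNSW link budget, roughly reproduces those figures:

```python
# Rough reproduction of the figures above; the HNSW overhead (16 links of
# 4 bytes per vector) is an assumed ballpark, not an exact accounting.
n, dims = 10_000_000_000, 1024

float32_bytes = n * dims * 4   # raw float32 vectors
bbq_bytes = n * dims // 8      # BBQ at ~1 bit per dimension
hnsw_bytes = n * 16 * 4        # assumed graph-link overhead

TB = 1024 ** 4
print(f"float32: ~{float32_bytes / TB:.0f} TiB")               # ~37 TiB
print(f"BBQ + HNSW: ~{(bbq_bytes + hnsw_bytes) / TB:.1f} TiB")  # ~1.7 TiB
```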

Elasticsearch Vector Support

Elasticsearch now offers dedicated field types:

dense_vector: Stores dense vectors.

sparse_vector: Stores high‑dimensional sparse vectors efficiently.

semantic_text: An abstract type that automatically maps text to the appropriate vector representation based on the configured inference model.
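A hedged sketch of a mapping that mixes the three field types; the inference endpoint ID is a hypothetical name that would have to be configured beforehand.

```python
mappings = {
    "properties": {
        "title_vector": {"type": "dense_vector", "dims": 1024},
        "title_terms": {"type": "sparse_vector"},
        # semantic_text delegates chunking and embedding to the configured
        # inference endpoint at index and query time.
        "description": {
            "type": "semantic_text",
            "inference_id": "my-multimodal-embedding",  # hypothetical endpoint
        },
    }
}
```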

Two integration mechanisms simplify vectorization:

Inference API: Calls external AI models (e.g., M2‑Encoder, Qwen2‑VL) directly from Elasticsearch at indexing or query time.

Ingest Pipeline: Uses the inference processor (e.g., running a text_embedding task) to convert fields to vectors during ingestion, reducing pipeline complexity.
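A hedged sketch of such a pipeline; the pipeline name, endpoint ID, and field names are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")

# Vectorize a text field at index time via the inference processor.
es.ingest.put_pipeline(
    id="product-embedding-pipeline",
    processors=[
        {
            "inference": {
                "model_id": "my-multimodal-embedding",  # hypothetical inference endpoint
                "input_output": {
                    "input_field": "image_caption",
                    "output_field": "multimodal_vector",
                },
            }
        }
    ],
)

# Documents indexed through the pipeline pick up their vector automatically:
# es.index(index="products", pipeline="product-embedding-pipeline", document={...})
```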

Elasticsearch Serverless – Cloud‑Native Vector Search Service

Serverless abstracts cluster management behind a proxy that handles authentication, routing, and request rewriting. The platform automatically scales compute units (CU), provides per‑second billing, and offers built‑in monitoring of QPS and index traffic.

Key advantages:

Zero Operations: No need to manage nodes, shards, or version upgrades; the service always runs the latest stable kernel.

Cost Efficiency: The pay‑as‑you‑go CU model eliminates fixed‑price contracts and matches traffic spikes with automatic scaling.

Seamless AI Integration: All models from the Alibaba Cloud AI Search Open Platform are available via the Inference API. Custom or third‑party models can be plugged in through simple API configuration.

Vector Optimizations: Default quantization (int8 or BBQ) can be toggled without code changes; vector fields are excluded from _source to save storage and bandwidth; and automatic pre‑warming of HNSW indexes reduces cold‑start latency.

Best‑Practice End‑to‑End Workflow

Data Source: Product records (ID, text, image) reside in an RDS instance.

Offline Data Service: Extracts records from RDS.

Multimodal Vector Service: Calls AI Search Open Platform models (e.g., M2‑Encoder, Qwen2‑VL) to produce unified multimodal vectors.

Data Write: Writes both the original text fields and the generated vectors into Elasticsearch Serverless, populating a text engine and a vector engine.

Online Query: The front end sends a text or image query; the query is vectorized via the AI Search platform, then dispatched to Elasticsearch Serverless for simultaneous text and vector recall, followed by Rerank and the final Top‑N return.
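A hedged sketch of the offline write path ties the steps together; embed_multimodal() stands in for a call to the AI Search Open Platform (its real API is not shown here), and all index and field names are assumptions.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://localhost:9200")

def embed_multimodal(text: str, image_url: str) -> list[float]:
    """Placeholder for the multimodal vector service (e.g., M2-Encoder)."""
    raise NotImplementedError  # swap in the real platform call here

def product_actions(rows):
    # rows: product records extracted from the RDS instance
    for row in rows:
        yield {
            "_index": "products",
            "_id": row["product_id"],
            "title": row["title"],
            "image_caption": row["image_caption"],
            "multimodal_vector": embed_multimodal(row["title"], row["image_url"]),
        }

# helpers.bulk(es, product_actions(rows))  # write text fields and vectors together
```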

Demo Overview

The accompanying demo video shows the complete pipeline: ingesting product data, generating multimodal embeddings, indexing them in Elasticsearch Serverless, and performing both text‑based and image‑based searches with instant relevance feedback.

Overall, the article demonstrates how to leverage modern embedding techniques, efficient vector quantisation, and a fully managed serverless search service to build a high‑performance, cost‑effective multimodal product search system.
