Build an End-to-End Image-and-Text Search Engine with CLIP and ESCloud

This guide shows how to quickly create a complete image-and-text search solution using Volcano Engine's ESCloud, the CLIP model for feature extraction, and Python, covering data preparation, environment setup, index mapping, bulk indexing, and both text-to-image and image-to-image queries.

Volcano Engine Developer Services

Aug 11, 2023

Image search is widely used in e‑commerce, advertising, design, and search engines, allowing users to find matching or similar images by entering text descriptions or uploading pictures.

How It Works

The system extracts features from both images and text using the CLIP model, establishes a correspondence between them, and performs vector similarity search in a large image database to return the most relevant results. Feature extraction uses CLIP, while vector retrieval is powered by Volcano Engine's ESCloud.
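To make the retrieval step concrete, here is a minimal sketch of cosine-similarity ranking over a few toy vectors. It is a brute-force illustration only, not how ESCloud searches internally; the names `cosine_similarity`, `top_k`, and `toy_index`, as well as the 3-dimensional vectors, are assumptions made for this example (the CLIP embeddings used in this guide have 512 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical image embeddings, keyed by file name
toy_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # stands in for an encoded text or image query
results = top_k(query_vector, toy_index, k=2)
```

Because text and images are embedded in the same space, the same ranking function serves both text-to-image and image-to-image queries; only the encoder that produces `query_vector` changes.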

Environment Preparation

1. Log in to Volcano Engine Cloud Search, create an instance cluster and select version 7.10.

2. Install required Python dependencies:

pip install -U sentence-transformers
pip install -U elasticsearch7==7.10.1
pip install -U pandas

Dataset Preparation

We use the Unsplash Lite dataset (~25,000 photos). After downloading and unzipping the archive, the tab-separated (TSV) files, which include the image URLs, are read with pandas.

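import glob
import os

import pandas as pd

def read_imgset():
    # Directory that holds the unzipped dataset files
    path = '${downloaded_dataset_path}'
    documents = ['photos', 'keywords', 'collections', 'conversions', 'colors']
    datasets = {}
    for doc in documents:
        # A table may be split across several files matching <name>.tsv*
        files = glob.glob(os.path.join(path, doc + ".tsv*"))
        subsets = []
        for filename in files:
            df = pd.read_csv(filename, sep='\t', header=0)
            subsets.append(df)
        # Merge all parts of one table into a single DataFrame
        datasets[doc] = pd.concat(subsets, axis=0, ignore_index=True)
    return datasets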
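The indexing step later needs, for each photo, the keywords that users suggested. One possible way to build that lookup in a single pass is sketched below with the standard library only. The helper name `keywords_by_photo` and the inline sample rows are illustrative assumptions; the column names `photo_id`, `keyword`, and `suggested_by_user` follow the code in this article.

```python
import csv
import io
from collections import defaultdict

def keywords_by_photo(tsv_text):
    """Map photo_id -> sorted list of user-suggested keywords."""
    lookup = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # 't' marks keywords suggested by a user in the keywords table
        if row["suggested_by_user"] == "t":
            lookup[row["photo_id"]].add(row["keyword"])
    return {pid: sorted(words) for pid, words in lookup.items()}

# Made-up rows that mimic the layout of the keywords file
sample = (
    "photo_id\tkeyword\tsuggested_by_user\n"
    "p1\tbeach\tt\n"
    "p1\tsea\tt\n"
    "p1\tblur\tf\n"
    "p2\tforest\tt\n"
)
lookup = keywords_by_photo(sample)
```

Building the lookup once avoids filtering the whole keywords table again for every photo, which can matter at the scale of tens of thousands of rows.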

Model Selection

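The clip‑ViT‑B‑32 model (based on OpenAI's 2021 CLIP paper) is chosen for both image‑to‑image and text‑to‑image search because it embeds images and text in a shared vector space. Text queries are encoded with the companion clip‑ViT‑B‑32‑multilingual‑v1 model, as the code below shows.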

ESCloud Mapping Preparation

PUT image_search
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "photo_id": { "type": "keyword" },
      "photo_url": { "type": "keyword" },
      "describe": { "type": "text" },
      "photo_embedding": { "type": "knn_vector", "dimension": 512 }
    }
  },
  "settings": {
    "index": {
      "refresh_interval": "60s",
      "number_of_shards": "3",
      "knn.space_type": "cosinesimil",
      "knn": "true",
      "number_of_replicas": "1"
    }
  }
}
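The same mapping can also be created from Python instead of a console request. The sketch below is one way to do that under stated assumptions: `build_index_body` and `create_index` are hypothetical helper names, and `client` is assumed to be an `elasticsearch7.Elasticsearch` instance connected to the cluster. The `dimension` of 512 matches the output size of clip-ViT-B-32.

```python
def build_index_body(dimension=512, shards=3, replicas=1):
    """Return the index body shown in the mapping above as a Python dict."""
    return {
        "mappings": {
            "dynamic": "false",
            "properties": {
                "photo_id": {"type": "keyword"},
                "photo_url": {"type": "keyword"},
                "describe": {"type": "text"},
                "photo_embedding": {"type": "knn_vector", "dimension": dimension},
            },
        },
        "settings": {
            "index": {
                "refresh_interval": "60s",
                "number_of_shards": str(shards),
                "knn.space_type": "cosinesimil",
                "knn": "true",
                "number_of_replicas": str(replicas),
            }
        },
    }

def create_index(client, name="image_search"):
    """Create the index unless it already exists; return True if it was created."""
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=build_index_body())
    return True
```

Checking for the index first makes the setup script safe to rerun, since creating an index that already exists raises an error.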

ESCloud Database Operations

Connection

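Connect to the cloud search instance (CloudSearch is the Elasticsearch client class, imported under that alias in the Write section below):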

cloudSearch = CloudSearch("https://{user}:{password}@{ES_URL}", verify_certs=False, ssl_show_warn=False)
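Credentials embedded in a URL can break the connection string when they contain characters such as `@` or `:`. The following sketch percent-encodes them and reads the values from environment variables rather than hard-coding them. The helper names and the variable names `ES_USER`, `ES_PASSWORD`, and `ES_HOST` are assumptions for illustration, not part of ESCloud.

```python
import os
from urllib.parse import quote

def build_es_url(user, password, host, scheme="https"):
    """Build a connection URL with percent-encoded credentials."""
    # safe='' forces characters like '@', ':' and '/' to be encoded
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}"

def es_url_from_env(env=os.environ):
    """Read hypothetical ES_USER / ES_PASSWORD / ES_HOST variables."""
    return build_es_url(env["ES_USER"], env["ES_PASSWORD"], env["ES_HOST"])

# Placeholder values; example.invalid is a reserved, non-resolvable host name
example_url = build_es_url("admin", "p@ss:word", "example.invalid:9200")
```

The resulting string can be passed to `CloudSearch(...)` in place of the literal URL. Note that `verify_certs=False` in the connection call turns off TLS certificate verification, which is convenient for a quick trial but usually not what a production deployment wants.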

Write

from sentence_transformers import SentenceTransformer
from elasticsearch7 import Elasticsearch as CloudSearch
from PIL import Image
import requests, pandas as pd, glob

# Image encoder and the multilingual text encoder used for queries
img_model = SentenceTransformer('clip-ViT-B-32')
text_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

def encodedataset(photo_id, photo_url, describe, image):
    return {
        "photo_id": photo_id,
        "photo_url": photo_url,
        "describe": describe,
        # Convert the NumPy vector to a plain list for JSON serialization
        "photo_embedding": img_model.encode(image).tolist()
    }

def load_image(url_or_path):
    if url_or_path.startswith("http://") or url_or_path.startswith("https://"):
        return Image.open(requests.get(url_or_path, stream=True).raw)
    return Image.open(url_or_path)

def get_imgset_and_bulk(batch_size=20):
    datasets = read_imgset()
    keywords = datasets['keywords']
    docs = []
    submitted = 0
    for idx, row in datasets['photos'].iterrows():
        photo_url = row["photo_image_url"]
        photo_id = row["photo_id"]
        image = load_image(photo_url)
        # Keep only the keywords that users suggested for this photo
        matched = keywords.loc[(keywords['photo_id'] == photo_id) & (keywords['suggested_by_user'] == 't')]
        text = ' '.join(set(matched['keyword']))
        one_document = encodedataset(photo_id, photo_url, text, image)
        # A bulk body alternates an action entry with the document source
        docs.append({"index": {}})
        docs.append(one_document)
        if len(docs) >= 2 * batch_size:
            resp = cloudSearch.bulk(docs, index='image_search')
            print(resp)
            submitted += len(docs) // 2
            docs = []
    # Send the final partial batch, which the loop above does not flush
    if docs:
        resp = cloudSearch.bulk(docs, index='image_search')
        print(resp)
        submitted += len(docs) // 2
    return submitted

if __name__ == '__main__':
    submitted = get_imgset_and_bulk()
    print(submitted)
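The batching logic can also be separated from the encoding work so that it is easy to test without a cluster. The sketch below shows one way to do that; `iter_bulk_bodies` and `failed_items` are hypothetical helper names, and the response shape assumed in `failed_items` (a top-level `errors` flag plus an `items` list keyed by action type) is the standard Elasticsearch bulk response format.

```python
def iter_bulk_bodies(documents, batch_size=20):
    """Yield bulk bodies of at most batch_size documents each."""
    body = []
    for doc in documents:
        body.append({"index": {}})  # action entry
        body.append(doc)            # document source
        if len(body) == 2 * batch_size:
            yield body
            body = []
    if body:
        yield body  # final partial batch

def failed_items(bulk_response):
    """Return the per-item results that carry an error."""
    if not bulk_response.get("errors"):
        return []
    return [
        item["index"]
        for item in bulk_response.get("items", [])
        if "error" in item.get("index", {})
    ]

# Placeholder documents; real ones would come from encodedataset(...)
docs = [{"photo_id": str(i)} for i in range(45)]
bodies = list(iter_bulk_bodies(docs, batch_size=20))
```

Each yielded body can be passed to `cloudSearch.bulk(body, index='image_search')`, and checking `failed_items(resp)` afterwards surfaces documents that were rejected, which a plain `print(resp)` makes easy to miss.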

Query

Text‑to‑Image

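def extract_text(text):
    # Encode the query text, then run a k-NN search against the stored image vectors
    res = cloudSearch.search(
        body={
            "size": 5,
            "query": {"knn": {"photo_embedding": {"vector": text_model.encode(text).tolist(), "k": 5}}},
            "_source": ["describe", "photo_url"]
        },
        # Must match the index created and populated above
        index="image_search"
    )
    return res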

Image‑to‑Image

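def extract(img):
    # Encode the query image (a PIL image, e.g. from load_image) and search by vector
    res = cloudSearch.search(
        body={
            "size": 5,
            "query": {"knn": {"photo_embedding": {"vector": img_model.encode(img).tolist(), "k": 5}}},
            "_source": ["describe", "photo_url"]
        },
        # Must match the index created and populated above
        index="image_search"
    )
    return res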
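Both `extract_text` and `extract` return the raw search response. A small helper such as the one sketched below can flatten it into a list that is easier to display. The name `summarize_hits` and the sample response are assumptions for illustration; the `hits.hits`, `_score`, and `_source` keys follow the standard Elasticsearch search response layout.

```python
def summarize_hits(response):
    """Flatten a search response into a list of score/url/description dicts."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        {
            "score": hit.get("_score"),
            "photo_url": hit.get("_source", {}).get("photo_url"),
            "describe": hit.get("_source", {}).get("describe"),
        }
        for hit in hits
    ]

# Made-up response with the same shape a search call returns
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.91, "_source": {"photo_url": "https://example.invalid/a.jpg", "describe": "beach sea"}},
            {"_score": 0.87, "_source": {"photo_url": "https://example.invalid/b.jpg", "describe": "forest"}},
        ]
    }
}
summary = summarize_hits(sample_response)
```

For example, `summarize_hits(extract_text("sunset over the sea"))` would give up to five entries, in the order the search returned them.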

Volcano Engine's ESCloud is compatible with Elasticsearch, Kibana and common plugins, offering structured and unstructured text search, statistics, and reporting, with one‑click deployment, elastic scaling, and simplified operations for log analysis and information retrieval.
