Build a Hybrid Keyword‑Semantic Search with Volcengine Cloud Search

This guide explains how to combine traditional keyword search and vector‑based semantic search into a hybrid retrieval system using Volcengine Cloud Search, covering the underlying concepts, required components, workflow steps, and a complete Python implementation for an image search application.

Volcano Engine Developer Services

Traditional keyword search provides low latency and explainable results but ignores context, while semantic search embeds text, images, and videos as vectors so that similar content can be retrieved by vector distance. Combining the two calls for a hybrid search that runs each query clause independently, gathers shard-level scores, normalizes them, and merges the results into a single ranking.

Implementation Overview

Volcengine Cloud Search, built on open‑source Elasticsearch and OpenSearch, supports both full‑text and vector retrieval and offers out‑of‑the‑box hybrid search capabilities. The example demonstrates building an image search application.

Required Components

Full‑text search engine

Vector search engine

Machine‑learning model for embeddings

Data pipeline to convert media to vectors

Fusion ranking

Hybrid Search Workflow

Query stage: use hybrid query clauses for keyword and semantic search.

Score normalization and merging: normalize each clause’s scores (min_max, l2, rrf) and combine them (arithmetic_mean, geometric_mean, harmonic_mean).

Re‑rank documents based on the combined scores and return the final results.
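The fusion math in the normalization-and-merging step can be illustrated with a small sketch. This is not Cloud Search's internal code; the document IDs and rankings are invented, and `rank_constant=60` mirrors the search pipeline configured later in this guide:

```python
# Hypothetical per-clause rankings (document IDs ordered best-first).
keyword_ranking = ["d1", "d3", "d2", "d5"]
semantic_ranking = ["d2", "d1", "d4", "d3"]

def rrf_scores(ranking, rank_constant=60):
    # Reciprocal Rank Fusion: score = 1 / (rank_constant + rank), ranks from 1.
    return {doc: 1.0 / (rank_constant + r) for r, doc in enumerate(ranking, start=1)}

def min_max(scores):
    # Min-max normalization of raw clause scores into [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def combine(score_maps, weights):
    # Weighted arithmetic mean across clauses; a document missing
    # from a clause contributes 0 for that clause.
    docs = set().union(*score_maps)
    return {d: sum(w * m.get(d, 0.0) for w, m in zip(weights, score_maps))
            for d in docs}

kw = rrf_scores(keyword_ranking)
sem = rrf_scores(semantic_ranking)
merged = combine([kw, sem], weights=[0.4, 0.6])
final = sorted(merged, key=merged.get, reverse=True)
```

Documents ranked well by both clauses float to the top, which is exactly the re-ranking behavior the search pipeline below configures with `rrf` normalization and `arithmetic_mean` combination.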

Practical Steps

Environment Setup

Log into the Volcengine Cloud Search console, create an OpenSearch 2.9.0 cluster, and enable the AI node.

Dataset Preparation

Use the Amazon Berkeley Objects dataset. Load metadata, filter items with English titles, join with image metadata, and prepare a DataFrame for ingestion.
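The filter-and-join step can be sketched with toy rows. The columns `item_name_in_en_us` and `path` match the DataFrame used during ingestion later in this guide; `main_image_id` and `image_id` are assumed join keys, and the sample values are invented:

```python
import pandas as pd

# Toy rows standing in for the ABO listings and image metadata.
listings = pd.DataFrame([
    {"item_id": "B01", "item_name_in_en_us": "Leather running shoes", "main_image_id": "img1"},
    {"item_id": "B02", "item_name_in_en_us": None, "main_image_id": "img2"},
])
images = pd.DataFrame([
    {"image_id": "img1", "path": "ab/cd/img1.jpg"},
    {"image_id": "img2", "path": "ef/gh/img2.jpg"},
])

# Keep only items with an English title, then join in the image path.
dataset = (listings.dropna(subset=["item_name_in_en_us"])
           .merge(images, left_on="main_image_id", right_on="image_id"))
```

The resulting `dataset` has one row per English-titled item with its image path, ready for bulk upload.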

Install Python Dependencies

pip install -U elasticsearch7==7.10.1
pip install -U pandas
pip install -U jupyter
pip install -U requests
pip install -U s3fs
pip install -U alive_progress
pip install -U pillow
pip install -U ipython

Connect to OpenSearch

# Prepare OpenSearch connection
from elasticsearch7 import Elasticsearch as CloudSearch
from ssl import create_default_context

opensearch_domain = '{{ OPENSEARCH_DOMAIN }}'
opensearch_port = '9200'
opensearch_user = 'admin'
opensearch_pwd = '{{ OPENSEARCH_PWD }}'

model_remote_config = {
    "method": "POST",
    "url": "{{ REMOTE_MODEL_URL }}",
    "params": {},
    "headers": {"Content-Type": "application/json"},
    "advance_request_body": {"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"}
}

knn_dimension = 384
ssl_context = create_default_context(cafile='./ca.cer')
cloud_search_cli = CloudSearch(
    [{'host': opensearch_domain, 'port': int(opensearch_port)}],
    ssl_context=ssl_context,
    scheme="https",
    http_auth=(opensearch_user, opensearch_pwd))

index_name = 'index-test'
pipeline_id = 'remote_text_embedding_test'
search_pipeline_id = 'rrf_search_pipeline_test'

Create Ingest Pipeline

# Create ingest pipeline for remote embedding
pipeline_body = {
    "description": "text embedding pipeline for remote inference",
    "processors": [{
        "remote_text_embedding": {
            "remote_config": model_remote_config,
            "field_map": {"caption": "caption_embedding"}
        }
    }]
}
resp = cloud_search_cli.ingest.put_pipeline(id=pipeline_id, body=pipeline_body)
print(resp)

Create Search Pipeline

# Create search pipeline with normalization and combination
import requests
search_pipeline_body = {
    "description": "post processor for hybrid search",
    "request_processors": [{
        "remote_embedding": {"remote_config": model_remote_config}
    }],
    "phase_results_processors": [{
        "normalization-processor": {
            "normalization": {"technique": "rrf", "parameters": {"rank_constant": 60}},
            "combination": {"technique": "arithmetic_mean", "parameters": {"weights": [0.4, 0.6]}}
        }
    }]
}
headers = {'Content-Type': 'application/json'}
resp = requests.put(
    url=f"https://{opensearch_domain}:{opensearch_port}/_search/pipeline/{search_pipeline_id}",
    auth=(opensearch_user, opensearch_pwd),
    json=search_pipeline_body,
    headers=headers,
    verify='./ca.cer')
print(resp.text)

Create k‑NN Index

# Create k‑NN index with faiss hnsw
index_body = {
    "settings": {
        "index.knn": True,
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "default_pipeline": pipeline_id
    },
    "mappings": {
        "properties": {
            "image_url": {"type": "text"},
            "caption_embedding": {"type": "knn_vector", "dimension": knn_dimension,
                "method": {"engine": "faiss", "space_type": "l2", "name": "hnsw", "parameters": {}}
            },
            "caption": {"type": "text"}
        }
    }
}
resp = cloud_search_cli.indices.create(index=index_name, body=index_body)
print(resp)

Load and Upload Dataset

# Load dataset from S3
import pandas as pd, json
# (code to read metadata, filter English titles, merge with image metadata omitted for brevity)

# Upload dataset in bulk
cnt = 0
batch = 0
action = json.dumps({"index": {"_index": index_name}})
body_ = ''
for _, row in dataset.iterrows():
    payload = {
        "image_url": "https://amazon-berkeley-objects.s3.amazonaws.com/images/small/" + row['path'],
        "caption": row['item_name_in_en_us']
    }
    # NDJSON: an action line followed by the document source, each newline-terminated.
    body_ += action + "\n" + json.dumps(payload) + "\n"
    cnt += 1
    if cnt == 100:
        resp = cloud_search_cli.bulk(request_timeout=1000, index=index_name, body=body_)
        cnt = 0
        batch += 1
        body_ = ''
if body_:
    # Flush the final partial batch so no documents are dropped.
    resp = cloud_search_cli.bulk(request_timeout=1000, index=index_name, body=body_)
    batch += 1

print("Total bulk batches completed: " + str(batch))
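The batching logic can also be factored into a reusable generator that builds the NDJSON bodies and flushes any final partial batch. This is a sketch of the same pattern, not a Cloud Search API:

```python
import json

def ndjson_batches(rows, index_name, batch_size=100):
    # Yield bulk NDJSON bodies of at most batch_size documents each.
    action = json.dumps({"index": {"_index": index_name}})
    lines, count = [], 0
    for row in rows:
        lines.append(action)
        lines.append(json.dumps(row))
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        # Final partial batch.
        yield "\n".join(lines) + "\n"
```

Each yielded body can be passed directly to `cloud_search_cli.bulk(...)`, keeping the upload loop free of counter bookkeeping.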

Hybrid Search Query Example

# Search with hybrid query using the search pipeline
def search(text, size):
    resp = cloud_search_cli.search(
        index=index_name,
        body={
            "_source": ["image_url", "caption"],
            "query": {
                "hybrid": {
                    "queries": [
                        {"match": {"caption": {"query": text}}},
                        {"remote_neural": {"caption_embedding": {"query_text": text, "k": size}}}
                    ]
                }
            }
        },
        params={"search_pipeline": search_pipeline_id}
    )
    return resp

k = 10
ret = search('shoes', k)
for item in ret['hits']['hits']:
    print(item['_source']['caption'])
    # display image using the URL if needed
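Pulling the caption and image URL out of each hit can be factored into a small helper; the response shape below follows the standard search-response format used in the loop above, with invented sample values:

```python
def extract_hits(resp):
    # Collect (caption, image_url) pairs from a search response.
    return [(h["_source"]["caption"], h["_source"]["image_url"])
            for h in resp["hits"]["hits"]]

# A minimal fake response for illustration.
fake = {"hits": {"hits": [
    {"_source": {"caption": "Shoes", "image_url": "https://example.com/a.jpg"}}
]}}
```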

Result

The hybrid search returns images and captions that are relevant both to exact keyword matches and to semantic similarity, demonstrating the effectiveness of combining keyword and vector retrieval.

Tags: Python, vector search, semantic search, OpenSearch, hybrid search, cloud search
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
