How to Build an Image Similarity Search System with ResNet, Milvus, and YOLO

This article walks through the end‑to‑end process of building an image similarity solution—from vectorizing images with ResNet, storing high‑dimensional vectors in Milvus, using HNSW for fast ANN search, to applying YOLO for object detection and practical training tips.

Zhuanzhuan Tech

1 Introduction

While working on a 2D‑product project, I needed a photo‑based product recognition feature. Starting from zero, I learned and implemented the necessary techniques, documenting the journey for others exploring image recognition.

2 Fundamentals of Image Similarity

2.1 Understanding Vectors

Images must be converted into numerical form. Vectorization transforms an image into a high‑dimensional feature vector (often 512‑ or 1024‑dimensional) where each dimension captures attributes such as color, texture, or shape. Similarity then becomes a vector similarity problem.

Vector illustration
Note: The illustration simplifies vectors to 3‑D; real applications use 512‑ or 1024‑D vectors with hundreds of features.
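As a toy illustration of "image as vector" (not the pipeline used later, which relies on learned ResNet features rather than raw pixels), even a tiny grayscale image can be flattened into a feature vector:

```python
import numpy as np

# Toy example: a 2x2 grayscale "image" flattened into a 4-D vector.
# Real systems use learned features (e.g. from ResNet), not raw pixels.
image = np.array([[0.1, 0.9],
                  [0.4, 0.6]])
vector = image.flatten()
print(vector.shape)  # (4,)
```

Similarity between two images then reduces to comparing two such vectors numerically.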

2.2 Learning Vectorization Algorithms

The steepest learning curve involved CNNs. Convolutional layers apply learnable filters across the image to extract hierarchical features. I used a pre‑trained ResNet‑50 model as a feature extractor, producing a 2048‑dimensional vector for each image.

import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision import transforms
from PIL import Image

# Load pre‑trained model and drop the classification head
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
model = torch.nn.Sequential(*list(model.children())[:-1])
model.eval()

# Preprocess: resize and normalize with the ImageNet statistics
# that the pre‑trained weights expect
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open('sample.png').convert('RGB')
image_tensor = transform(image).unsqueeze(0)
with torch.no_grad():
    features = model(image_tensor)
print(f"Feature vector shape: {features.shape}")  # (1, 2048, 1, 1)

2.3 Learning Vector Databases

Traditional relational databases cannot handle high‑dimensional similarity search efficiently. Vector databases such as Milvus are optimized for ANN queries on millions of vectors. Alternatives include Pinecone (cloud‑native) and Weaviate (multimodal support).
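For intuition about what a vector database optimizes away, here is the brute-force baseline it must beat: an exact linear scan over every stored vector. At millions of vectors this scan becomes the bottleneck that ANN indexes avoid.

```python
import numpy as np

def brute_force_search(query, database, top_k=3):
    """Exact nearest neighbors by cosine similarity over every stored vector."""
    db_norm = database / np.linalg.norm(database, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = db_norm @ q_norm                # cosine similarity per row
    return np.argsort(scores)[::-1][:top_k]  # indices of the best matches

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 128))                  # 1000 stored 128-D vectors
query = database[42] + rng.normal(scale=0.01, size=128)  # slightly perturbed copy of #42
print(brute_force_search(query, database)[0])  # 42
```

Every query touches every row; an HNSW index (next section) visits only a small fraction of them.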

2.4 Understanding HNSW Index

HNSW (Hierarchical Navigable Small World) provides approximate nearest‑neighbor search by building a multi‑layer graph. The top layer contains few nodes for rapid coarse search, the middle layers refine the region, and the bottom layer holds all data points for fine‑grained search.

HNSW index structure
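In Milvus, HNSW is configured through an index-parameter dictionary. The values below are a hypothetical starting point, not tuned numbers; the field and method names in the comments assume the standard pymilvus `Collection` API against a running Milvus server.

```python
# Hypothetical Milvus HNSW configuration; tune M/efConstruction/ef for your data.
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {
        "M": 16,                # max graph connections per node
        "efConstruction": 200,  # candidate-list size while building the graph
    },
}
search_params = {"params": {"ef": 64}}  # candidate-list size at query time

# With a connected pymilvus Collection this would be applied roughly as:
# collection.create_index(field_name="embedding", index_params=index_params)
# collection.search(data=[query_vector], anns_field="embedding",
#                   param=search_params, limit=5)
```

Larger `M` and `efConstruction` give better recall at the cost of memory and build time; `ef` trades query latency against recall.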

2.5 Similarity Metrics: From Semantic Understanding to Mathematical Computation

To translate human visual similarity into a computable form, common distance measures are used:

| Algorithm | Principle | Features | Use Cases |
| --- | --- | --- | --- |
| Cosine | Angle between vectors | Direction‑only, length‑invariant | Text & image matching |
| Euclidean (L2) | Straight‑line distance | Clear geometric meaning, suffers in high dimensions | Low‑dimensional precise matching |
| Manhattan (L1) | Sum of absolute differences | Simple, robust to outliers | Grid‑like data, city‑block distance |
| Inner Product (IP) | Vector dot product | Efficient, considers magnitude | Recommendation systems |
| Hamming | Count of differing bits | Binary data only | Error detection, binary features |

These metrics convert subjective visual judgments into objective numerical scores.
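The first four metrics in the table can be computed directly with NumPy (Hamming applies only to binary vectors). Note how cosine similarity ignores vector length while Euclidean distance does not:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
manhattan = np.abs(a - b).sum()
inner_product = a @ b

print(round(cosine, 4))  # 1.0 -> identical direction despite different lengths
print(manhattan)         # 6.0
print(inner_product)     # 28.0
```

The vectors point the same way (cosine = 1.0) yet remain geometrically apart (Euclidean ≈ 3.74), which is why cosine is the usual choice for comparing feature embeddings.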

3 Solving Real‑World Problems

3.1 Necessity of Object Detection

User‑uploaded photos often contain irrelevant background. Detecting and cropping the product region before vectorization dramatically improves accuracy. I chose YOLO for its balance of speed and precision.

from ultralytics import YOLO
model = YOLO('yolov8n.pt')
results = model('sample.png')
results[0].show()
YOLO detection
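Once YOLO returns a bounding box, cropping before feature extraction is a one-liner with PIL. The hard-coded box below is illustrative; in practice it would come from `results[0].boxes.xyxy[0].tolist()`:

```python
from PIL import Image

def crop_detection(image, box):
    """Crop a detected region given (x1, y1, x2, y2) pixel coordinates."""
    x1, y1, x2, y2 = map(int, box)
    return image.crop((x1, y1, x2, y2))

# Illustrative box; in practice taken from results[0].boxes.xyxy[0].tolist()
photo = Image.new("RGB", (640, 480))
product = crop_detection(photo, [100.0, 50.0, 300.0, 250.0])
print(product.size)  # (200, 200)
```

The cropped region, rather than the full photo, is what gets passed to ResNet for vectorization.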

3.2 Challenges of Data Annotation

Our domain (anime cards, figurines, badges) lacks generic models, so we annotated data ourselves using Label Studio, which offers an intuitive UI, team collaboration, and YOLO‑compatible export.

# Install Label Studio
pip install label-studio
# Start the service
label-studio start
# Open http://localhost:8080 in a browser
Label Studio annotation demo
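Label Studio's YOLO export writes one `.txt` label file per image, where each line holds a class id followed by a bounding box normalized to the image size. A minimal parser for one such line:

```python
def parse_yolo_label(line):
    """Parse one line of a YOLO-format label file.

    Format: class_id x_center y_center width height, with all box
    values normalized to [0, 1] relative to the image dimensions.
    """
    parts = line.split()
    return int(parts[0]), [float(v) for v in parts[1:]]

cls, box = parse_yolo_label("0 0.5 0.5 0.25 0.4")
print(cls, box)  # 0 [0.5, 0.5, 0.25, 0.4]
```

This is the on-disk format the `dataset.yaml` used in the next section points YOLO at during training.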

3.3 Training Experience

With ~200 annotated images, training YOLO was straightforward. Using an RTX 4090, training finished in ~2 minutes, achieving 97.6 % mAP and a 6 MB model.

from ultralytics import YOLO
model = YOLO('yolov8n.pt')
results = model.train(
    data='dataset/dataset.yaml',
    epochs=200,
    imgsz=640,
    batch=32,
    device=0,
    patience=20,
)
YOLO training log

Key takeaways:

Data quality matters more than quantity – 200 well‑labeled images yielded good results.

GPU speeds up training dramatically compared to CPU.

YOLO is easy to use – a few lines of code handle the whole pipeline.

4 Future Plans

4.1 Ideal System Architecture

The envisioned pipeline consists of four stages:

1. Model training: Annotate data with Label Studio and train a custom YOLO detector.

2. Data preprocessing: Use the detector to crop products, extract ResNet features, and store the vectors in Milvus.

3. Real‑time retrieval: On user upload, run YOLO, extract features, and perform an ANN search.

4. Feedback loop: Collect user feedback, analyze failures, and continuously improve the model.

Full system flowchart
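The retrieval stages above can be sketched as a small composition function. The component names (`detect`, `embed`, `ann_search`) are placeholders; the real versions would wrap YOLO, ResNet, and Milvus respectively:

```python
from PIL import Image

def search_similar(image, detect, embed, ann_search, top_k=5):
    """Hypothetical glue for the retrieval pipeline: detect -> crop -> embed -> search."""
    box = detect(image)                                   # (x1, y1, x2, y2) or None
    crop = image.crop(tuple(map(int, box))) if box else image
    return ann_search(embed(crop), top_k)

# Stub components to show the call shape; real versions wrap
# YOLO, ResNet, and Milvus respectively.
detect = lambda im: (10, 10, 50, 50)
embed = lambda crop: list(crop.size)
ann_search = lambda vec, k: [("product-123", vec)][:k]

matches = search_similar(Image.new("RGB", (100, 100)), detect, embed, ann_search)
print(matches)  # [('product-123', [40, 40])]
```

Keeping the stages decoupled like this also makes the feedback loop easier: any single component can be retrained or swapped without touching the others.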

4.2 Current Progress and Challenges

Although each component works in isolation, integrating them into a stable product faces several hurdles:

Data quality bottleneck: 200 images are insufficient; more diverse samples are needed.

Hyper‑parameter tuning: Learning rate, batch size, and augmentation strategies require extensive experimentation.

Model generalization: Tested only on cards; performance on other product types is unknown.

Edge cases: Blurry, occluded, or poorly lit images still challenge accuracy.

Performance scaling: ANN search latency must be optimized for large‑scale deployments.

Addressing each issue brings the solution closer to a production‑ready system.

5 Conclusion

This hands‑on journey deepened my understanding of image recognition pipelines. Despite difficulties, the iterative “learn‑by‑doing” approach proved valuable, and I look forward to sharing more practical experiences.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

computer vision · Vector Database · Milvus · HNSW · image similarity · ResNet · YOLO
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.