How to Build an Image Similarity Search System with ResNet, Milvus, and YOLO
This article walks through building an image similarity search system end to end: vectorizing images with ResNet, storing the high-dimensional vectors in Milvus, using an HNSW index for fast approximate nearest-neighbor (ANN) search, and applying YOLO for object detection, along with practical training tips.
1 Introduction
While working on a 2D‑product project, I needed a photo‑based product recognition feature. Starting from zero, I learned and implemented the necessary techniques, documenting the journey for others exploring image recognition.
2 Fundamentals of Image Similarity
2.1 Understanding Vectors
Images must be converted into numerical form. Vectorization transforms an image into a high-dimensional feature vector (commonly 512, 1024, or 2048 dimensions) whose dimensions jointly encode attributes such as color, texture, and shape. Image similarity then becomes a vector similarity problem.
Note: The illustration simplifies vectors to 3-D; real embeddings have hundreds or thousands of dimensions.
2.2 Learning Vectorization Algorithms
The steepest learning curve involved CNNs. Convolutional layers apply learnable filters across the image to extract hierarchical features. I used a pre‑trained ResNet‑50 model as a feature extractor, producing a 2048‑dimensional vector for each image.
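import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import resnet50, ResNet50_Weights

# Load the pre-trained model and drop the final classification layer
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
model = torch.nn.Sequential(*list(model.children())[:-1])
model.eval()

# Preprocess: resize, convert to tensor, apply ImageNet normalization
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Convert to RGB so PNGs with an alpha channel still yield 3 channels
image = Image.open('sample.png').convert('RGB')
image_tensor = transform(image).unsqueeze(0)

# Inference only, so skip gradient tracking
with torch.no_grad():
    features = model(image_tensor)
print(f"Feature vector shape: {features.shape}")  # torch.Size([1, 2048, 1, 1])
vector = features.flatten(1)  # shape (1, 2048), ready to store
2.3 Learning Vector Databases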
Traditional relational databases cannot handle high‑dimensional similarity search efficiently. Vector databases such as Milvus are optimized for ANN queries on millions of vectors. Alternatives include Pinecone (cloud‑native) and Weaviate (multimodal support).
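To see what a vector database saves you from, it helps to look at the exact baseline: a linear scan that compares the query against every stored vector. The sketch below is not Milvus code; it uses only the Python standard library, and the catalog ids and 4-D vectors are made up for illustration.

```python
import heapq
import math

def exact_top_k(query, vectors, k=3):
    """Return the k (distance, id) pairs closest to `query` by scanning every vector."""
    scored = ((math.dist(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)

# Toy 4-D "embeddings"; ids and values are hypothetical.
catalog = {
    "card_a": [0.9, 0.1, 0.0, 0.2],
    "card_b": [0.8, 0.2, 0.1, 0.1],
    "badge_c": [0.0, 0.9, 0.7, 0.1],
    "figure_d": [0.1, 0.0, 0.2, 0.9],
}
query = [0.9, 0.1, 0.0, 0.1]

top2 = exact_top_k(query, catalog, k=2)
print([vec_id for _, vec_id in top2])  # ['card_a', 'card_b']
```

Every query costs one distance computation per stored vector, which is fine for a few thousand items but becomes the bottleneck at millions. ANN indexes trade a small amount of recall for avoiding that full scan.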
2.4 Understanding HNSW Index
HNSW (Hierarchical Navigable Small World) supports approximate nearest-neighbor search by building a multi-layer graph. The top layer contains only a few nodes for a rapid coarse search, each lower layer adds more nodes to refine the region, and the bottom layer holds all data points, where the final fine-grained search runs. Because the search is greedy, the results are approximate rather than guaranteed exact.
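The layer-by-layer descent can be sketched in a few lines. This is a deliberately simplified illustration, not the real HNSW algorithm: the graphs are hand-wired instead of built incrementally, and the search keeps a single candidate per layer, whereas real implementations keep a candidate list at the bottom layer. All names here are hypothetical.

```python
import math

def greedy_search(points, graph, query, entry):
    """Move to the neighbor closest to `query` until no neighbor is closer."""
    current = entry
    best = math.dist(query, points[current])
    while True:
        candidate = min(graph[current], key=lambda n: math.dist(query, points[n]))
        d = math.dist(query, points[candidate])
        if d >= best:
            return current  # local minimum on this layer
        current, best = candidate, d

def layered_search(points, layers, query, entry):
    """Search each layer from top to bottom, reusing the result as the next entry point."""
    for graph in layers:
        entry = greedy_search(points, graph, query, entry)
    return entry

# Ten points on a line with ids 0..9 (toy data).
points = {i: (float(i),) for i in range(10)}
top = {0: [3], 3: [0, 6], 6: [3, 9], 9: [6]}  # sparse layer: long hops
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}  # all points

nearest = layered_search(points, [top, bottom], query=(7.2,), entry=0)
print(nearest)  # 7
```

The sparse layer gets from node 0 to node 6 in two hops, and the dense layer only has to take one more step to reach node 7. That is the intuition behind the coarse-to-fine structure.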
2.5 Similarity Metrics: From Semantic Understanding to Mathematical Computation
To translate human visual similarity into a computable form, common distance measures are used:
| Metric | Principle | Features | Use Cases |
|---|---|---|---|
| Cosine | Angle between vectors | Direction-only, length-invariant | Text & image matching |
| Euclidean (L2) | Straight-line distance | Clear geometric meaning, suffers in high dimensions | Low-dimensional precise matching |
| Manhattan (L1) | Sum of absolute differences | Simple, robust to outliers | Grid-like data, city-block distance |
| Inner Product (IP) | Vector dot product | Efficient, considers magnitude | Recommendation systems |
| Hamming | Count of differing bits | Binary data only | Error detection, binary features |
These metrics convert subjective visual judgments into objective numerical scores.
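As a rough illustration, the five measures can be written directly from their definitions. This is a minimal standard-library sketch for small toy vectors; a real system would rely on the vector database or a numerical library to compute them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def inner_product(a, b):
    """Dot product; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))  # 1.0
print(euclidean(a, b))          # 3.0
print(manhattan(a, b))          # 5.0
print(inner_product(a, b))      # 18.0
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```

The example vectors point in the same direction, so cosine similarity reports a perfect match while the Euclidean and Manhattan distances are non-zero. This is the "length-invariant" property from the table.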
3 Solving Real‑World Problems
3.1 Necessity of Object Detection
User‑uploaded photos often contain irrelevant background. Detecting and cropping the product region before vectorization dramatically improves accuracy. I chose YOLO for its balance of speed and precision.
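Once a detector has returned a bounding box, the cropping step is mostly coordinate arithmetic. The helper below is a hypothetical sketch, not part of YOLO or any library: it pads the box slightly so the product edge is not cut off, then clamps it to the image bounds. The resulting tuple has the `(left, upper, right, lower)` layout that Pillow's `Image.crop` expects.

```python
def padded_crop_box(box, image_size, pad_ratio=0.05):
    """Expand (x1, y1, x2, y2) by pad_ratio on each side and clamp to the image."""
    x1, y1, x2, y2 = box
    width, height = image_size
    pad_x = (x2 - x1) * pad_ratio
    pad_y = (y2 - y1) * pad_ratio
    return (
        max(0, int(x1 - pad_x)),
        max(0, int(y1 - pad_y)),
        min(width, int(x2 + pad_x)),
        min(height, int(y2 + pad_y)),
    )

# A detection near the right and bottom edges of a 320x480 photo (made-up numbers).
crop_box = padded_crop_box((100, 50, 300, 450), image_size=(320, 480), pad_ratio=0.25)
print(crop_box)  # (50, 0, 320, 480)
```

The padding ratio is an assumption to tune; too much padding brings the background back in, which is what the crop was meant to remove.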
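from ultralytics import YOLO
# Load the pre-trained YOLOv8 nano model
model = YOLO('yolov8n.pt')
# Run detection on a single image
results = model('sample.png')
# Display the image with predicted boxes drawn on it
results[0].show()
3.2 Challenges of Data Annotation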
Our domain (anime cards, figurines, badges) lacks generic models, so we annotated data ourselves using Label Studio, which offers an intuitive UI, team collaboration, and YOLO‑compatible export.
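For context on what "YOLO-compatible" means: a YOLO label file holds one line per object, in the form `class_id x_center y_center width height`, with all four box values normalized to the 0 to 1 range. Label Studio produces these files on export; the helper below is only a hypothetical sketch of the conversion, useful for sanity-checking an exported label against pixel coordinates.

```python
def to_yolo_label(class_id, box, image_size):
    """Convert a pixel box (x1, y1, x2, y2) into a normalized YOLO label line."""
    x1, y1, x2, y2 = box
    width, height = image_size
    x_center = (x1 + x2) / 2 / width
    y_center = (y1 + y2) / 2 / height
    box_w = (x2 - x1) / width
    box_h = (y2 - y1) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"

# A box covering the central half of a 640x480 image, class 0 (made-up example).
label_line = to_yolo_label(0, (160, 120, 480, 360), (640, 480))
print(label_line)  # 0 0.500000 0.500000 0.500000 0.500000
```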
# Install Label Studio
pip install label-studio
# Start the service
label-studio start
# Open http://localhost:8080 in a browser
3.3 Training Experience
With ~200 annotated images, training YOLO was straightforward. Using an RTX 4090, training finished in ~2 minutes, achieving 97.6 % mAP and a 6 MB model.
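from ultralytics import YOLO
# Start from the pre-trained nano weights and fine-tune on the custom dataset
model = YOLO('yolov8n.pt')
results = model.train(
    data='dataset/dataset.yaml',  # dataset config: image paths and class names
    epochs=200,                   # maximum number of epochs
    imgsz=640,                    # input image size
    batch=32,                     # batch size
    device=0,                     # first GPU
    patience=20,                  # stop early after 20 epochs without improvement
)
Key takeaways: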
Data quality matters more than quantity – 200 well‑labeled images yielded good results.
A GPU speeds up training dramatically compared to a CPU.
YOLO is easy to use – a few lines of code handle the whole pipeline.
4 Future Plans
4.1 Ideal System Architecture
The envisioned pipeline consists of four stages:
Model training: Annotate data with Label Studio, train a custom YOLO detector.
Data preprocessing: Use the detector to crop products, extract ResNet features, store vectors in Milvus.
Real-time retrieval: On user upload, run YOLO, extract features, perform ANN search.
Feedback loop: Collect user feedback, analyze failures, continuously improve the model.
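The real-time retrieval stage above can be sketched as a single function that wires the components together. Everything here is an assumption made for illustration: the function name, the callable signatures, and the stub implementations are hypothetical stand-ins for a YOLO detector, a ResNet feature extractor, and a Milvus search call.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]
Vector = Sequence[float]

def retrieve_similar(
    image: object,
    detect: Callable[[object], List[Box]],
    crop: Callable[[object, Box], object],
    embed: Callable[[object], Vector],
    search: Callable[[Vector, int], List[str]],
    top_k: int = 5,
) -> List[str]:
    """Detect the product, crop it, embed the crop, and query the vector index."""
    boxes = detect(image)
    # Fall back to the full image when nothing is detected.
    region = crop(image, boxes[0]) if boxes else image
    return search(embed(region), top_k)

# Stub components so the flow can be run without models or a database.
def fake_detect(image):
    return [(0, 0, 10, 10)]

def fake_crop(image, box):
    return f"{image}[{box}]"

def fake_embed(region):
    return [float(len(region))]

def fake_search(vector, top_k):
    return ["product_1", "product_2", "product_3"][:top_k]

matches = retrieve_similar("upload.png", fake_detect, fake_crop, fake_embed, fake_search, top_k=2)
print(matches)  # ['product_1', 'product_2']
```

Passing the components in as callables keeps each stage replaceable, which matters for the feedback loop: a retrained detector or a different embedding model can be swapped in without touching the retrieval flow.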
4.2 Current Progress and Challenges
Although each component works in isolation, integrating them into a stable product faces several hurdles:
Data volume bottleneck: 200 images were enough for a first detector but are insufficient for production; more diverse samples are needed.
Hyper-parameter tuning: Learning rate, batch size, and augmentation strategies require extensive experimentation.
Model generalization: Tested only on cards; performance on other product types is unknown.
Edge cases: Blurry, occluded, or poorly lit images still challenge accuracy.
Performance scaling: ANN search latency must be optimized for large-scale deployments.
Addressing each issue brings the solution closer to a production‑ready system.
5 Conclusion
This hands‑on journey deepened my understanding of image recognition pipelines. Despite difficulties, the iterative “learn‑by‑doing” approach proved valuable, and I look forward to sharing more practical experiences.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
