How LLMs and Vector Search Power Real-Time Icon Recommendations
This article explains a system that combines large language models with multimodal vector retrieval to understand user intent and recommend the most relevant icons in real time. It covers the overall workflow, semantic vectorization, offline indexing, online inference, and evaluation methods.
Background
In AntV infographic design, icons serve to visualize abstract text.
Overall Process
The solution combines the semantic understanding of large language models (LLMs) with the efficient matching of vector retrieval.
Text Parsing: From Text to Visual Concepts
The first step is deep comprehension of user intent. Users often express abstract ideas (e.g., “sustainable development”), while icon retrieval relies on concrete tags (e.g., “leaf”, “recycle”, “earth”).
Example user input:
"To reduce carbon emissions, the city increases greening and public transportation."
Traditional keyword segmentation would split the sentence into many words, but only a few are suitable for visual representation.
We use an LLM to analyze the whole sentence, identify the concepts best suited to visual representation, and output a keyword list: [碳排放, 绿化, 公共交通]. These Chinese concepts are then translated into English to align with pre-trained multimodal models such as CLIP: [carbon emission, urban greenery, public transportation]
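As a rough sketch, the extraction step could be implemented with a single LLM call like the one below. This assumes an OpenAI-style chat completion API; the model name and prompt wording are illustrative placeholders rather than the production configuration, and for brevity the prompt asks for English keywords directly, folding the translation step in.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_visual_keywords(sentence: str) -> list[str]:
    """Ask the LLM for the few concepts in a sentence that translate well into icons."""
    prompt = (
        "From the sentence below, list the 2-4 concepts most suitable for "
        "visual representation as icons. Return a comma-separated list of "
        "short English noun phrases, nothing else.\n\n"
        f"Sentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return [kw.strip() for kw in resp.choices[0].message.content.split(",")]

keywords = extract_visual_keywords(
    "To reduce carbon emissions, the city increases greening and public transportation."
)
# e.g. ["carbon emission", "urban greenery", "public transportation"]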
Semantic Vectorization: Building a Unified Text‑Image Language
After understanding intent, the keywords are encoded into semantic vectors using OpenAI’s CLIP model, which maps text and images into the same high‑dimensional space.
"In this space, the word ‘banana’ and a banana image are encoded as vectors that lie very close together, while ‘apple’ is a bit farther and ‘cat’ is far away."
Cosine similarity then measures how close any text vector lies to any icon vector; a higher score indicates stronger semantic relevance.
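As a concrete sketch, text and icon embeddings can be compared like this, assuming the Hugging Face transformers implementation of CLIP (openai/clip-vit-base-patch32 here; the production system may use a different checkpoint):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode a keyword and an icon image into the same vector space.
text_inputs = processor(text=["urban greenery"], return_tensors="pt", padding=True)
image_inputs = processor(images=Image.open("icon.png"), return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity: higher means the icon is semantically closer to the text.
score = torch.nn.functional.cosine_similarity(text_emb, image_emb).item()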
Real‑Time Recommendation: Vector Database and HNSW Algorithm
Because the icon library can contain millions of items, computing similarity against every icon by brute force is too slow for real-time use. We therefore store pre-computed icon vectors in a vector database and build an HNSW (Hierarchical Navigable Small World) index for fast approximate nearest-neighbor search.
Offline Processing and Index Construction
Icons are first standardized (size, background removal), then encoded with CLIP into vectors along with their tags and styles.
{
"id": "icon_1023",
"tags": ["urban", "tree", "greenery"],
"style": "flat",
"embedding": [0.12, -0.34, ..., 0.08]
}

All data are stored in the vector database with an index.
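As an illustration of the index build, the standalone hnswlib library exposes the same HNSW mechanics that vector databases use internally; the parameters below are typical starting points, not the production settings, and the random vectors stand in for real icon embeddings.

import hnswlib
import numpy as np

dim = 512  # embedding size of CLIP ViT-B/32
icon_embeddings = np.random.rand(10_000, dim).astype(np.float32)  # stand-in for real icon vectors
icon_ids = np.arange(10_000)

index = hnswlib.Index(space="cosine", dim=dim)
# M controls graph connectivity; ef_construction controls build-time search width.
# Both trade index quality against build time and memory.
index.init_index(max_elements=len(icon_ids), M=16, ef_construction=200)
index.add_items(icon_embeddings, icon_ids)
index.save_index("icons.hnsw")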
Online Inference: Real‑Time Matching
When a user submits text, the extracted keywords (e.g., "urban greenery") are fed to the CLIP text encoder to produce a text embedding:
text_embedding = clip.encode_text("urban greenery")
image_embedding = clip.encode_image("icon.png")  # icon embeddings are precomputed offline in practice

The cosine similarity between the text embedding and each stored icon embedding is computed:

similarity = cosine_similarity(text_embedding, image_embedding)

Icons with the highest similarity scores are returned to the user.
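Putting the online path together, a lookup against the prebuilt index might look like the following sketch (the k and ef values are illustrative):

import hnswlib
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load the index built offline (see the previous sketch).
index = hnswlib.Index(space="cosine", dim=512)
index.load_index("icons.hnsw")
index.set_ef(50)  # query-time search width: higher is more accurate but slower

with torch.no_grad():
    query_emb = model.get_text_features(
        **processor(text=["urban greenery"], return_tensors="pt", padding=True)
    ).numpy()

labels, distances = index.knn_query(query_emb, k=5)
# hnswlib's cosine space returns distance = 1 - cosine similarity
top_icons = [(int(i), 1.0 - float(d)) for i, d in zip(labels[0], distances[0])]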
Mixed Query: Condition Filtering + Semantic Matching
Beyond pure semantic similarity, additional constraints (e.g., style = 'flat') can be combined with the vector search. The query runs in two stages:

1. Filter by structured conditions such as style = 'flat'.
2. Convert the textual concept (e.g., "public transportation") into a vector and perform similarity search within the filtered subset.
This yields results that are both semantically relevant and style‑compatible.
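A minimal sketch of the two-stage query, with a brute-force search over the filtered subset standing in for the vector database's own filtered search (the icon records follow the JSON schema shown earlier):

import numpy as np

def mixed_query(icons: list[dict], query_emb: np.ndarray, style: str, k: int = 5):
    """Filter by structured conditions first, then rank survivors by cosine similarity."""
    # Stage 1: structured filtering, e.g. style = 'flat'.
    candidates = [icon for icon in icons if icon["style"] == style]

    # Stage 2: semantic matching within the filtered subset.
    embs = np.array([icon["embedding"] for icon in candidates])
    sims = embs @ query_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb))
    top = np.argsort(-sims)[:k]
    return [(candidates[i]["id"], float(sims[i])) for i in top]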
Effect Demonstration
For the example sentence “sales revenue decreased because core product market share fell 12%”, the system recommends a set of appropriate icons.
How Good Is the Recommendation?
We evaluate the system with a three‑part framework: offline testing, A/B experiments, and long‑term monitoring, using both human judgments and AI scores.
Human + AI Evaluation
A test set of 100 real sentences, ranging from abstract concepts to concrete objects, was created. Human reviewers scored each recommendation on a 1‑5 scale, while a multimodal model (Qwen2.5‑VL‑72B‑Instruct) provided automatic scores.
Average scores: Human 4.06, AI 4.09, indicating strong alignment between AI predictions and human perception.
A/B Experiments
Key variables include whether LLM‑based keyword extraction is used (which improved the average score from 4.0 to 4.09), as well as the choice of embedding model and retrieval parameters.
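As an illustration only (the article does not specify the statistical procedure), per-sentence scores from two variants could be compared with a standard significance test; the score arrays below are synthetic placeholders, not the experiment's data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder arrays; in practice these are the per-sentence 1-5 ratings for each variant.
scores_baseline = rng.normal(4.00, 0.4, size=100)  # without LLM keyword extraction
scores_llm = rng.normal(4.09, 0.4, size=100)       # with LLM keyword extraction

t_stat, p_value = stats.ttest_ind(scores_llm, scores_baseline)
print(f"mean lift: {scores_llm.mean() - scores_baseline.mean():.3f}, p-value: {p_value:.3f}")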
User Behavior Tracking
Real user interactions are monitored after deployment. Metrics such as Replacement Rate (the percentage of users who replace the recommended icon) and Top‑K Adoption Rate (how high in the ranked list the accepted icon appeared) guide further improvements.
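Both metrics are straightforward to derive from interaction logs; a sketch with hypothetical event records:

# Hypothetical events: the icon that was recommended first, the icon the user
# finally kept, and the rank of the kept icon in the top-k list
# (None = the user replaced it with something outside the recommendations).
events = [
    {"recommended": "icon_1023", "kept": "icon_1023", "rank": 1},
    {"recommended": "icon_0007", "kept": "icon_0912", "rank": None},
    {"recommended": "icon_0450", "kept": "icon_0450", "rank": 3},
]

replacement_rate = sum(e["kept"] != e["recommended"] for e in events) / len(events)
top3_adoption = sum(e["rank"] is not None and e["rank"] <= 3 for e in events) / len(events)

print(f"replacement rate: {replacement_rate:.0%}, top-3 adoption: {top3_adoption:.0%}")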
These data form a feedback loop that helps the recommendation engine become increasingly accurate and user‑friendly.