AI Techniques for a Global Search Platform: Word Segmentation, Text Similarity, Image Retrieval, and Multimodal Models
This article shares the development of a global search platform that leverages AI technologies, including Chinese word segmentation, part‑of‑speech tagging, text similarity via Simhash and the synonyms library, image similarity via histogram comparison, Hamming distance, and ResNet‑50 features, and multimodal CLIP‑based models, to improve search efficiency and accuracy.
In the second half of last year, the testing team began building a global retrieval platform to quickly locate text and resources in daily work, exploring several AI components along the way to meet common requirements.
01. Word segmentation and POS tagging – To speed up query processing while preserving accuracy, Chinese word segmentation and part‑of‑speech (POS) tagging are required. Chinese text has no explicit word boundaries and few morphological cues, so segmentation is highly ambiguous and many words are polysemous. After weighing traditional rule‑based methods against dictionary‑ and statistics‑based tools (jieba, SnowNLP, THULAC) on cost and accuracy, the team chose the jieba library for both segmentation and POS tagging.
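To make the dictionary‑based idea concrete, here is a minimal sketch of forward maximum matching, the classic greedy segmentation scheme such tools build on. The tiny dictionary is purely illustrative (jieba ships a full one plus HMM‑based handling of unseen words), and the function names are my own, not the platform's.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward-maximum-matching segmentation.

    At each position, try the longest candidate substring first;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens

# Illustrative toy dictionary; a real system loads tens of thousands of entries.
DICT = {"全局", "搜索", "平台", "分词"}
print(fmm_segment("全局搜索平台分词", DICT))  # → ['全局', '搜索', '平台', '分词']
```

With jieba itself, the equivalent calls are `jieba.lcut(text)` for segmentation and `jieba.posseg.cut(text)` for joint segmentation and POS tagging.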
02. Text similarity and synonym handling – The platform needs to compute the similarity between an input string and pre‑configured entries, returning the matching result when a threshold is exceeded. Similarity can be measured at the character level (Simhash), in vector space (Euclidean or cosine distance via scipy/sklearn), or semantically (the synonyms library). After testing, the synonyms library was adopted for its stable performance on Chinese synonym detection.
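The character‑level option mentioned above can be sketched with a bare‑bones Simhash: hash each token, take a weighted bit vote, and compare fingerprints by Hamming distance. This is a stdlib‑only illustration of the scheme, not the platform's implementation; real deployments weight tokens (e.g., by TF‑IDF) rather than counting each once.

```python
import hashlib

def simhash(tokens, bits=64):
    """Classic Simhash: per-bit voting over per-token hashes."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set where the positive votes win.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near‑duplicate strings produce fingerprints a few bits apart, so a small Hamming‑distance threshold (e.g., 3 on 64 bits) flags them as similar; semantically related but lexically different strings are exactly where a library like synonyms is needed instead.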
03. Pre‑trained models and image search – For “search by image” the system must decide whether two pictures belong to the same class and compute a similarity score. Candidate techniques include image fingerprinting (perceptual hashing), Hamming distance, histogram comparison, cosine distance on feature vectors, SSIM, and deep features from a pre‑trained ResNet‑50. The final solution stores ResNet‑50 embeddings in the Milvus vector database to perform fast similarity queries.
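The fingerprinting route can be illustrated with average hashing (aHash), one simple form of the perceptual hashing the paragraph mentions: threshold each pixel of a downscaled grayscale image against the mean, pack the bits, and compare hashes by Hamming distance. This is a hedged sketch on a plain 2‑D list standing in for an already resized 8×8 image; a real pipeline would resize and grayscale with a library such as Pillow first.

```python
def average_hash(pixels):
    """aHash over a small grayscale image given as a 2-D list of ints."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the mean, else 0.
    return sum(1 << i for i, p in enumerate(flat) if p > avg)

def hamming_distance(a, b):
    """Differing bits between two hashes; small means visually similar."""
    return bin(a ^ b).count("1")

# Toy 8x8 gradient image; changing a single pixel moves the hash only slightly.
img = [[10 * r + c for c in range(8)] for r in range(8)]
```

For the final solution, the flow is instead: feed each image through ResNet‑50 (minus its classification head) to get an embedding, insert the vectors into a Milvus collection, and answer queries with Milvus's nearest‑neighbour search.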
04. Multimodal pre‑trained models and text‑to‑image search – The goal is to input a textual description (e.g., “a red‑haired little girl”) and retrieve matching images. Multimodal models such as OpenAI’s CLIP learn aligned image‑text embeddings, but their out‑of‑the‑box Chinese performance is limited. By employing bilingual parallel corpora, knowledge distillation, and model compression, a Chinese‑adapted CLIP‑style model was built, achieving satisfactory retrieval results.
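Once text and images live in one aligned embedding space, retrieval reduces to ranking images by cosine similarity to the query's text embedding. The sketch below shows only that final scoring step on toy vectors; in practice the embeddings would come from the CLIP text and image encoders, which are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_images(text_emb, image_embs):
    """Indices of images, best match first - the CLIP-style retrieval step."""
    return sorted(range(len(image_embs)),
                  key=lambda i: cosine(text_emb, image_embs[i]),
                  reverse=True)

# Toy 2-D embeddings: image 1 points almost the same way as the query.
query = [1.0, 0.0]
images = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(rank_images(query, images))  # → [1, 2, 0]
```

Because CLIP normalises embeddings, cosine similarity and dot product rank identically there; the explicit normalisation above keeps the sketch general.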
05. Postscript – AI is a tool, not an end; whether one imports TensorFlow or pandas does not matter, as long as the chosen technique solves the problem efficiently. The article also outlines future extensions such as automatic tagging of platform submissions, resource deduplication, and multimodal search over audio and video.
NetEase LeiHuo Testing Center
LeiHuo Testing Center provides high-quality, efficient QA services, striving to become a leading testing team in China.