AI Techniques for a Global Search Platform: Word Segmentation, Text Similarity, Image Retrieval, and Multimodal Models
This article shares the development of a global search platform that leverages AI technologies, including Chinese word segmentation, part‑of‑speech tagging, text similarity via Simhash and the synonyms library, image similarity via histogram comparison, Hamming distance, and ResNet‑50 features, and multimodal CLIP‑based models, to improve search efficiency and accuracy.
In the second half of last year, the testing team began building a global retrieval platform to quickly locate text and resources in daily work, exploring several AI components along the way to meet common requirements.
01. Word segmentation and POS tagging – To speed up query processing while preserving accuracy, Chinese word segmentation and part‑of‑speech (POS) tagging are required. Chinese text has no explicit word boundaries and few morphological cues, so segmentation is highly ambiguous and many words are polysemous. After weighing traditional rule‑based methods against dictionary‑ and statistics‑based tools (jieba, SnowNLP, THULAC) on cost and accuracy, the team chose the jieba library for both segmentation and POS tagging.
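To make the dictionary‑based idea concrete, here is a minimal sketch of forward maximum matching, the classic greedy segmentation scheme such tools build on. The tiny dictionary is purely illustrative (jieba ships a full one plus HMM‑based handling of unseen words), and the function names are my own, not the platform's.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward-maximum-matching segmentation.

    At each position, try the longest candidate substring first;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens

# Illustrative toy dictionary; a real system loads tens of thousands of entries.
DICT = {"全局", "搜索", "平台", "分词"}
print(fmm_segment("全局搜索平台分词", DICT))  # → ['全局', '搜索', '平台', '分词']
```

With jieba itself, the equivalent calls are `jieba.lcut(text)` for segmentation and `jieba.posseg.cut(text)` for joint segmentation and POS tagging.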
02. Text similarity and synonym handling – The platform needs to compute the similarity between an input string and pre‑configured entries, returning the matching result when a threshold is exceeded. Similarity can be measured at the character level (Simhash), in vector space (Euclidean or cosine distance via scipy/sklearn), or semantically (the synonyms library). After testing, the synonyms library was adopted for its stable performance on Chinese synonym detection.
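The character‑level option mentioned above can be sketched with a bare‑bones Simhash: hash each token, take a weighted bit vote, and compare fingerprints by Hamming distance. This is a stdlib‑only illustration of the scheme, not the platform's implementation; real deployments weight tokens (e.g., by TF‑IDF) rather than counting each once.

```python
import hashlib

def simhash(tokens, bits=64):
    """Classic Simhash: per-bit voting over per-token hashes."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set where the positive votes win.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near‑duplicate strings produce fingerprints a few bits apart, so a small Hamming‑distance threshold (e.g., 3 on 64 bits) flags them as similar; semantically related but lexically different strings are exactly where a library like synonyms is needed instead.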
03. Pre‑trained models and image search – For “search by image” the system must decide whether two pictures belong to the same class and compute a similarity score. Candidate techniques include image fingerprinting (perceptual hashing), Hamming distance, histogram comparison, cosine distance on feature vectors, SSIM, and deep features from a pre‑trained ResNet‑50. The final solution stores ResNet‑50 embeddings in the Milvus vector database to perform fast similarity queries.
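The fingerprinting route can be illustrated with average hashing (aHash), one simple form of the perceptual hashing the paragraph mentions: threshold each pixel of a downscaled grayscale image against the mean, pack the bits, and compare hashes by Hamming distance. This is a hedged sketch on a plain 2‑D list standing in for an already resized 8×8 image; a real pipeline would resize and grayscale with a library such as Pillow first.

```python
def average_hash(pixels):
    """aHash over a small grayscale image given as a 2-D list of ints."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the mean, else 0.
    return sum(1 << i for i, p in enumerate(flat) if p > avg)

def hamming_distance(a, b):
    """Differing bits between two hashes; small means visually similar."""
    return bin(a ^ b).count("1")

# Toy 8x8 gradient image; changing a single pixel moves the hash only slightly.
img = [[10 * r + c for c in range(8)] for r in range(8)]
```

For the final solution, the flow is instead: feed each image through ResNet‑50 (minus its classification head) to get an embedding, insert the vectors into a Milvus collection, and answer queries with Milvus's nearest‑neighbour search.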
04. Multimodal pre‑trained models and text‑to‑image search – The goal is to input a textual description (e.g., “a red‑haired little girl”) and retrieve matching images. Multimodal models such as OpenAI’s CLIP learn aligned image‑text embeddings, but their out‑of‑the‑box Chinese performance is limited. By employing bilingual parallel corpora, knowledge distillation, and model compression, a Chinese‑adapted CLIP‑style model was built, achieving satisfactory retrieval results.
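Once text and images live in one aligned embedding space, retrieval reduces to ranking images by cosine similarity to the query's text embedding. The sketch below shows only that final scoring step on toy vectors; in practice the embeddings would come from the CLIP text and image encoders, which are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_images(text_emb, image_embs):
    """Indices of images, best match first - the CLIP-style retrieval step."""
    return sorted(range(len(image_embs)),
                  key=lambda i: cosine(text_emb, image_embs[i]),
                  reverse=True)

# Toy 2-D embeddings: image 1 points almost the same way as the query.
query = [1.0, 0.0]
images = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(rank_images(query, images))  # → [1, 2, 0]
```

Because CLIP normalises embeddings, cosine similarity and dot product rank identically there; the explicit normalisation above keeps the sketch general.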
05. Postscript – AI is a tool, not an end; whether one imports TensorFlow or pandas does not matter, as long as the chosen technique solves the problem efficiently. The article also outlines future extensions such as automatic tagging of platform submissions, resource deduplication, and multimodal search over audio and video.
NetEase LeiHuo Testing Center
LeiHuo Testing Center provides high-quality, efficient QA services, striving to become a leading testing team in China.