Tag Mining for Used‑Car Business: NLP, Word2Vec, and Retrieval Pipeline
This article details the end‑to‑end process of extracting and leveraging tags for used‑car listings, covering data collection, segmentation, NLP‑based tokenization, word‑vector generation, tag‑library construction, and online retrieval flow to improve personalized recall and CTR.
The article introduces a tagging approach for the used‑car market, explaining how combining listing labels with user behavior enables personalized retrieval and improves conversion across four key user‑action stages: traffic entry, search, click, and phone call.
It explains why mining business‑category tags is valuable: tags provide a flexible, personalized retrieval method that goes beyond static rule‑based search, allowing the system to surface long‑tail items and enrich item descriptions with attributes such as "young", "powerful", or "stylish".
The core workflow is described in six stages: (1) corpus collection from listings, news, search logs, and reviews; (2) segmentation using HanLP (with manual annotation and filtering); (3) building a tag library by aggregating and classifying filtered terms; (4) computing word vectors with Word2Vec to expand tag recall via similarity; (5) integrating tags into online experiments and feedback loops; (6) maintaining the tag library through iterative updates.
Detailed segmentation iterations are presented, showing the evolution from simple POS‑based tokenization to a structured perceptron model and the incorporation of a custom car‑model dictionary, which significantly improved tag quality.
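To illustrate why a custom car‑model dictionary helps, here is a minimal sketch that keeps multi‑word model names together as a single token. It does not use HanLP or a structured perceptron; it swaps in a plain longest‑match lookup over whitespace‑split words, and the function name `merge_dictionary_terms` and the dictionary entries are hypothetical.

```python
def merge_dictionary_terms(words, dictionary):
    """Greedy longest-match: at each position, prefer the longest dictionary entry.

    `words` is a list of already-split words; `dictionary` is a set of word tuples.
    This is a stand-in for a real segmenter with a custom dictionary, not HanLP's API.
    """
    max_len = max((len(term) for term in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest window first and shrink down to a single word.
        for size in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + size])
            if size == 1 or candidate in dictionary:
                tokens.append(" ".join(candidate))
                i += size
                break
    return tokens


# Hypothetical car-model dictionary entries.
CAR_MODELS = {("bmw", "3", "series"), ("land", "rover")}

words = "bmw 3 series strong power".split()
without_dict = merge_dictionary_terms(words, set())
with_dict = merge_dictionary_terms(words, CAR_MODELS)
print(without_dict)
print(with_dict)
```

Without the dictionary the model name is split into three unrelated tokens; with it, the name survives as one candidate tag, which is the kind of quality gain the custom dictionary is described as providing.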
Word‑vector generation is explained, highlighting how cosine similarity between tag vectors helps discover related tags and expand recall, while still requiring human verification.
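The similarity‑based expansion can be sketched as follows. The vectors here are hand‑written toy values rather than trained Word2Vec output, and the helper names, the `top_k` cutoff, and the `min_sim` threshold are assumptions made for illustration; in practice the candidates would still go through human verification.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_tag(tag, vectors, top_k=2, min_sim=0.8):
    """Return up to top_k (tag, similarity) pairs whose similarity is >= min_sim."""
    base = vectors[tag]
    scored = [
        (other, cosine(base, vec))
        for other, vec in vectors.items()
        if other != tag
    ]
    scored = [pair for pair in scored if pair[1] >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors (hypothetical; real Word2Vec vectors are much larger).
TOY_VECTORS = {
    "powerful": [0.9, 0.1, 0.0],
    "sporty": [0.8, 0.2, 0.1],
    "stylish": [0.1, 0.9, 0.2],
    "fuel-efficient": [0.0, 0.1, 0.9],
}

candidates = expand_tag("powerful", TOY_VECTORS)
print(candidates)
```

A threshold keeps weakly related tags out of the candidate list, so reviewers only need to check a short list per tag.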
The tag library is organized hierarchically (brand → series → model) and stored with fields such as ID, name, and associated vehicle IDs, enabling multi‑level retrieval and focusing recall at the series level for better relevance.
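One way to picture this structure is a small in‑memory tree. The sketch below assumes each entry carries an ID, a name, a level, a parent reference, and its vehicle IDs; the `level` and `parent_id` fields, the class names, and the sample data are hypothetical, since the article only lists ID, name, and associated vehicle IDs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TagNode:
    tag_id: int
    name: str
    level: str  # one of "brand", "series", "model"
    parent_id: Optional[int] = None
    vehicle_ids: List[int] = field(default_factory=list)


class TagLibrary:
    def __init__(self) -> None:
        self.nodes: Dict[int, TagNode] = {}

    def add(self, node: TagNode) -> None:
        self.nodes[node.tag_id] = node

    def series_of(self, tag_id: int) -> Optional[TagNode]:
        """Walk up the parent chain until a series-level node is found."""
        node = self.nodes.get(tag_id)
        while node is not None and node.level != "series":
            if node.parent_id is None:
                return None
            node = self.nodes.get(node.parent_id)
        return node

    def vehicles_at_series(self, tag_id: int) -> List[int]:
        """Collect vehicle IDs for the series and all models directly under it."""
        series = self.series_of(tag_id)
        if series is None:
            return []
        ids = set(series.vehicle_ids)
        for node in self.nodes.values():
            if node.parent_id == series.tag_id:
                ids.update(node.vehicle_ids)
        return sorted(ids)


library = TagLibrary()
library.add(TagNode(1, "BMW", "brand"))
library.add(TagNode(2, "3 Series", "series", parent_id=1))
library.add(TagNode(3, "320i", "model", parent_id=2, vehicle_ids=[101, 102]))
library.add(TagNode(4, "330i", "model", parent_id=2, vehicle_ids=[103]))

print(library.vehicles_at_series(3))
```

Resolving a model‑level tag to its series before recall widens the candidate set to sibling models while staying within a range the user is likely to find relevant.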
Finally, the online tag‑recall sequence is outlined: the user's top‑N behaviors are mapped to tags, the tags are expanded via similarity, the expanded tags are used to query the search service, and the resulting items are presented to the user. CTR and result quality are monitored continuously to drive further refinements.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.