Concept Tag Mining for Recommendation Systems: Methods, Challenges, and Solutions

This article gives an overview of concept tag mining for recommendation systems. It describes three approaches (unsupervised pattern matching, semi‑supervised AutoPhrase, and supervised NER), weighs their advantages and drawbacks, and offers practical solutions to tag duplication and tag quality issues.

HomeTech

Sep 8, 2022

Purpose – Concepts encapsulate world knowledge and guide human cognition. Extracting concept tags from user‑generated and professionally generated content (UGC/PGC) is crucial for recommendation systems, where tags must align with user interests and cognition.

Concept tags are mined from query logs, with candidates generated by three methods: an unsupervised pattern‑matching approach, the semi‑supervised AutoPhrase framework, and a supervised NER‑based method.

1. Unsupervised pattern‑matching – Predefined patterns such as "最XXX的车" ("the most XXX car") and "适合XXX的车" ("a car suitable for XXX") extract concepts directly from user queries; query frequency and click‑through rate are then used to filter and rank the candidates. New patterns are discovered automatically in two ways: by mining queries that contain already known concepts, or by masking brand and model tokens in queries and extracting novel patterns from the masked text.

2. Semi‑supervised AutoPhrase – AutoPhrase extracts phrase candidates using distant supervision: Wikipedia entries supply the positive pool, alongside a pool of noisy negatives. A decision‑tree‑based classifier is trained on these pools, and its output is filtered by rules (brand presence, part of speech, length) and by semantic similarity with queries so that only high‑quality phrases remain.

3. Supervised NER – Concept extraction is cast as sequence labeling: tokens that belong to a concept are marked with a BIO tagging scheme, and the training data is generated by the previous two methods. A BERT‑CRF model is fine‑tuned with separate learning rates for the BERT and CRF layers, achieving higher precision and recall than the unsupervised approaches.
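The hand-off from pattern matching to NER training data can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the function names, the `CON` label, the `min_count` threshold, and the choice to treat the text in the XXX slot as the concept are all assumptions, and click‑through filtering is omitted. Labels are assigned per character, which is a common choice for Chinese BERT models.

```python
import re
from collections import Counter

# Hypothetical patterns modelled on the article's two examples.
# The named group "concept" plays the role of the XXX slot.
PATTERNS = [
    re.compile(r"最(?P<concept>.+?)的车"),    # "the most XXX car"
    re.compile(r"适合(?P<concept>.+?)的车"),  # "a car suitable for XXX"
]


def extract_concepts(queries, min_count=2):
    """Count slot fillers captured by the patterns and keep the frequent ones.

    Only query frequency is used here; the article also ranks by
    click-through rate, which this sketch leaves out.
    """
    counts = Counter()
    for query in queries:
        for pattern in PATTERNS:
            match = pattern.search(query)
            if match:
                counts[match.group("concept")] += 1
    return {concept: n for concept, n in counts.items() if n >= min_count}


def bio_labels(text, concepts):
    """Produce character-level BIO labels for every concept found in text.

    Longer concepts are placed first so that a shorter concept cannot
    overwrite part of a longer one.
    """
    labels = ["O"] * len(text)
    for concept in sorted(concepts, key=len, reverse=True):
        if not concept:
            continue
        start = text.find(concept)
        while start != -1:
            end = start + len(concept)
            if all(label == "O" for label in labels[start:end]):
                labels[start] = "B-CON"
                labels[start + 1:end] = ["I-CON"] * (len(concept) - 1)
            start = text.find(concept, end)
    return labels


queries = ["最省油的车", "最省油的车有哪些", "适合女生的车", "最安全的车"]
concepts = extract_concepts(queries)
training_rows = [(q, bio_labels(q, concepts)) for q in queries]
```

Each `(text, labels)` pair could then be fed to a sequence labeling model such as the BERT‑CRF described above. Because the labels come from heuristics rather than annotators, they are noisy, so a sample would normally be reviewed before training.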

Tag duplication problem – Duplicate or overlapping tags (e.g., synonyms or hierarchical relations) cause redundancy and evaluation errors. Two mitigation strategies are proposed: (a) edit‑distance and synonym detection to merge similar tags, and (b) semantic similarity scoring (e.g., SimBERT) to identify and remove near‑duplicate concepts.
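Strategy (a) can be illustrated with a small sketch, assuming tags arrive sorted by descending frequency so that the most common spelling becomes the canonical form. The function names and the `max_ratio` threshold are illustrative assumptions. Edit distance only catches surface variants; true synonyms and hierarchical relations would still need a synonym dictionary or the embedding‑based similarity of strategy (b), which is not shown here.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def merge_similar_tags(tags_by_freq, max_ratio=0.2):
    """Greedily group tags whose normalized edit distance is small.

    tags_by_freq: tags ordered from most to least frequent (an assumption).
    Returns a dict mapping each canonical tag to the variants merged into it.
    """
    groups = {}
    for tag in tags_by_freq:
        for canonical in groups:
            longest = max(len(tag), len(canonical))
            if longest and edit_distance(tag, canonical) / longest <= max_ratio:
                groups[canonical].append(tag)
                break
        else:
            # No existing canonical tag is close enough: start a new group.
            groups[tag] = []
    return groups


tags = ["fuel efficient", "fuel-efficient", "family suv", "fuel efficent"]
merged = merge_similar_tags(tags)
```

The threshold trades precision for recall: a larger `max_ratio` merges more variants but risks collapsing short tags that differ in meaning, so it is worth tuning on a labeled sample.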

Conclusion – The article demonstrates a pipeline combining unsupervised, semi‑supervised, and supervised techniques for robust concept tag mining, addresses practical issues such as tag duplication, and outlines future work on concept labeling.

