Auto-Label Missing POI Categories Using Naive Bayes and Feature Selection
This article details a step‑by‑step machine‑learning pipeline that transforms over one million calibrated POI records into feature vectors, selects discriminative terms via information‑gain and domain rules, trains a Naive Bayes classifier, and achieves 91% accuracy with 84% coverage on unseen POI data.
Problem Overview
The Merchant Data Center (MDC) holds over one million calibrated POI records. Many POIs lack a category label. The task is to infer the missing category from the POI name using machine‑learning techniques, treating each name as a short Chinese text document.
Feature Representation
Names are tokenized with Lucene’s SmartCn analyzer to build a global dictionary of terms. Each POI is represented by a Boolean vector whose length equals the dictionary size; a dimension is 1 if the corresponding term appears in the name, otherwise 0. During dictionary construction two frequency tables are populated:
A(i,j) = count of term i in category j
T(j) = number of POIs belonging to category j
N = total number of calibrated POIs
These tables are later used for probability estimation.
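The dictionary, the Boolean vectors, and the A, T, and N tables can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes names are already tokenized (in the article the tokens come from SmartCn), the sample records and the function names build_tables and to_boolean_vector are hypothetical, and A counts each term at most once per name to match the Boolean representation.

```python
from collections import Counter, defaultdict

# Hypothetical pre-tokenized sample; real tokens would come from the SmartCn analyzer.
SAMPLE = [
    (["好再来", "牛肉", "拉面", "馆"], "快餐"),
    (["七天", "酒店"], "酒店"),
    (["海底捞", "火锅"], "火锅"),
    (["小肥羊", "火锅", "馆"], "火锅"),
]

def build_tables(records):
    """Return (vocab, A, T, N) from (tokens, category) pairs."""
    vocab = {}                # term -> dimension index in the Boolean vector
    A = defaultdict(Counter)  # A[term][category] = names in category containing term
    T = Counter()             # T[category] = number of POIs in that category
    for tokens, category in records:
        T[category] += 1
        # dict.fromkeys removes duplicates while keeping token order stable.
        for term in dict.fromkeys(tokens):
            vocab.setdefault(term, len(vocab))
            A[term][category] += 1
    N = sum(T.values())       # total number of labeled POIs
    return vocab, A, T, N

def to_boolean_vector(tokens, vocab):
    """1 where a dictionary term occurs in the name, 0 elsewhere."""
    vec = [0] * len(vocab)
    for term in tokens:
        idx = vocab.get(term)
        if idx is not None:   # terms outside the dictionary are ignored
            vec[idx] = 1
    return vec
```

Calling build_tables(SAMPLE) yields a nine-term dictionary for this toy sample, and to_boolean_vector maps any tokenized name onto those nine dimensions.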
Feature Selection
Statistical (Information‑Gain) Method
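Information gain (IG) ranks terms by their discriminative power:
IG(t) = H(C) – H(C|t)
The entropy H(C) is computed over the category distribution, and the conditional entropy H(C|t) over the distribution of categories given the term. The top 30 % of terms (≈20 terms) are retained, e.g., 酒店 (hotel), 宾馆 (guesthouse), 火锅 (hot pot), 摄影 (photography), 眼镜 (eyeglasses), 美容 (beauty), 咖啡 (coffee), KTV, 造型 (styling), 汽车 (automobile), 餐厅 (restaurant), 蛋糕 (cake), 儿童 (children), 美发 (hairdressing), 商务 (business), 旅行社 (travel agency), 婚纱 (wedding dress), 会所 (club), 影城 (cinema), 烤肉 (barbecue).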
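A minimal sketch of the IG ranking, under stated assumptions: H(C|t) is taken as the usual average over "term present" and "term absent", the inputs are per-term category counts and per-category totals of the kind described under Feature Representation, and the function names and the fraction parameter are illustrative rather than taken from the source.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(term_counts, category_totals):
    """IG(t) = H(C) - H(C|t).

    term_counts[j]     = names in category j that contain the term
    category_totals[j] = all names in category j
    """
    n = sum(category_totals.values())
    present = [term_counts.get(c, 0) for c in category_totals]
    absent = [category_totals[c] - term_counts.get(c, 0) for c in category_totals]
    p_present = sum(present) / n
    h_c = entropy(list(category_totals.values()))
    h_c_given_t = p_present * entropy(present) + (1 - p_present) * entropy(absent)
    return h_c - h_c_given_t

def select_top_terms(A, T, fraction=0.3):
    """Rank terms by IG and keep the top fraction (at least one term)."""
    ranked = sorted(A, key=lambda t: information_gain(A[t], T), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

A term that appears only in one category gets the maximum gain, while a term spread evenly across categories gets a gain of zero, which is why generic tokens drop out of the dictionary.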
Domain‑Knowledge Rule
POI names often follow the pattern brand core + category word. A heuristic discards the leading brand tokens and keeps the trailing category‑related tokens. This rule‑based trimming improves accuracy by about 5 % while slightly reducing coverage.
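The source does not spell out how the brand boundary is detected, so the following is only one plausible sketch: it assumes the brand ends at the first token found in the selected feature dictionary, and it returns an empty list when no such token exists, which is one way such a rule could lower coverage. The function name trim_brand_tokens and the sample terms are hypothetical.

```python
def trim_brand_tokens(tokens, category_terms):
    """Drop leading tokens until the first category-related term.

    Assumption: a token is "category-related" if it is in the selected
    feature dictionary. Returns [] when no such token exists, in which
    case no prediction would be made for the name.
    """
    for i, token in enumerate(tokens):
        if token in category_terms:
            return tokens[i:]
    return []

# Hypothetical dictionary and names for illustration.
category_terms = {"牛肉", "拉面", "馆", "火锅"}
print(trim_brand_tokens(["好再来", "牛肉", "拉面", "馆"], category_terms))
print(trim_brand_tokens(["七天"], category_terms))
```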
Naive Bayes Classification Model
Model Variants
Multivariate Bernoulli model – uses binary presence/absence of each term.
Multinomial event model – uses term frequencies within the name.
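The difference between the two variants shows up in how the class-conditional likelihood of a name is computed. The sketch below uses the textbook definitions; the function names and the probability tables passed in are illustrative, and the tables are assumed to be already smoothed so that no probability is exactly 0 or 1.

```python
import math

def bernoulli_log_likelihood(tokens, p_term):
    """Multivariate Bernoulli: every dictionary term contributes.

    p_term[t] = P(term t appears in a name | class).
    Present terms add log p, absent terms add log(1 - p).
    """
    present = set(tokens)
    return sum(
        math.log(p) if term in present else math.log(1.0 - p)
        for term, p in p_term.items()
    )

def multinomial_log_likelihood(tokens, p_term):
    """Multinomial: only observed tokens contribute, once per occurrence.

    p_term[t] = P(a token drawn from the class is t).
    Tokens outside the dictionary are skipped.
    """
    return sum(math.log(p_term[t]) for t in tokens if t in p_term)
```

Because POI names are short and rarely repeat a token, the two models see nearly the same evidence for present terms; the Bernoulli variant additionally penalizes a class for dictionary terms that are absent from the name.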
Parameter Estimation
Maximum‑likelihood estimation with Laplace (add‑1) smoothing is applied. For the multinomial model the smoothed conditional probability is:
P(t_i|C_j) = (count(t_i, C_j) + 1) / ( Σ_k count(t_k, C_j) + |V| )
Computations are performed in log‑space to avoid underflow, and probabilities are lower‑bounded by 1e‑6.
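The add‑1 estimate, the 1e‑6 floor, and the move to log‑space can be sketched for a single class as follows. The function name smoothed_log_probs and the sample counts are hypothetical; the formula is the multinomial one given above.

```python
import math

def smoothed_log_probs(class_term_counts, vocab, floor=1e-6):
    """Return log P(t | class) for every term in vocab.

    Applies add-1 smoothing: (count + 1) / (total count + |V|),
    clamps each probability at `floor`, then takes the logarithm.
    """
    total = sum(class_term_counts.get(t, 0) for t in vocab)
    denom = total + len(vocab)
    return {
        t: math.log(max((class_term_counts.get(t, 0) + 1) / denom, floor))
        for t in vocab
    }

# Hypothetical counts for one class over a three-term dictionary.
log_p = smoothed_log_probs({"牛肉": 3, "拉面": 1}, ["牛肉", "拉面", "火锅"])
```

With these counts the denominator is 4 + 3 = 7, so the unseen term 火锅 still receives probability 1/7 instead of zero, and summing log-probabilities replaces multiplying many small numbers.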
Prediction Procedure (Illustrative Example)
Tokenize the POI name with SmartCn.
Apply the rule‑based trimming to drop brand tokens.
Map the remaining tokens to a Boolean vector over a small dictionary, e.g., [拉面 (ramen), 七天 (7 Days), 牛肉 (beef), 馆 (eatery)]. The name “好再来牛肉拉面馆” (Haozailai Beef Ramen Eatery) becomes [1,0,1,1].
Compute log‑posterior probabilities for each class (e.g., 火锅 (hot pot) vs. 快餐 (fast food)) using the smoothed conditional probabilities.
Select the class with the higher posterior; in the example the fast‑food class receives a probability four times larger than the hot‑pot class.
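The procedure above can be sketched end to end with a toy two-class model. All counts and priors below are made up for illustration and are not the article's numbers, so the resulting ratio is not meant to reproduce the four-to-one figure; the sketch uses the multinomial form with add‑1 smoothing and abstains (returns None) when no token is in the dictionary.

```python
import math

VOCAB = ["拉面", "七天", "牛肉", "馆"]

# Hypothetical per-class term counts and equal class priors.
COUNTS = {
    "快餐": {"拉面": 6, "牛肉": 4, "馆": 5},
    "火锅": {"牛肉": 3, "馆": 2},
}
LOG_PRIORS = {"快餐": math.log(0.5), "火锅": math.log(0.5)}

def log_cond(counts):
    """Add-1 smoothed log P(t | class) over VOCAB."""
    denom = sum(counts.get(t, 0) for t in VOCAB) + len(VOCAB)
    return {t: math.log((counts.get(t, 0) + 1) / denom) for t in VOCAB}

LOG_COND = {c: log_cond(counts) for c, counts in COUNTS.items()}

def predict(tokens):
    """Return the class with the highest log-posterior, or None to abstain."""
    known = [t for t in tokens if t in VOCAB]
    if not known:
        return None  # nothing to score: the POI stays uncovered
    scores = {
        c: LOG_PRIORS[c] + sum(LOG_COND[c][t] for t in known)
        for c in LOG_PRIORS
    }
    return max(scores, key=scores.get)

print(predict(["牛肉", "拉面", "馆"]))
```

Under these toy counts the fast‑food class wins because 拉面 was never observed in the hot‑pot class and only receives the smoothed probability.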
Evaluation
On a random sample of 2,000 uncategorized POIs:
Coverage (fraction of POIs for which a prediction is produced): 84 %.
Accuracy (correct predictions among covered POIs): 91 %.
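Both metrics follow directly from their definitions. In this sketch a prediction of None stands for "no prediction produced", and the reference labels are assumed to come from manual review of the sample; the function name and the toy inputs are hypothetical.

```python
def coverage_and_accuracy(predictions, reference):
    """Coverage = predicted / all; accuracy = correct / predicted."""
    covered = [(p, r) for p, r in zip(predictions, reference) if p is not None]
    coverage = len(covered) / len(predictions)
    accuracy = sum(p == r for p, r in covered) / len(covered) if covered else 0.0
    return coverage, accuracy

# Toy example: 3 of 4 POIs receive a prediction, 2 of those 3 are correct.
cov, acc = coverage_and_accuracy(
    ["快餐", None, "火锅", "快餐"],
    ["快餐", "火锅", "火锅", "火锅"],
)
```

Reporting the two numbers together matters because a classifier can raise accuracy simply by abstaining more often.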
Key Takeaways
Formulate the problem as short‑text classification before selecting algorithms.
Use a Boolean vector‑space model with a domain‑specific dictionary and information‑gain based feature selection for Chinese POI names.
Naive Bayes with Laplace smoothing and log‑space computation provides a fast, effective baseline; rule‑based removal of brand tokens can further improve accuracy at modest coverage loss.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
