Auto-Label Missing POI Categories Using Naive Bayes and Feature Selection

This article details a step‑by‑step machine‑learning pipeline that transforms over one million calibrated POI records into feature vectors, selects discriminative terms via information‑gain and domain rules, trains a Naive Bayes classifier, and achieves 91% accuracy with 84% coverage on unseen POI data.

Meituan Technology Team

Problem Overview

The Merchant Data Center (MDC) holds over one million calibrated POI records. Many POIs lack a category label. The task is to infer the missing category from the POI name using machine‑learning techniques, treating each name as a short Chinese text document.

Feature Representation

Names are tokenized with Lucene’s SmartCn analyzer to build a global dictionary of terms. Each POI is represented by a Boolean vector whose length equals the dictionary size; a dimension is 1 if the corresponding term appears in the name, otherwise 0. During dictionary construction two frequency tables and one overall total are populated:

A(i,j) = count of term i in category j
T(j)   = number of POIs belonging to category j
N      = total number of calibrated POIs

These tables are later used for probability estimation.
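A minimal sketch of how the dictionary, the Boolean vectors, and the counts A, T, and N might be built. It assumes the names have already been segmented (the SmartCn step is not reproduced here), and it counts each term at most once per name, which is one possible reading of "count of term i in category j". The function names and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(records):
    """records: iterable of (tokens, category) pairs with pre-segmented tokens."""
    vocab = {}                  # term -> dimension index in the Boolean vector
    A = defaultdict(Counter)    # A[category][term] = POIs in category containing term
    T = Counter()               # T[category] = number of POIs in category
    N = 0                       # total number of labeled POIs
    for tokens, category in records:
        N += 1
        T[category] += 1
        # dict.fromkeys de-duplicates while keeping first-seen order
        for term in dict.fromkeys(tokens):
            vocab.setdefault(term, len(vocab))
            A[category][term] += 1
    return vocab, A, T, N

def to_boolean_vector(tokens, vocab):
    """1 where a dictionary term occurs in the name, 0 elsewhere."""
    vec = [0] * len(vocab)
    for term in tokens:
        idx = vocab.get(term)
        if idx is not None:     # terms outside the dictionary are ignored
            vec[idx] = 1
    return vec

# Hypothetical toy data, not taken from the MDC records
records = [
    (["七天", "酒店"], "hotel"),
    (["如家", "酒店"], "hotel"),
    (["牛肉", "拉面", "馆"], "fast_food"),
]
vocab, A, T, N = build_tables(records)
vector = to_boolean_vector(["如家", "酒店", "未知"], vocab)
```

Storing A as a nested counter keyed by category keeps the table sparse, which matters when the dictionary is large and most terms never occur in most categories.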

Feature Selection

Statistical (Information‑Gain) Method

Information gain (IG) ranks terms by their discriminative power:

IG(t) = H(C) - H(C|t)

Here H(C) is the entropy of the category distribution, and H(C|t) is the conditional entropy of the category once it is known whether term t appears in the name. The top 30 % of terms (≈20 terms) are retained, e.g., 酒店, 宾馆, 火锅, 摄影, 眼镜, 美容, 咖啡, KTV, 造型, 汽车, 餐厅, 蛋糕, 儿童, 美发, 商务, 旅行社, 婚纱, 会所, 影城, 烤肉.
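As a rough illustration, IG can be computed directly from the per-category counts. The sketch below assumes the usual text-classification form, in which H(C|t) averages the category entropy over the "term present" and "term absent" cases; the function names, the nested-dict layout of A, and the toy counts are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(term_counts, class_counts):
    """term_counts[j]: POIs of category j containing the term;
    class_counts[j]: all POIs of category j."""
    n = sum(class_counts)
    present = list(term_counts)
    absent = [t - a for a, t in zip(term_counts, class_counts)]
    n_present = sum(present)
    # Conditional entropy: weighted average over term present / term absent
    h_cond = (n_present / n) * entropy(present) + ((n - n_present) / n) * entropy(absent)
    return entropy(class_counts) - h_cond

def select_top_terms(A, T, fraction=0.3):
    """Rank all terms by IG and keep the top `fraction` of them (at least one)."""
    categories = sorted(T)
    class_counts = [T[c] for c in categories]
    terms = {t for c in A for t in A[c]}
    scores = {
        t: information_gain([A[c].get(t, 0) for c in categories], class_counts)
        for t in terms
    }
    keep = max(1, int(len(scores) * fraction))
    # Ties are broken alphabetically so the result is deterministic
    return sorted(scores, key=lambda t: (-scores[t], t))[:keep]

# Hypothetical toy counts: "店" is spread evenly, the other two terms are not
A = {"hotel": {"酒店": 2, "店": 1}, "food": {"拉面": 2, "店": 1}}
T = {"hotel": 2, "food": 2}
selected = select_top_terms(A, T, fraction=0.7)
```

A term that appears only in one category scores the full H(C), while a term spread evenly across categories scores zero, which is why generic words fall to the bottom of the ranking.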

Domain‑Knowledge Rule

POI names often follow the pattern brand core + category word. A heuristic discards the leading brand tokens and keeps the trailing category‑related tokens. This rule‑based trimming improves accuracy by about 5 % while slightly reducing coverage.
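The source does not spell out the exact trimming rule, so the following is only one plausible sketch: scan the tokens from the end of the name and keep the unbroken trailing run that belongs to a set of category-related terms. The function name and the `category_terms` set are hypothetical.

```python
def trim_brand_tokens(tokens, category_terms):
    """Keep the trailing run of tokens found in `category_terms`.

    Assumption: the category word sits at the end of the name, so scanning
    stops at the first token (from the right) that is not a category term.
    """
    kept = []
    for token in reversed(tokens):
        if token not in category_terms:
            break
        kept.append(token)
    kept.reverse()  # restore the original left-to-right order
    return kept

# Hypothetical set of category-related terms
category_terms = {"牛肉", "拉面", "馆", "酒店"}

trimmed_noodle = trim_brand_tokens(["好再来", "牛肉", "拉面", "馆"], category_terms)
trimmed_hotel = trim_brand_tokens(["七天", "酒店"], category_terms)
trimmed_supply = trim_brand_tokens(["酒店", "用品", "批发"], category_terms)
```

Under this rule a name such as 酒店 / 用品 / 批发 yields no tokens at all, so no prediction is made for it. That is consistent with the trade-off described above: a misleading leading term no longer triggers a wrong label, at the cost of some coverage.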

Naive Bayes Classification Model

Model Variants

Multivariate Bernoulli model – uses binary presence/absence of each term.

Multinomial event model – uses term frequencies within the name.
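For reference, the two variants differ in the likelihood they assign to a name d. The notation x_i (1 if term t_i occurs in the name, else 0) and n_i (number of occurrences of t_i in the name) is introduced here for illustration and does not come from the source; the forms themselves are the standard textbook ones.

```latex
\begin{align*}
\text{Bernoulli:}\quad
  & P(d \mid C_j) = \prod_{i=1}^{|V|} P(t_i \mid C_j)^{x_i}\,
    \bigl(1 - P(t_i \mid C_j)\bigr)^{1 - x_i} \\
\text{Multinomial:}\quad
  & P(d \mid C_j) \propto \prod_{i=1}^{|V|} P(t_i \mid C_j)^{n_i}
\end{align*}
```

Because POI names are short and seldom repeat a term, n_i is usually 0 or 1, so the practical difference is mainly that the Bernoulli model also scores the terms that are absent from the name.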

Parameter Estimation

Maximum‑likelihood estimation with Laplace (add‑1) smoothing is applied. For the multinomial model the smoothed conditional probability is:

P(t_i|C_j) = (count(t_i, C_j) + 1) / ( Σ_k count(t_k, C_j) + |V| )

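Here count(t_i, C_j) is the number of occurrences of term t_i in category C_j, and |V| is the dictionary size. Computations are performed in log‑space to avoid underflow, and probabilities are lower‑bounded by 1e‑6.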
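The estimation step might look like the sketch below for the multinomial model. It applies add-1 smoothing as in the formula above, stores log-probabilities, and applies the 1e-6 floor to the conditional probabilities; where exactly the source applies that floor is not stated, so this placement is an assumption. Function names and the toy records are hypothetical.

```python
import math
from collections import Counter, defaultdict

MIN_PROB = 1e-6  # lower bound on conditional probabilities (placement assumed)

def train_multinomial_nb(records):
    """records: iterable of (tokens, category) pairs with pre-segmented tokens."""
    term_counts = defaultdict(Counter)   # term_counts[category][term]
    class_counts = Counter()
    vocab = set()
    for tokens, category in records:
        class_counts[category] += 1
        term_counts[category].update(tokens)
        vocab.update(tokens)
    n = sum(class_counts.values())
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Laplace smoothing: add 1 to every count, |V| to the denominator
        denom = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log(max((term_counts[c][t] + 1) / denom, MIN_PROB))
            for t in vocab
        }
    return log_prior, log_likelihood

def log_score(tokens, category, log_prior, log_likelihood):
    """Unnormalized log-posterior; tokens outside the vocabulary are skipped."""
    table = log_likelihood[category]
    return log_prior[category] + sum(table[t] for t in tokens if t in table)

# Hypothetical toy training data
records = [(["牛肉", "拉面"], "fast_food"), (["牛肉", "火锅"], "hot_pot")]
log_prior, log_likelihood = train_multinomial_nb(records)
fast_food_score = log_score(["牛肉", "拉面"], "fast_food", log_prior, log_likelihood)
hot_pot_score = log_score(["牛肉", "拉面"], "hot_pot", log_prior, log_likelihood)
```

Summing logarithms instead of multiplying probabilities keeps the scores in a safe numeric range even when many small factors are combined.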

Prediction Procedure (Illustrative Example)

Tokenize the POI name with SmartCn.

Apply the rule‑based trimming to drop brand tokens.

Map the remaining tokens to a Boolean vector over a small dictionary, e.g., [拉面, 七天, 牛肉, 馆]. The name “好再来牛肉拉面馆” becomes [1,0,1,1].

Compute log‑posterior probabilities for each class (e.g., 火锅 vs. 快餐) using the smoothed conditional probabilities.

Select the class with the higher posterior; in the example the fast‑food class receives a probability four times that of the hot‑pot class.
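The steps above can be sketched with the Bernoulli model over the four-term dictionary. The priors and conditional probabilities below are invented purely for illustration; they are not the values behind the source's example and are not meant to reproduce its reported ratio. Tokenization and trimming are assumed to have already produced the token list.

```python
import math

DICTIONARY = ["拉面", "七天", "牛肉", "馆"]

def to_vector(tokens):
    """Boolean vector over DICTIONARY for an already tokenized, trimmed name."""
    present = set(tokens)
    return [1 if term in present else 0 for term in DICTIONARY]

def log_posterior(x, prior, cond_prob):
    """Unnormalized Bernoulli log-posterior: present terms contribute log p,
    absent terms contribute log(1 - p)."""
    score = math.log(prior)
    for xi, p in zip(x, cond_prob):
        score += math.log(p) if xi else math.log(1.0 - p)
    return score

def predict(x, priors, cond_probs):
    scores = {c: log_posterior(x, priors[c], cond_probs[c]) for c in priors}
    return max(scores, key=scores.get), scores

# Invented parameters, P(term present | class), in DICTIONARY order
priors = {"火锅": 0.5, "快餐": 0.5}
cond_probs = {
    "火锅": [0.05, 0.01, 0.30, 0.20],
    "快餐": [0.40, 0.01, 0.30, 0.30],
}

x = to_vector(["牛肉", "拉面", "馆"])   # tokens assumed for 好再来牛肉拉面馆 after trimming
label, scores = predict(x, priors, cond_probs)
```

Only the difference between the class scores matters for the decision, so the shared normalizing constant can be left out.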

Evaluation

On a random sample of 2,000 uncategorized POIs:

Coverage (fraction of POIs for which a prediction is produced): 84 %.

Accuracy (correct predictions among covered POIs): 91 %.
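The two metrics follow directly from their definitions. In the sketch below a withheld prediction is represented as None, and the reference labels are assumed to come from some form of manual review; the source does not say how they were obtained. The function name and the sample lists are hypothetical.

```python
def coverage_and_accuracy(predictions, gold):
    """predictions: predicted label per POI, or None when no prediction is made;
    gold: reference label per POI (assumed to be manually verified)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    covered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(covered) / len(predictions) if predictions else 0.0
    # Accuracy is measured only over the POIs that received a prediction
    accuracy = sum(p == g for p, g in covered) / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical sample: one POI left unlabeled, one labeled incorrectly
predictions = ["hotel", None, "fast_food", "hotel"]
gold = ["hotel", "fast_food", "fast_food", "fast_food"]
coverage, accuracy = coverage_and_accuracy(predictions, gold)
```

Reporting the two numbers together matters because either one can be raised at the expense of the other, for example by withholding predictions on uncertain names.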

Key Takeaways

Formulate the problem as short‑text classification before selecting algorithms.

Use a Boolean vector‑space model with a domain‑specific dictionary and information‑gain based feature selection for Chinese POI names.

Naive Bayes with Laplace smoothing and log‑space computation provides a fast, effective baseline; rule‑based removal of brand tokens can further improve accuracy at a modest cost in coverage.

