
Product Matching in E‑commerce: Rule‑based, Feature‑Engineering, and Pure Data‑driven Approaches Using Factorization Machines

This article examines e‑commerce product matching, comparing rule‑based methods, feature‑engineering models, and a pure data‑driven Factorization Machine approach, detailing their advantages, challenges, training techniques, and successive optimizations to improve matching accuracy and operational efficiency.

Ctrip Technology

Author Bio: Liu Yang, an algorithm engineer in the Search Department of 1hao Store and a machine-learning enthusiast and practitioner. He holds a PhD from Shanghai University, where his research covered semantic analysis and knowledge discovery. This article is based on his talk at the Ctrip Technology Salon – Cloud Sea Machine Learning Meetup.

*The accompanying video (≈37 minutes) is provided by “IT大咖说”. More videos are available on the 大咖说 platform.*

E‑commerce aims to deliver an ultimate user experience through services (browsing, delivery, customer support) and products (quality, price, variety). Understanding competitors' product information, especially discovering matching relationships across sites, is crucial for pricing, selection, and category mapping.

Characteristics of Product Matching in E‑commerce

1. Unlike similar-item retrieval, exact product matching requires all information to be consistent with no conflicts, which makes it highly challenging.
2. Matching relies on product titles, which are short and token-sensitive: a single missing or extra word can break a match, and different sites follow different naming conventions.
3. Once a match is established it rarely changes unless a product is delisted.

Rule‑based Product Matching

Rules compare primary attributes such as brand, flavor, and weight. If all compared attributes are identical, the products are considered a match.
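The comparison above can be sketched in a few lines. This is a minimal illustration, not the production rule tree: the attribute list and helper name are assumptions for demonstration.

```python
# Hypothetical primary attributes; the real rule tree covers many more.
PRIMARY_ATTRIBUTES = ["brand", "flavor", "weight"]

def rule_based_match(product_a: dict, product_b: dict) -> bool:
    """Products match only if every compared primary attribute is identical."""
    for attr in PRIMARY_ATTRIBUTES:
        if product_a.get(attr) != product_b.get(attr):
            return False
    return True
```

In practice each branch of the rule tree would add further conditional comparisons, which is exactly where the maintenance burden described below comes from.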

Advantages: Easy to intervene; mismatched cases can be quickly adjusted.

Disadvantages: As the rule tree grows, manual maintenance becomes difficult and priority determination among branches is ambiguous.

Feature‑engineering Based Product Matching

Feature engineering transforms raw data into descriptive features (e.g., brand consistency, color consistency, flavor consistency). These features are scored and fed into a supervised model for training and prediction.
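A consistency feature of this kind can be scored as 1.0/0.0 per attribute; the resulting vector is then fed to any supervised classifier. The attribute names below are illustrative assumptions.

```python
# Hypothetical consistency features: 1.0 if the attribute agrees across
# the two products, 0.0 otherwise. The resulting vector is the model input.
def consistency_features(a: dict, b: dict) -> list:
    return [1.0 if a.get(k) == b.get(k) else 0.0
            for k in ("brand", "color", "flavor")]
```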

Advantages: Focus on features and model; better features lead to simpler models and higher performance.

Disadvantages: Discovering good features is difficult; poor feature construction directly harms model performance.

Pure Data‑driven Product Matching

This approach treats every word in a title as a feature, resulting in a high‑dimensional sparse representation after one‑hot encoding. The Factorization Machine (FM) model is chosen because its pairwise feature interactions suit the sparse matrix scenario.

FM represents features with latent vectors; the interaction term is the dot product of two latent vectors, drastically reducing parameter count compared to full second‑order polynomial models and enabling efficient linear‑time training and inference.
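The FM score described above can be sketched with NumPy. This is a minimal scoring function, not a trainer: `x` is a (one-hot) feature vector, `V` holds one k-dimensional latent vector per feature, and the pairwise term uses the standard O(n·k) reformulation that makes linear-time computation possible.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """FM prediction: bias + linear term + pairwise latent-vector interactions."""
    linear = w0 + w @ x
    s = V.T @ x                   # sum_i v_i * x_i, shape (k,)
    s_sq = (V ** 2).T @ (x ** 2)  # sum_i v_i^2 * x_i^2, shape (k,)
    # 0.5 * (square-of-sum - sum-of-squares) equals the sum over all i<j
    # of <v_i, v_j> * x_i * x_j, computed in O(n*k) instead of O(n^2*k).
    pairwise = 0.5 * float(np.sum(s ** 2 - s_sq))
    return float(linear + pairwise)
```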

Training Samples

Each pair of products forms a sample: label = 1 if they match, otherwise 0. After tokenizing titles, each word becomes a feature. The same word from 1hao Store and a competitor is treated as two distinct features.

Example feature: "480773:YHD_BRAND:康师傅", where 480773 is the feature ID, YHD indicates the feature comes from 1hao Store, BRAND denotes the part of speech, and 康师傅 (Master Kong) is the actual word.
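Feature construction along these lines can be sketched as follows. The tokenization is assumed to have already produced (part-of-speech, word) pairs; the ID-assignment scheme here (first come, first numbered) is an illustrative assumption.

```python
# Global vocabulary mapping "SITE_POS:word" keys to integer feature IDs.
feature_ids = {}

def encode(site: str, tokens_with_pos):
    """Map (POS, word) pairs from one site to integer feature IDs.

    The same word from different sites gets a different key, so it is
    treated as two distinct features, as described above.
    """
    ids = []
    for pos, word in tokens_with_pos:
        key = f"{site}_{pos}:{word}"
        ids.append(feature_ids.setdefault(key, len(feature_ids)))
    return ids
```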

Training Tips

Avoid extreme class imbalance: the raw positive-to-negative ratio is about 1:70, so each epoch down-samples negatives to roughly a 1:2 or 1:3 ratio.

Shuffle the sample order before each epoch of stochastic gradient descent.

Ensure sufficient training; after each epoch, output evaluation metrics on both training and test sets.
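The first two tips can be sketched as a per-epoch sampling step. The 1:2 default ratio comes from the text; the data layout is an assumption.

```python
import random

def epoch_samples(positives, negatives, neg_ratio=2, rng=random):
    """Keep all positives, down-sample negatives, then shuffle for SGD."""
    sampled_neg = rng.sample(negatives,
                             min(len(negatives), neg_ratio * len(positives)))
    batch = list(positives) + sampled_neg
    rng.shuffle(batch)  # fresh random order each epoch
    return batch
```

Re-drawing the negatives each epoch means the model eventually sees most of the negative pool while every individual epoch stays balanced.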

Pure Data‑driven Matching Optimizations

Optimization (1) – Remove Linear Terms

A linear term reflects the contribution of a single feature, but no single word can by itself determine whether two products match; removing the linear terms improves performance.

Optimization (2) – Restrict Cross‑terms to Features from Different Products

Cross‑terms are only formed between a feature from product A and a feature from product B; intra‑product feature combinations are disallowed, yielding a performance boost.

Optimization (3) – Restrict Cross‑terms to Features of the Same Part‑of‑Speech

Only features sharing the same POS (e.g., both brands or both colors) are allowed to interact, further improving accuracy.

Optimization (4) – Share the Same Latent Vector for Identical Words

Identical words across the two titles use a single latent vector, making their dot product larger and raising the match score, which yields additional performance gains.
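The four optimizations can be sketched together as a restricted scoring function: no linear term (1), cross-terms only between the two products (2) and only within the same part of speech (3), and a single latent vector per word so identical words contribute their squared norm (4). The feature layout and lookup scheme are assumptions for illustration.

```python
import numpy as np

def match_score(feats_a, feats_b, latent):
    """Restricted FM-style score for a product pair.

    feats_a, feats_b: lists of (pos, word) pairs, one per product title.
    latent: dict mapping each word to its latent vector (shared across
    both products, implementing optimization (4)).
    """
    score = 0.0  # no bias or linear term: optimization (1)
    for pos_a, word_a in feats_a:       # cross-terms only span the two
        for pos_b, word_b in feats_b:   # products: optimization (2)
            if pos_a != pos_b:          # same POS only: optimization (3)
                continue
            # Identical words share one latent vector, so their dot
            # product is the squared norm, pushing the score up.
            score += float(np.dot(latent[word_a], latent[word_b]))
    return score
```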

Advantages: No manual feature definition is required.

Disadvantages: It is harder to intervene, and erroneous matches are difficult to correct directly.

Outlook

1hao Store currently employs rule‑based, feature‑engineering, and pure data‑driven matching, primarily using textual information. Future work aims to incorporate product images and heterogeneous data sources to further improve matching precision and recall, thereby reducing manual effort for operations teams.



Tags: e-commerce, machine learning, feature engineering, rule-based, Factorization Machines, product matching