How Machine Learning Transforms Hotel Aggregation for Real‑Time Accurate Pricing
This article traces the evolution of hotel aggregation at Mafengwo, from simple cosine-similarity matching to a machine-learning pipeline built on tokenization, feature engineering, and LightGBM, and explains how the team tackled the twin challenges of accuracy and real-time performance.
Part.1 Application Scenarios and Challenges
Travelers need clean, comfortable hotels, and online booking makes this easier. Mafengwo aggregates hotel data from many suppliers, presenting a unified list that avoids duplicate information and enables real‑time price comparison across the web.
The quality of hotel aggregation determines the "thickness" of the price options users see and shapes the personalized booking experience. About 80% of aggregation tasks are now handled automatically by machines.
1. Hotel Aggregation Application Scenario
Multiple suppliers provide overlapping hotel listings with varying descriptions. Aggregation consolidates these listings into a single view for one‑stop, real‑time price comparison.
The aggregated view shows all supplier quotes clearly, helping users make efficient booking decisions.
2. Challenges
(1) Accuracy
Different suppliers may describe the same hotel inconsistently. An aggregation error can attach a quote for hotel B to hotel A's page (an "AB store"), leading users to book the wrong hotel and causing a disastrous user experience.
(2) Real‑time
Manual aggregation guarantees high accuracy but cannot keep pace with the massive, constantly changing supplier data: price updates arrive late and significant human effort is wasted.
Part.2 Initial Solution: Cosine Similarity Algorithm
The early approach reduced manual effort by scoring candidate pairs with cosine similarity on hotel name and address, plus geographic distance:
1. Take hotel A, the hotel to be aggregated, as input.
2. Query Elasticsearch for the N most similar hotels within 5 km of A.
3. Compare A pairwise with each of the N candidates.
4. For each pair, compute name cosine similarity, address cosine similarity, and geographic distance.
5. Apply manually set thresholds to the similarities and distance to decide whether the two records describe the same hotel (sketched below).
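A minimal sketch of this V1 matching logic in Python; the field names and threshold values are illustrative assumptions, not Mafengwo's production settings, and the Elasticsearch candidate search is omitted:

```python
import math
from collections import Counter

def char_cosine(a: str, b: str) -> float:
    """Cosine similarity between character-frequency vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance between two coordinates, in kilometres."""
    rlat1, rlat2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = rlat2 - rlat1, math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Hypothetical thresholds -- the article only says they were set manually.
NAME_T, ADDR_T, DIST_KM = 0.90, 0.80, 0.5

def is_same_hotel(a: dict, b: dict) -> bool:
    """V1 decision rule: all three signals must clear their thresholds."""
    return (char_cosine(a["name"], b["name"]) >= NAME_T
            and char_cosine(a["address"], b["address"]) >= ADDR_T
            and haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= DIST_KM)
```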
This V1 workflow proved feasible at low implementation cost, but its simplicity produced many false positives: only about 30% of hotels could be aggregated automatically, and roughly 5% of the automatic matches were errors (AB stores) that still required manual review.
Part.3 Machine Learning in Hotel Aggregation
3.1 Tokenization
Coarse name and address matching was insufficient. Tokenization splits hotel names and addresses into structured tokens, enabling finer‑grained comparison and feature construction.
3.1.1 Token Dictionary
A statistical approach builds the dictionary: millions of hotel names are cut from both ends into candidate tokens, the candidates are counted by frequency, and the high-frequency tokens (hotel brands, hotel types) are sent for manual verification.
These tokens form the basis for name tokenization.
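A rough sketch of the frequency-counting idea, with illustrative cut lengths and an assumed frequency threshold:

```python
from collections import Counter

def candidate_tokens(names, max_len=4):
    """Cut each name from both ends and count the resulting candidate tokens."""
    counts = Counter()
    for name in names:
        for n in range(2, max_len + 1):      # candidate token lengths
            if len(name) >= n:
                counts[name[:n]] += 1        # cut from the front
                counts[name[-n:]] += 1       # cut from the back
    return counts

names = ["7天连锁酒店北京西站店", "如家快捷酒店上海南站店", "7天连锁酒店上海虹桥店"]
counts = candidate_tokens(names)
# High-frequency candidates (here the brand "7天") are sent to manual review,
# which keeps real brands and types and discards noise such as "站店".
frequent = [tok for tok, c in counts.most_common() if c >= 2]
```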
3.1.2 Name Tokenization
Human-like tokenization splits a name into brand, hotel-type, and location parts, e.g., telling the brand "7天" (7 Days Inn) apart from "如家" (Home Inn). Structured fields improve similarity calculations.
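One simple way to implement this, reusing the verified dictionary: greedy longest-match segmentation that tags brand and type tokens and treats the remainder as location (the tag set and vocabulary here are assumptions):

```python
BRANDS = {"7天", "如家"}                  # from the manually verified token dictionary
TYPES = {"连锁酒店", "快捷酒店", "酒店"}
VOCAB = BRANDS | TYPES

def tokenize_name(name: str, max_len: int = 4):
    """Greedy longest-match segmentation; unmatched spans become location tokens."""
    tokens, buf, i = [], "", 0
    while i < len(name):
        for n in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + n]
            if piece in VOCAB:
                if buf:                      # flush pending location characters
                    tokens.append((buf, "loc"))
                    buf = ""
                tokens.append((piece, "brand" if piece in BRANDS else "type"))
                i += n
                break
        else:                                # no dictionary hit at this position
            buf += name[i]
            i += 1
    if buf:
        tokens.append((buf, "loc"))
    return tokens

# tokenize_name("7天连锁酒店北京西站店")
# -> [('7天', 'brand'), ('连锁酒店', 'type'), ('北京西站店', 'loc')]
```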
3.1.3 Address Tokenization
Address tokenization applies the same idea to addresses, splitting them into structured components for even finer-grained comparison.
3.2 Feature Construction
Tokenized fields are compared pairwise and converted into numerical feature vectors. The usable signals are hotel name, address, and phone number (supplier coordinates proved unreliable, and email coverage is too low to help). Each pair's feature vector combines the similarity scores of name, address, and phone, as sketched below.
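A sketch of how the pairwise features might be assembled, reusing char_cosine and tokenize_name from the earlier sketches; the exact feature set is an assumption, since the article only states that name, address, and phone similarities are combined:

```python
def tagged(tokens, tag: str) -> str:
    """Concatenate the pieces of a tokenized name that carry a given tag."""
    return "".join(piece for piece, t in tokens if t == tag)

def phone_sim(p1: str, p2: str) -> float:
    """Exact match on normalized digits -- a simple stand-in for phone similarity."""
    d1 = "".join(ch for ch in p1 if ch.isdigit())
    d2 = "".join(ch for ch in p2 if ch.isdigit())
    return 1.0 if d1 and d1 == d2 else 0.0

def build_features(a: dict, b: dict) -> list:
    """Numerical feature vector for one candidate pair (a, b)."""
    ta, tb = tokenize_name(a["name"]), tokenize_name(b["name"])
    return [
        char_cosine(tagged(ta, "brand"), tagged(tb, "brand")),  # brand similarity
        char_cosine(tagged(ta, "type"), tagged(tb, "type")),    # hotel-type similarity
        char_cosine(tagged(ta, "loc"), tagged(tb, "loc")),      # location similarity
        char_cosine(a["address"], b["address"]),                # address similarity
        phone_sim(a["phone"], b["phone"]),                      # phone similarity
    ]
```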
3.3 Algorithm Selection: Decision Tree & Boosting
The problem is a supervised binary classification (same vs different hotel). After evaluating algorithms, a decision‑tree‑based approach was chosen.
3.3.1 AdaBoost / Gradient Boosting
Boosting combines many weak learners to reduce error. AdaBoost raises the weights of samples the previous learners got wrong; Gradient Boosting instead fits each new learner to the residuals of the current ensemble. Gradient Boosting is more widely used in industry.
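To make "fits residuals sequentially" concrete, here is a toy gradient-boosting loop for squared loss, where each round fits a small tree to the residuals of the current ensemble (scikit-learn is used only for the base trees; this is a didactic sketch, not the production model):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_rounds=100, lr=0.1):
    """Each new tree fits the residuals (negative gradient of squared loss)."""
    pred = np.full(len(y), y.mean())          # start from the mean prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                  # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += lr * tree.predict(X)          # add a shrunken correction
        trees.append(tree)
    return trees, pred
```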
3.3.2 XGBoost or LightGBM
Both are efficient implementations of Gradient Boosting. Benchmarks showed LightGBM uses less memory and trains faster while maintaining similar accuracy, so LightGBM was adopted.
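A minimal LightGBM training sketch for the same-hotel classifier; labeled_pairs is a hypothetical list of (hotel_a, hotel_b, label) triples, and build_features comes from the sketch above:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# label 1 = same hotel, 0 = different hotels
X = np.array([build_features(a, b) for a, b, _ in labeled_pairs])
y = np.array([label for _, _, label in labeled_pairs])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {"objective": "binary", "metric": "binary_logloss", "verbosity": -1}
model = lgb.train(params, train_set, num_boost_round=500,
                  valid_sets=[val_set], callbacks=[lgb.early_stopping(50)])

p_same = model.predict(X_val)   # probability that each pair is the same hotel
```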
3.4 Model Training & Iteration
3.4.1 Training Result Analysis
Initial training results are rarely optimal. Analyzing the misclassified pairs shows where to improve: the feature design, the similarity calculations behind individual features, or the algorithm and its parameters.
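One common way to drive that analysis, continuing the variables from the training sketch above: score the validation set and inspect the false positives, which are the would-be AB stores:

```python
from sklearn.metrics import precision_score, recall_score

y_pred = (model.predict(X_val) >= 0.5).astype(int)
print("precision:", precision_score(y_val, y_pred))   # accuracy of "same hotel" calls
print("recall:   ", recall_score(y_val, y_pred))      # share of true matches found

# False positives are the dangerous cases: pairs predicted "same" that are not.
fp_idx = np.where((y_pred == 1) & (y_val == 0))[0]
for i in fp_idx[:20]:
    print(X_val[i])   # inspect the feature values that fooled the model
```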
3.4.2 Hyper‑parameter Tuning
Key parameters include max_depth and num_leaves (control tree complexity), feature_fraction and bagging_fraction (prevent over‑fitting), and regularization terms lambda_l1 and lambda_l2.
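A hedged tuning sketch over those parameters using LightGBM's scikit-learn wrapper; the grid values are illustrative, and the wrapper's parameter aliases are noted in the comments:

```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# sklearn-API aliases: colsample_bytree = feature_fraction, subsample = bagging_fraction,
# reg_alpha = lambda_l1, reg_lambda = lambda_l2
grid = {
    "max_depth": [4, 6, 8],
    "num_leaves": [15, 31, 63],
    "colsample_bytree": [0.8, 1.0],
    "subsample": [0.8, 1.0],
    "reg_alpha": [0.0, 0.1],
    "reg_lambda": [0.0, 0.1],
}
search = GridSearchCV(
    lgb.LGBMClassifier(objective="binary", subsample_freq=1),  # freq > 0 enables bagging
    grid, scoring="precision", cv=5)                           # accuracy matters most here
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```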
3.5 Model Effect
After multiple iterations, the model reaches an accuracy above 99.92% and a recall above 85.62%, meeting the strict accuracy requirement of hotel aggregation.
3.6 Solution Summary
To summarize the pipeline: hotel names and addresses are tokenized, candidate pairs are turned into similarity feature vectors over name, address, and phone, and a LightGBM classifier decides whether each pair describes the same hotel.
Part.4 Final Thoughts
Future work includes unifying coordinate systems across domestic suppliers, closing the loop between risk control and aggregation, and extending the approach to overseas hotels, where different tokenization and stemming techniques are required.