How Machine Learning Transforms Hotel Aggregation for Real‑Time Accurate Pricing
This article traces the evolution of hotel aggregation at Mafengwo, from simple cosine-similarity matching to a machine-learning pipeline built on tokenization, feature engineering, and LightGBM, and explains how the team tackled the twin challenges of accuracy and real-time performance.
Part.1 Application Scenarios and Challenges
Travelers need clean, comfortable hotels, and online booking makes this easier. Mafengwo aggregates hotel data from many suppliers, presenting a unified list that avoids duplicate information and enables real‑time price comparison across the web.
The quality of hotel aggregation determines the "thickness" of the price options users see and shapes the personalized booking experience. About 80% of aggregation tasks are now handled automatically by machines.
1. Hotel Aggregation Application Scenario
Multiple suppliers provide overlapping hotel listings with varying descriptions. Aggregation consolidates these listings into a single view for one‑stop, real‑time price comparison.
The aggregated view shows all supplier quotes clearly, helping users make efficient booking decisions.
2. Challenges
(1) Accuracy
Different suppliers may describe the same hotel inconsistently. An aggregation error can attach a quote for hotel B to hotel A's page (an "AB store"), leading users to book the wrong hotel and causing a disastrous user experience.
(2) Real‑time
Manual aggregation guarantees high accuracy but cannot keep pace with the massive, constantly changing supplier data: price updates arrive late and significant human effort is wasted.
Part.2 Initial Solution: Cosine Similarity Algorithm
The early approach reduced manual effort by scoring candidate pairs with cosine similarity on hotel name and address, plus geographic distance:
1. Take hotel A, the hotel to be aggregated, as input.
2. Query Elasticsearch for the N most similar hotels within 5 km of A.
3. Compare A pairwise with each of the N candidates.
4. For each pair, compute name cosine similarity, address cosine similarity, and geographic distance.
5. Apply manually set thresholds to the similarities and distance to decide whether the two records describe the same hotel (sketched below).
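A minimal sketch of this V1 matching logic in Python; the field names and threshold values are illustrative assumptions, not Mafengwo's production settings, and the Elasticsearch candidate search is omitted:

```python
import math
from collections import Counter

def char_cosine(a: str, b: str) -> float:
    """Cosine similarity between character-frequency vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance between two coordinates, in kilometres."""
    rlat1, rlat2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = rlat2 - rlat1, math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Hypothetical thresholds -- the article only says they were set manually.
NAME_T, ADDR_T, DIST_KM = 0.90, 0.80, 0.5

def is_same_hotel(a: dict, b: dict) -> bool:
    """V1 decision rule: all three signals must clear their thresholds."""
    return (char_cosine(a["name"], b["name"]) >= NAME_T
            and char_cosine(a["address"], b["address"]) >= ADDR_T
            and haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= DIST_KM)
```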
This V1 workflow proved feasible at low implementation cost, but its simplicity produced many false positives: only about 30% of hotels could be aggregated automatically, and roughly 5% of the automatic matches were errors (AB stores) that still required manual review.
Part.3 Machine Learning in Hotel Aggregation
3.1 Tokenization
Coarse name and address matching was insufficient. Tokenization splits hotel names and addresses into structured tokens, enabling finer‑grained comparison and feature construction.
3.1.1 Token Dictionary
A statistical approach builds the dictionary: millions of hotel names are cut from both ends into candidate tokens, the candidates are counted by frequency, and the high-frequency tokens (hotel brands, hotel types) are sent for manual verification.
These tokens form the basis for name tokenization.
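A rough sketch of the frequency-counting idea, with illustrative cut lengths and an assumed frequency threshold:

```python
from collections import Counter

def candidate_tokens(names, max_len=4):
    """Cut each name from both ends and count the resulting candidate tokens."""
    counts = Counter()
    for name in names:
        for n in range(2, max_len + 1):      # candidate token lengths
            if len(name) >= n:
                counts[name[:n]] += 1        # cut from the front
                counts[name[-n:]] += 1       # cut from the back
    return counts

names = ["7天连锁酒店北京西站店", "如家快捷酒店上海南站店", "7天连锁酒店上海虹桥店"]
counts = candidate_tokens(names)
# High-frequency candidates (here the brand "7天") are sent to manual review,
# which keeps real brands and types and discards noise such as "站店".
frequent = [tok for tok, c in counts.most_common() if c >= 2]
```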
3.1.2 Name Tokenization
Human-like tokenization splits a name into brand, hotel-type, and location parts, e.g., telling the brand "7天" (7 Days Inn) apart from "如家" (Home Inn). Structured fields improve similarity calculations.
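One simple way to implement this, reusing the verified dictionary: greedy longest-match segmentation that tags brand and type tokens and treats the remainder as location (the tag set and vocabulary here are assumptions):

```python
BRANDS = {"7天", "如家"}                  # from the manually verified token dictionary
TYPES = {"连锁酒店", "快捷酒店", "酒店"}
VOCAB = BRANDS | TYPES

def tokenize_name(name: str, max_len: int = 4):
    """Greedy longest-match segmentation; unmatched spans become location tokens."""
    tokens, buf, i = [], "", 0
    while i < len(name):
        for n in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + n]
            if piece in VOCAB:
                if buf:                      # flush pending location characters
                    tokens.append((buf, "loc"))
                    buf = ""
                tokens.append((piece, "brand" if piece in BRANDS else "type"))
                i += n
                break
        else:                                # no dictionary hit at this position
            buf += name[i]
            i += 1
    if buf:
        tokens.append((buf, "loc"))
    return tokens

# tokenize_name("7天连锁酒店北京西站店")
# -> [('7天', 'brand'), ('连锁酒店', 'type'), ('北京西站店', 'loc')]
```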
3.1.3 Address Tokenization
Address tokenization applies the same idea to addresses, splitting them into structured components for even finer-grained comparison.
3.2 Feature Construction
Tokenized fields are compared pairwise and converted into numerical feature vectors. The usable signals are hotel name, address, and phone number (supplier coordinates proved unreliable, and email coverage is too low to help). Each pair's feature vector combines the similarity scores of name, address, and phone, as sketched below.
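A sketch of how the pairwise features might be assembled, reusing char_cosine and tokenize_name from the earlier sketches; the exact feature set is an assumption, since the article only states that name, address, and phone similarities are combined:

```python
def tagged(tokens, tag: str) -> str:
    """Concatenate the pieces of a tokenized name that carry a given tag."""
    return "".join(piece for piece, t in tokens if t == tag)

def phone_sim(p1: str, p2: str) -> float:
    """Exact match on normalized digits -- a simple stand-in for phone similarity."""
    d1 = "".join(ch for ch in p1 if ch.isdigit())
    d2 = "".join(ch for ch in p2 if ch.isdigit())
    return 1.0 if d1 and d1 == d2 else 0.0

def build_features(a: dict, b: dict) -> list:
    """Numerical feature vector for one candidate pair (a, b)."""
    ta, tb = tokenize_name(a["name"]), tokenize_name(b["name"])
    return [
        char_cosine(tagged(ta, "brand"), tagged(tb, "brand")),  # brand similarity
        char_cosine(tagged(ta, "type"), tagged(tb, "type")),    # hotel-type similarity
        char_cosine(tagged(ta, "loc"), tagged(tb, "loc")),      # location similarity
        char_cosine(a["address"], b["address"]),                # address similarity
        phone_sim(a["phone"], b["phone"]),                      # phone similarity
    ]
```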
3.3 Algorithm Selection: Decision Tree & Boosting
The problem is a supervised binary classification (same vs different hotel). After evaluating algorithms, a decision‑tree‑based approach was chosen.
3.3.1 AdaBoost / Gradient Boosting
Boosting combines many weak learners to reduce error. AdaBoost raises the weights of samples the previous learners got wrong; Gradient Boosting instead fits each new learner to the residuals of the current ensemble. Gradient Boosting is more widely used in industry.
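To make "fits residuals sequentially" concrete, here is a toy gradient-boosting loop for squared loss, where each round fits a small tree to the residuals of the current ensemble (scikit-learn is used only for the base trees; this is a didactic sketch, not the production model):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_rounds=100, lr=0.1):
    """Each new tree fits the residuals (negative gradient of squared loss)."""
    pred = np.full(len(y), y.mean())          # start from the mean prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                  # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += lr * tree.predict(X)          # add a shrunken correction
        trees.append(tree)
    return trees, pred
```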
3.3.2 XGBoost or LightGBM
Both are efficient implementations of Gradient Boosting. Benchmarks showed LightGBM uses less memory and trains faster while maintaining similar accuracy, so LightGBM was adopted.
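A minimal LightGBM training sketch for the same-hotel classifier; labeled_pairs is a hypothetical list of (hotel_a, hotel_b, label) triples, and build_features comes from the sketch above:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# label 1 = same hotel, 0 = different hotels
X = np.array([build_features(a, b) for a, b, _ in labeled_pairs])
y = np.array([label for _, _, label in labeled_pairs])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {"objective": "binary", "metric": "binary_logloss", "verbosity": -1}
model = lgb.train(params, train_set, num_boost_round=500,
                  valid_sets=[val_set], callbacks=[lgb.early_stopping(50)])

p_same = model.predict(X_val)   # probability that each pair is the same hotel
```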
3.4 Model Training & Iteration
3.4.1 Training Result Analysis
Initial training results are rarely optimal. Analyzing the misclassified pairs shows where to improve: the feature design, the similarity calculations behind individual features, or the algorithm and its parameters.
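One common way to drive that analysis, continuing the variables from the training sketch above: score the validation set and inspect the false positives, which are the would-be AB stores:

```python
from sklearn.metrics import precision_score, recall_score

y_pred = (model.predict(X_val) >= 0.5).astype(int)
print("precision:", precision_score(y_val, y_pred))   # accuracy of "same hotel" calls
print("recall:   ", recall_score(y_val, y_pred))      # share of true matches found

# False positives are the dangerous cases: pairs predicted "same" that are not.
fp_idx = np.where((y_pred == 1) & (y_val == 0))[0]
for i in fp_idx[:20]:
    print(X_val[i])   # inspect the feature values that fooled the model
```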
3.4.2 Hyper‑parameter Tuning
Key parameters include max_depth and num_leaves (control tree complexity), feature_fraction and bagging_fraction (prevent over‑fitting), and regularization terms lambda_l1 and lambda_l2.
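A hedged tuning sketch over those parameters using LightGBM's scikit-learn wrapper; the grid values are illustrative, and the wrapper's parameter aliases are noted in the comments:

```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# sklearn-API aliases: colsample_bytree = feature_fraction, subsample = bagging_fraction,
# reg_alpha = lambda_l1, reg_lambda = lambda_l2
grid = {
    "max_depth": [4, 6, 8],
    "num_leaves": [15, 31, 63],
    "colsample_bytree": [0.8, 1.0],
    "subsample": [0.8, 1.0],
    "reg_alpha": [0.0, 0.1],
    "reg_lambda": [0.0, 0.1],
}
search = GridSearchCV(
    lgb.LGBMClassifier(objective="binary", subsample_freq=1),  # freq > 0 enables bagging
    grid, scoring="precision", cv=5)                           # accuracy matters most here
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```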
3.5 Model Effect
After multiple iterations, the model reaches an accuracy above 99.92% and a recall above 85.62%, meeting the strict accuracy requirement of hotel aggregation.
3.6 Solution Summary
To summarize the pipeline: hotel names and addresses are tokenized, candidate pairs are turned into similarity feature vectors over name, address, and phone, and a LightGBM classifier decides whether each pair describes the same hotel.
Part.4 Final Thoughts
Future work includes unifying coordinate systems across domestic suppliers, closing the loop between risk control and aggregation, and extending the approach to overseas hotels, where different tokenization and stemming techniques are required.