Housing Price Estimation and Average Price Calculation Using 58.com Data and CatBoost
This article presents a comprehensive overview of 58.com’s real‑estate price system, describes how average prices are computed from platform data, explains three anomaly‑detection methods, and details a CatBoost‑based machine‑learning model for automated house valuation, including feature engineering and evaluation metrics.
The session, hosted by DataFunTalk, features Zhou Tong, a senior algorithm engineer from 58.com, who explains how the platform leverages existing housing features such as listing prices to perform valuation.
1. Platform Overview – 58.com and Anjuke are China’s largest real‑estate information platforms, covering new homes, second‑hand homes, rentals, and commercial properties. Data sources include operations teams, agents, new‑home consultants, KOLs, and internal data calculations.
2. Real‑Estate Price Services – Services include price lookup, price reports, price maps, transaction queries, and valuation. The price system consists of five components: average price calculation, transaction data, house‑level valuation, agent valuation, and a benchmark price (磐石价格).
3. Average Price Calculation – Four methods are compared: product‑home price (most accurate but least timely), transaction average price, listing average price (best trade‑off), and government‑listed price (available only in Shenzhen). The listing average price is chosen for its timeliness.
4. Issues with Listing Prices – Agents often list below the realistic price to attract buyers, which biases the listing median downward. The median is therefore not used; a percentile above the 50th is selected instead. Outliers are also removed, because prices can vary widely within a single community.
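The percentile choice can be illustrated with a minimal sketch. The listing prices and the 60th percentile below are assumptions made for illustration; the talk does not state which percentile 58.com actually uses.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical listing unit prices (yuan per square metre) for one community.
listings = [48000, 49000, 50000, 51000, 52000, 53000, 54000]

median_price = percentile(listings, 50)
# 60 is an assumed value; any percentile above 50 offsets low-priced listings.
community_price = percentile(listings, 60)

print(median_price, community_price)
```

Choosing a percentile above the median shifts the community price upward, which counteracts listings that were deliberately priced low.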
5. Anomaly Detection Methods
Z‑Score – flags samples deviating more than two standard deviations from the mean.
DBSCAN – density‑based clustering that marks sparsely populated points as anomalies.
Isolation Forest – isolates each point with random splits; points that are isolated after fewer splits (shorter paths) are more likely to be anomalies.
Each method is evaluated by the coefficient of variation of community average prices after outlier removal; a lower value indicates a more stable price series.
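A minimal sketch of the Z‑Score filter and the coefficient‑of‑variation check, using only the Python standard library. The weekly listing data is invented, and the function names are hypothetical. Only the two‑standard‑deviation threshold comes from the talk.

```python
import statistics


def remove_zscore_outliers(prices, threshold=2.0):
    """Keep prices within `threshold` standard deviations of the mean."""
    mean = statistics.fmean(prices)
    std = statistics.pstdev(prices)
    if std == 0:
        return list(prices)
    return [p for p in prices if abs(p - mean) / std <= threshold]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean; lower means more stable."""
    return statistics.pstdev(values) / statistics.fmean(values)


# Hypothetical listing unit prices for one community over three weeks.
weeks = [
    [50000, 51000, 49500, 50500, 52000, 49000, 51500, 120000],
    [50200, 50800, 49800, 50600, 51800, 49400, 51200, 50400],
    [50100, 51100, 49700, 50300, 51900, 49200, 51300, 15000],
]

raw_averages = [statistics.fmean(week) for week in weeks]
clean_averages = [statistics.fmean(remove_zscore_outliers(week)) for week in weeks]

cv_raw = coefficient_of_variation(raw_averages)
cv_clean = coefficient_of_variation(clean_averages)
print(cv_raw, cv_clean)
```

In this toy data the weekly averages become far more stable once the two extreme listings are dropped, which is the effect the coefficient of variation is meant to measure. DBSCAN or Isolation Forest could be compared the same way by swapping the filter function.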
6. House Valuation
The traditional market‑comparison method selects similar recent transactions, adjusts each one for status, date, region, and age, and then averages the adjusted prices. A machine‑learning model is built to automate this process.
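The market‑comparison method can be sketched as follows. Applying the adjustments as multiplicative factors is an assumption, and the `Comparable` class, its field names, and all numbers are hypothetical; the talk does not specify how the adjustments are applied.

```python
from dataclasses import dataclass


@dataclass
class Comparable:
    """A similar recent transaction with hypothetical adjustment factors."""
    unit_price: float     # transaction unit price of the comparable home
    status_factor: float  # adjustment for transaction status
    date_factor: float    # adjustment for time elapsed since the deal
    region_factor: float  # adjustment for location differences
    age_factor: float     # adjustment for building age


def adjusted_price(c):
    """Apply every adjustment factor multiplicatively (an assumption)."""
    return (c.unit_price * c.status_factor * c.date_factor
            * c.region_factor * c.age_factor)


def market_comparison_estimate(comparables):
    """Average the adjusted prices of the selected comparables."""
    if not comparables:
        raise ValueError("at least one comparable is required")
    return sum(adjusted_price(c) for c in comparables) / len(comparables)


comparables = [
    Comparable(50000, 1.0, 1.02, 1.0, 0.98),
    Comparable(52000, 1.0, 1.00, 0.97, 1.0),
    Comparable(49000, 1.0, 1.01, 1.0, 1.0),
]
estimate = market_comparison_estimate(comparables)
print(round(estimate, 2))
```

Each factor has to be set by an appraiser, which is the manual effort the machine‑learning model is meant to replace.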
The model uses real‑time listing data to predict a log‑error between transaction price and unit price, applying a log transformation to stabilize variance.
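The following sketch shows why a log‑error captures relative rather than absolute differences. The function takes two generic prices; which pair of prices the production model compares is not spelled out beyond the sentence above, and all numbers here are invented.

```python
import math


def log_error(price_a, price_b):
    """Difference of log prices, equal to log(price_a / price_b)."""
    return math.log(price_a) - math.log(price_b)


# A 10% gap on a cheaper home and on a more expensive home (hypothetical values).
cheap = log_error(1_100_000, 1_000_000)
expensive = log_error(11_000_000, 10_000_000)

# The absolute gaps differ tenfold, but the log-errors are the same.
print(cheap, expensive)
```

Because the log‑error depends only on the ratio of the two prices, homes in expensive and inexpensive cities contribute errors on a comparable scale.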
Because many features are categorical (city, district, floor, layout, etc.), CatBoost is chosen for its native handling of categorical variables, automatic feature combinations, reduced need for hyper‑parameter tuning, multi‑GPU support, and symmetric (oblivious) trees, which act as a form of regularization.
CatBoost training employs ordered boosting: the prediction for each sample comes from a model trained only on the samples that precede it in a random permutation, so a sample's own target never influences its prediction. This reduces bias and over‑fitting. The final model uses roughly 20 engineered features and is updated weekly.
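The "only earlier samples" principle also underlies ordered target statistics, the technique CatBoost uses to encode categorical features. The sketch below is a simplified pure‑Python illustration of that idea, not CatBoost's implementation: the input order stands in for the random permutation, and the function name, the `prior_weight` parameter, and the data are hypothetical.

```python
def ordered_target_statistics(categories, targets, prior, prior_weight=1.0):
    """Encode each category using only the targets of earlier samples."""
    sums = {}
    counts = {}
    encoded = []
    for category, target in zip(categories, targets):
        previous_sum = sums.get(category, 0.0)
        previous_count = counts.get(category, 0)
        # Smoothed mean of earlier targets; falls back to the prior when
        # the category has not been seen yet.
        encoded.append((previous_sum + prior_weight * prior)
                       / (previous_count + prior_weight))
        # Only now does the current sample become visible to later ones.
        sums[category] = previous_sum + target
        counts[category] = previous_count + 1
    return encoded


# Hypothetical city IDs and unit prices (in thousands of yuan).
cities = ["A", "B", "A", "A", "B"]
prices = [10.0, 20.0, 12.0, 14.0, 22.0]
prior = sum(prices) / len(prices)

encoded = ordered_target_statistics(cities, prices, prior)
print(encoded)
```

Since no sample's own target enters its encoding, the encoded feature does not leak the label, which is the same motivation given for ordered boosting.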
7. Q&A Highlights
Policy factors are not yet modelled; they are handled manually.
Factorization Machines were tried but performed worse than CatBoost.
Region IDs are fed directly as categorical features.
Missing values are imputed with mode or mean.
Log‑error is used to better capture relative differences.
City ID is the most important feature for price prediction.
The presentation concludes that CatBoost’s handling of high‑cardinality categorical data and its ordered boosting make it highly suitable for large‑scale real‑estate price estimation tasks.