Lead Quality Prediction for Real Estate: Data, Modeling, and Interpretability
This article presents a case study on building and deploying a lead‑quality classification model for a high‑value, low‑frequency real‑estate platform, covering business context, data challenges, sampling strategies, feature engineering, model selection, tuning, evaluation metrics, interpretability analysis, and observed performance improvements.
The business operates in an algorithm‑unfriendly real‑estate scenario where user interactions are sparse, sample sizes are limited, and the sales cycle is long; leads collected online are noisy and must be filtered by customer service before being passed to offline consultants.
An internal algorithm platform was built with flexible configuration, supporting monitoring, task scheduling, model validation, visualization, and A/B testing, and it integrates multiple engines such as Spark‑MLlib, XGBoost, TensorFlow, and PyTorch via pipelines.
Because the conversion rate from lead to purchase is extremely low, a proxy target—whether a lead results in a house‑viewing—was used to define positive and negative samples; techniques like undersampling, oversampling (e.g., SMOTE), and weighting were applied to mitigate class imbalance.
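The imbalance-handling step can be sketched with scikit-learn on synthetic data. This is a minimal illustration, not the production pipeline: the data, ratios, and model are assumptions; SMOTE (from the imbalanced-learn package) would replace the random undersampling shown here when synthetic minority samples are wanted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for lead data with rare positives
# (proxy target: did the lead result in a house-viewing?).
X = rng.normal(size=(5000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=5000) > 3.5).astype(int)

# Undersampling: keep every positive, subsample negatives to a 1:3 ratio.
pos_idx = np.flatnonzero(y == 1)
neg_idx = rng.choice(np.flatnonzero(y == 0), size=3 * len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_idx])
X_bal, y_bal = X[keep], y[keep]

# Weighting: alternatively leave the data as-is and reweight classes so the
# minority class contributes proportionally more to the loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

In practice the choice between undersampling, oversampling, and weighting is validated against the business metric rather than fixed up front.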
Feature engineering grouped variables into three types: source information (channel, device, fraud detection), app‑behavior features (clicks, searches, page views), and user‑stickiness metrics (activity frequency, recency, depth and breadth of navigation). In deep‑learning settings embeddings could replace much of this engineering, but here careful preprocessing (discretization, conversion rates, PCA, tree‑path features) remained important.
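Two of the listed preprocessing steps, discretization and tree-path features, can be sketched as follows. The data and model are hypothetical stand-ins; the tree-path trick (each sample's leaf index per tree becomes a categorical feature, one-hot encoded for a downstream linear model) is the well-known GBDT+LR pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # stand-in for app-behavior counts
y = (X[:, 0] > 0.5).astype(int)

# Discretization: bucket continuous behavior counts into quantile bins,
# which makes downstream linear models robust to heavy-tailed counts.
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)

# Tree-path features: the leaf each sample lands in, per tree, is a learned
# high-order feature crossing; one-hot it for a linear downstream model.
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)
leaves = forest.apply(X)              # shape (n_samples, n_trees)
X_paths = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)
```

Conversion-rate features (e.g., historical viewing rate per channel) and PCA would be computed on training folds only, to avoid target leakage.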
Model training employed traditional classifiers—Logistic Regression, Random Forest, XGBoost, LightGBM—and some deep‑learning attempts; hyper‑parameter optimization used grid search, manual tuning, and Bayesian optimization, while city‑specific thresholds addressed regional behavior differences.
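The grid-search portion of the tuning workflow can be sketched with scikit-learn. The grid, scorer, and model here are illustrative assumptions; `GradientBoostingClassifier` stands in for XGBoost/LightGBM so the example stays self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced toy classification problem standing in for the lead data.
X, y = make_classification(n_samples=800, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Small grid over tree depth and learning rate, selected by cross-validated AUC.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
best = grid.best_estimator_
```

City-specific thresholds would then be tuned separately on each city's validation scores rather than baked into the model itself.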
Business‑aligned evaluation focused on recall when only the top 30 % or top 70 % of leads by predicted score are retained (Recall@30%, Recall@70%); score thresholds were chosen so that the selected top fraction of leads captured a high proportion of eventual viewings or purchases, directly reflecting operational impact.
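The metric itself is simple to state in code: rank leads by score, keep the top fraction, and measure what share of all positives that slice captures. The helper below is an assumed implementation, not the team's own code.

```python
import numpy as np

def recall_at_pct(y_true, scores, pct):
    """Recall when only the top `pct` fraction of leads by score is kept."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(round(pct * len(scores))))
    top = np.argsort(scores)[::-1][:k]        # indices of the k highest scores
    return y_true[top].sum() / y_true.sum()

# Toy example: 10 leads, 3 true viewings.
y_true = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
recall_at_pct(y_true, scores, 0.30)  # top 3 leads capture 2 of 3 positives
```

A per-city threshold is then whatever score cutoff corresponds to the chosen top percentage in that city's score distribution.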
Interpretability analysis used SHAP, revealing that phone‑call features, channel source, and lead recency were most influential, while some high‑frequency page‑view features surprisingly correlated with lower lead quality.
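To show what a SHAP attribution means without depending on the `shap` package, here is a brute-force computation of exact Shapley values for a single prediction, with absent features replaced by a baseline. This is the quantity SHAP's TreeExplainer computes efficiently for tree ensembles; the toy linear "lead scorer" is an assumption for illustration.

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one sample, by enumerating all feature
    subsets; absent features are set to `baseline` (e.g. the training mean)."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                # Classic Shapley weight: |S|! (n-|S|-1)! / n!
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                z_with, z_without = baseline.copy(), baseline.copy()
                for j in S:
                    z_with[j] = x[j]
                    z_without[j] = x[j]
                z_with[i] = x[i]
                phi[i] += w * (predict(z_with) - predict(z_without))
    return phi

# Toy linear scorer: for linear models the Shapley value of feature i
# reduces to w_i * (x_i - baseline_i).
w = np.array([2.0, -1.0, 0.5])
predict = lambda z: float(w @ z)
x, base = np.array([1.0, 1.0, 1.0]), np.zeros(3)
phi = shapley_values(predict, x, base)  # -> [2.0, -1.0, 0.5]
```

The counterintuitive page-view finding is exactly the kind of pattern such attributions surface: a feature can be predictive with a sign opposite to intuition.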
Deployment results showed a 17‑percentage‑point lift in purchase conversions and a reduction of the lead‑to‑viewing cycle from two months to two weeks; future work includes incorporating offline interaction data, cross‑device integration, and model stacking to further improve performance.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.