
Second‑hand Housing Recommendation System: Business Background, Vector Recall, Multi‑objective Optimization and Future Plans

This article presents the end‑to‑end practice of a second‑hand housing recommendation system at 58.com and Anjuke, covering business background, embedding‑based vector recall, multi‑objective ranking methods such as ESMM and MMOE, experimental results, and future development directions.

DataFunTalk

58.com and Anjuke are the largest platforms for house hunting in China, serving millions of agents and tens of millions of users. In this business scenario we monitor multiple metrics such as clicks, micro‑chat and phone calls, and share the practice of a second‑hand housing recommendation system with a focus on multi‑objective ranking algorithms.

Business background & scenario

The platform includes both 58.com and Anjuke, with daily active users exceeding 5 million. It covers new houses, second‑hand houses, rentals, and commercial properties, with second‑hand houses being the core business. Users browse listings, then use phone, micro‑chat, VR tours or appointment features to communicate with agents; successful communication leads to offline viewings and eventual transactions.

Compared with traditional recommendation systems, the second‑hand housing funnel is longer and deeper, involving both online and offline stages.

Recommendation scenarios

Recommendations appear on the homepage, category pages, channel pages, zero‑result pages, and detail pages (e.g., same‑price recommendations, "also viewed", cross‑business recommendations such as building or decoration suggestions).

Recommendation architecture

The architecture consists of a data layer and a model layer. The data layer collects client and server logs in real time, performs offline/online feature engineering to generate user, item and relational features. The model layer follows the typical pipeline of recall, ranking and re‑ranking.

Vectorized recall

Embedding evolution: we started with a Skip-gram model trained on users' house-browsing sequences, moved to Graph Embedding (DeepWalk) to capture house co-occurrence in a graph, and finally adopted EGES, which incorporates side information (district, price, etc.) to alleviate cold start.
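To make the DeepWalk step concrete, here is a minimal sketch of how browse sessions can be turned into a co-occurrence graph and random walks; the session data and house IDs are invented for illustration, and the resulting walks would be fed to a Skip-gram trainer (not shown):

```python
import random
from collections import defaultdict

# Hypothetical browse sessions: each list is the house IDs one user viewed in order.
sessions = [["h1", "h2", "h3"], ["h2", "h3", "h4"], ["h1", "h4"]]

# Build a directed co-occurrence graph from consecutive views.
graph = defaultdict(list)
for s in sessions:
    for a, b in zip(s, s[1:]):
        graph[a].append(b)

def random_walk(start, length, rng):
    """Walk the co-occurrence graph; stop early at a dead end."""
    walk = [start]
    while len(walk) < length:
        nbrs = graph.get(walk[-1])
        if not nbrs:
            break
        walk.append(rng.choice(nbrs))
    return walk

rng = random.Random(42)
walks = [random_walk(node, 5, rng) for node in list(graph) for _ in range(2)]
# These walks play the role of "sentences" for Skip-gram training.
```

In production (per the Q&A below), such walks are generated with Spark over a much larger graph and written to HDFS.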

Real‑time embedding recall solutions:

Solution 1: Retrieve items from a Faiss index using the embedding of the most recently browsed house (fast, simple).

Solution 2: Retrieve separately for each browsed house, then merge and take the top‑N (slower but better performance).

Solution 3: Pre‑compute similar‑house embeddings offline (item‑based collaborative filtering) and merge at inference (fast, high quality, but requires large storage).
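A minimal sketch of Solution 2's merge logic. A brute-force numpy inner-product search stands in for the real Faiss index, and all sizes, IDs, and the max-score merge rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(1000, 16)).astype("float32")  # candidate house embeddings
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)

def search(query, k):
    """Brute-force stand-in for a Faiss inner-product search."""
    scores = item_vecs @ query
    idx = np.argsort(-scores)[:k]
    return list(zip(idx.tolist(), scores[idx].tolist()))

def recall_for_history(history_vecs, k_per_query=20, top_n=50):
    """Solution 2: retrieve per browsed house, merge by best score, take top-N."""
    best = {}
    for q in history_vecs:
        for item_id, score in search(q, k_per_query):
            best[item_id] = max(score, best.get(item_id, float("-inf")))
    merged = sorted(best.items(), key=lambda kv: -kv[1])[:top_n]
    return [item_id for item_id, _ in merged]

history = item_vecs[[3, 7, 42]]  # embeddings of recently browsed houses
candidates = recall_for_history(history)
```

Solution 1 is the special case where `history_vecs` contains only the most recently browsed house; Solution 3 replaces the online `search` calls with an offline precomputed similar-house table.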

Embedding fusion:

Approach 1: Maintain a separate Faiss index for each embedding and query all of them (high maintenance cost).

Approach 2: Normalize each embedding, weight them, and concatenate into a single vector for a single Faiss query (same effect as approach 1 with lower cost).
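The equivalence behind Approach 2 can be sketched as follows. One common weighting choice (an assumption here, not stated in the talk) is to scale each normalized sub-vector by the square root of its weight on both the query and item sides, so that a single inner product over the fused vectors equals the weighted sum of per-source cosine similarities:

```python
import numpy as np

def fuse(embs, weights):
    """L2-normalize each source embedding, scale by sqrt(weight), concatenate.
    With sqrt-weights applied on both query and item sides, one inner product
    over the fused vectors equals the weighted sum of per-source cosines."""
    return np.concatenate(
        [np.sqrt(w) * e / np.linalg.norm(e) for e, w in zip(embs, weights)]
    )

rng = np.random.default_rng(1)
q_parts = [rng.normal(size=8), rng.normal(size=8)]  # e.g. Skip-gram + EGES query vectors
i_parts = [rng.normal(size=8), rng.normal(size=8)]  # the same two sources for one item
weights = [0.6, 0.4]

fused_score = fuse(q_parts, weights) @ fuse(i_parts, weights)

per_source = sum(
    w * (q / np.linalg.norm(q)) @ (i / np.linalg.norm(i))
    for q, i, w in zip(q_parts, i_parts, weights)
)
```

This is why a single Faiss index over the concatenated vectors reproduces the multi-index result at a fraction of the maintenance cost.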

Multi‑objective optimization

Why multi‑objective? The ultimate goal is transaction, but the funnel includes impression → click → connection (call, chat, VR) → viewing → transaction. Optimizing only CTR may bias towards low‑price houses; we need to also optimize the connection conversion rate.

Methods:

Multi‑model fusion: a Wide&Deep CTR model and a connection‑conversion model are trained separately and combined with a linear model.

ESMM (Entire Space Multi‑Task Model): trains pCTR and pCTCVR as two tasks over the full impression space and derives pCVR implicitly as pCTCVR / pCTR, which eliminates sample‑selection bias; sharing embeddings between the tasks also alleviates data sparsity.
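The core of ESMM's loss can be sketched in a few lines of numpy: both heads are supervised on the full impression space, and pCVR only ever appears through the product pCTCVR = pCTR × pCVR. The toy probabilities and labels below are invented for illustration:

```python
import numpy as np

def esmm_loss(p_ctr, p_cvr, click, conversion):
    """ESMM trains two heads over the full impression space:
    pCTR against click labels, and pCTCVR = pCTR * pCVR against
    conversion labels. pCVR is never fit on clicked samples alone,
    which is what removes the sample-selection bias."""
    p_ctcvr = p_ctr * p_cvr
    eps = 1e-7
    bce = lambda p, y: -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return bce(p_ctr, click).mean() + bce(p_ctcvr, conversion).mean()

# Toy impression-level data; note conversion implies click.
p_ctr = np.array([0.9, 0.2, 0.7, 0.1])
p_cvr = np.array([0.5, 0.3, 0.8, 0.2])
click = np.array([1.0, 0.0, 1.0, 0.0])
conversion = np.array([1.0, 0.0, 0.0, 0.0])

loss = esmm_loss(p_ctr, p_cvr, click, conversion)
```

Because pCTCVR = pCTR × pCVR by construction, the predicted funnel is always consistent: pCTCVR can never exceed pCTR.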

Loss function: Focal Loss is used to address class imbalance.
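A minimal numpy sketch of binary Focal Loss, with commonly used default hyperparameters (gamma=2, alpha=0.25 — an assumption, since the talk does not give the values used):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: the (1 - p_t)**gamma factor down-weights easy,
    well-classified examples so rare positives (e.g. connections) are not
    drowned out by the abundant negatives."""
    p_t = np.where(y == 1, p, 1 - p)          # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return (-alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-7)).mean()

y = np.array([1.0, 0.0, 0.0, 0.0])
easy = focal_loss(np.array([0.9, 0.1, 0.1, 0.1]), y)  # confident, correct predictions
hard = focal_loss(np.array([0.6, 0.4, 0.4, 0.4]), y)  # uncertain predictions
```

Confident correct predictions contribute far less loss than uncertain ones, which is exactly the rebalancing effect wanted for the sparse connection label.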

Training uses 60 million samples, updated daily, with the Adam optimizer, Dropout, L2 regularization, Batch Normalization, and experiments across learning rates and batch sizes.

MMOE experiments

MMOE (Multi‑gate Mixture‑of‑Experts) introduces task‑specific gates on shared expert networks, allowing soft sharing. In our scenario MMOE performed worse than ESMM because click and connection are highly dependent.
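The gating mechanism described above can be sketched with plain numpy linear layers; the expert/gate counts and dimensions are illustrative, and a real model would use a DNN framework with nonlinear experts and task towers on top:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, expert_ws, gate_ws):
    """MMOE: shared experts, plus one softmax gate per task that mixes
    the expert outputs into that task's input representation."""
    experts = np.stack([x @ W for W in expert_ws])   # (n_experts, batch, d_out)
    task_inputs = []
    for G in gate_ws:                                # one gate per task
        gates = softmax(x @ G)                       # (batch, n_experts)
        mixed = np.einsum("be,ebd->bd", gates, experts)
        task_inputs.append(mixed)
    return task_inputs

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 10))                              # batch of 4, 10 features
expert_ws = [rng.normal(size=(10, 6)) for _ in range(3)]  # 3 shared experts
gate_ws = [rng.normal(size=(10, 3)) for _ in range(2)]    # 2 tasks: click, connection

task_outputs = mmoe_forward(x, expert_ws, gate_ws)
```

When the two tasks are as strongly coupled as click and connection, the per-task gates tend to learn similar mixtures, which is consistent with MMOE underperforming ESMM here.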

Future direction: combine ideas from Tencent to split MMOE into two tasks (CTR and CVR) each learned with FM, then jointly train a CTCVR model.

Summary & planning

Three focus areas: (1) data – continue to integrate more online and offline signals to better model user interests; (2) recall – keep exploring high‑quality, low‑latency recall methods; (3) ranking – acquire more downstream behavior data (viewings, transactions) to optimize beyond CTR and connection.

Q&A

Q: How often is EGES updated and how is the graph stored? A: Training is performed every three days due to data volume; the graph is kept in memory, random walks are generated with Spark, and the resulting sequences are stored on HDFS for model training.

Q: What is the scale of embeddings in Faiss and which index type is used? A: About 20 million house embeddings, indexed with Faiss IndexIVFFlat.

Q: Is distributed training used? What is the online inference latency and feature count? A: Currently a single‑GPU training setup is used; inference latency is ~5 ms and each sample has 400–500 features.

Q: What optimizations reduce inference time? A: ESMM’s long‑short architecture, the wpai online inference service, and continuous platform optimizations.

Thank you for listening.

Tags: recommendation system, Faiss, embedding, vector recall, real estate, multi-objective optimization, ESMM
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
