
From Naïve Algorithms to Scalable Recommendations: Jiayuan’s Journey

This article chronicles the evolution of Jiayuan’s dating recommendation system, from early item‑based kNN experiments, through an engineering year centered on feature engineering, to a product‑oriented optimization phase. It also reviews several advanced machine‑learning techniques the author explored.

21CTO

Jan 6, 2016


Summarizing and reviewing, not how fast you move, are what make people grow.

I have been writing this sentence down for half a year; this article is the first time I actually put it into practice.

This article is a personal summary of the technologies I have encountered over the past few years. It has two parts: the first describes the development history of Jiayuan’s user recommendation system and reflects our thinking and learning process; the second lists representative, academically oriented techniques, which readers interested only in practical implementation can skip.

Jiayuan User Recommendation System

Naïve Algorithm Years: 2011‑2013

In August 2011 I joined Jiayuan and initially worked on optimizing the dating recommendation system. The team consisted of three people and focused on recommendation and support for new product interfaces. Lacking industrial experience, I started with familiar recommendation algorithms.

During 2011‑2013 we tried two directions. The first was an item‑based kNN algorithm, optimized in three ways:

Offline computation efficiency: from single‑machine to Hadoop distributed computing.

Offline computation effectiveness: experimented with different similarity measures and prediction methods, but gains were limited.

Real‑time online computation: moved from pre‑computed offline results stored in cache to a real‑time Java recommendation service built on Dubbo.
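As a rough illustration of the item‑based kNN idea above, the sketch below treats each profile as an "item" and the users who messaged it as its audience. The log format (sender, recipient pairs), the cosine similarity, and the function names are assumptions made for this example; the article does not say which similarity measure or data layout the team used.

```python
from collections import defaultdict
from math import sqrt

def item_similarities(messages):
    """Cosine similarity between profiles, based on overlap of their senders.

    messages: list of (sender_id, recipient_id) pairs (hypothetical log format).
    """
    senders_of = defaultdict(set)
    for sender, recipient in messages:
        senders_of[recipient].add(sender)

    sims = defaultdict(dict)
    profiles = list(senders_of)
    for i, a in enumerate(profiles):
        for b in profiles[i + 1:]:
            common = len(senders_of[a] & senders_of[b])
            if common:
                s = common / sqrt(len(senders_of[a]) * len(senders_of[b]))
                sims[a][b] = s
                sims[b][a] = s
    return sims

def recommend(user, messages, sims, k=2, top_n=3):
    """Score unseen profiles by similarity to profiles the user already messaged."""
    contacted = {r for s, r in messages if s == user}
    scores = defaultdict(float)
    for item in contacted:
        # Only the k most similar neighbours of each contacted profile contribute.
        neighbours = sorted(sims.get(item, {}).items(), key=lambda kv: -kv[1])[:k]
        for other, s in neighbours:
            if other not in contacted and other != user:
                scores[other] += s
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]

messages = [
    ("m1", "w1"), ("m1", "w2"),
    ("m2", "w1"), ("m2", "w2"), ("m2", "w3"),
    ("m3", "w2"), ("m3", "w3"),
    ("m4", "w1"),
]
sims = item_similarities(messages)
recs = recommend("m4", messages, sims)
print(recs)  # ['w2', 'w3']
```

The pairwise loop is quadratic in the number of profiles, which hints at why the offline computation had to move from a single machine to a distributed job.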

The item‑based kNN was initially driven by maximizing the number of messages sent by users, but we discovered a mismatch with business needs: showing attractive profiles to men caused a few women to receive most messages while many received none, which the product team opposed. Adjustments to balance other metrics (e.g., received messages) were ineffective.
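One way to make the imbalance described above measurable is to track how concentrated received messages are. The two metrics below (share of messages going to the top fraction of recipients, and the fraction of recipients with no messages) are illustrative choices, not metrics the article says the team used; the function names and log format are hypothetical.

```python
from collections import Counter

def top_share(messages, all_recipients, top_fraction=0.1):
    """Share of all messages received by the most-messaged top_fraction of recipients."""
    received = Counter(r for _, r in messages)
    counts = sorted((received.get(r, 0) for r in all_recipients), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    top_k = max(1, int(len(counts) * top_fraction))
    return sum(counts[:top_k]) / total

def zero_message_rate(messages, all_recipients):
    """Fraction of recipients who received no messages at all."""
    received = Counter(r for _, r in messages)
    recipients = list(all_recipients)
    return sum(1 for r in recipients if received[r] == 0) / len(recipients)

# Toy log: ten women, eight of ten messages go to a single profile.
women = ["w%d" % i for i in range(10)]
messages = [("m%d" % i, "w0") for i in range(8)] + [("m8", "w1"), ("m9", "w2")]

share = top_share(messages, women, top_fraction=0.1)
silent = zero_message_rate(messages, women)
print(share, silent)  # 0.8 0.7
```

Watching both numbers alongside the total number of messages sent shows when a gain in the headline metric comes at the cost of a more skewed distribution.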

The second attempt was the academic Reciprocal Recommendation algorithm, aiming to consider both user and item (person) experience. This effort largely failed because the algorithm’s assumptions did not hold in practice.
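A common formulation in the reciprocal‑recommendation literature combines the two directional preference scores with a harmonic mean, so a match ranks highly only when both sides are likely to be interested. The sketch below assumes that formulation and a hypothetical `preference(a, b)` function returning a score in [0, 1]; the article does not specify which variant the team tried.

```python
def reciprocal_score(a_likes_b, b_likes_a):
    """Harmonic mean of the two directional scores; zero if either side is zero."""
    if a_likes_b <= 0 or b_likes_a <= 0:
        return 0.0
    return 2 * a_likes_b * b_likes_a / (a_likes_b + b_likes_a)

def rank_reciprocal(user, candidates, preference):
    """Order candidates by mutual interest rather than one-sided interest."""
    scored = [
        (reciprocal_score(preference(user, c), preference(c, user)), c)
        for c in candidates
    ]
    return [c for s, c in sorted(scored, key=lambda sc: (-sc[0], sc[1]))]

# Hypothetical directional preference scores for illustration only.
prefs = {
    ("m1", "w1"): 0.9, ("w1", "m1"): 0.1,  # strong one-sided interest
    ("m1", "w2"): 0.6, ("w2", "m1"): 0.6,  # moderate mutual interest
    ("m1", "w3"): 0.8, ("w3", "m1"): 0.0,  # no interest in return
}

def preference(a, b):
    return prefs.get((a, b), 0.0)

ranking = rank_reciprocal("m1", ["w1", "w2", "w3"], preference)
print(ranking)  # ['w2', 'w1', 'w3']
```

The approach depends on having a usable estimate of each directional preference, which is the kind of assumption that can fail on real data.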

By 2013 the team grew to six or seven members, but only about two people continued focusing on recommendation algorithms.

Engineering Year: 2014

Starting in late 2013, I realized that overly academic algorithms could not satisfy business requirements, so I re‑oriented the team toward feature engineering and kept the algorithm simple: logistic regression. Logistic regression is easy to interpret, has solid open‑source implementations, and let the team accumulate experience in data handling and business understanding.
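To show why logistic regression is easy to interpret, here is a minimal batch gradient descent version in plain Python. The two features (`same_city`, a normalized age gap) and the toy labels are invented for illustration; in practice an established open‑source implementation would be used rather than this hand‑rolled loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features: [same_city, normalized_age_gap]; label 1 = message sent.
X = [[1, 0.1], [1, 0.2], [0, 0.8], [0, 0.9], [1, 0.3], [0, 0.7]]
y = [1, 1, 0, 0, 1, 0]

w, b = train_logistic_regression(X, y)
p_close = predict_proba(w, b, [1, 0.1])  # same city, small age gap
p_far = predict_proba(w, b, [0, 0.9])    # different city, large age gap
```

Each learned weight maps directly to one feature: a positive weight means the feature raises the predicted probability, which makes it straightforward to check a model against business intuition.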

We treated each step of the conversion funnel (candidate generation → ranking → post‑processing) as an independent problem, applying logistic regression to each. This tactical choice separated the “operational” needs (candidate generation) from the “user” needs (ranking), simplifying optimization.

We also designed a system flow where the first step satisfies operational goals to produce a candidate set, and the second step ranks the candidates according to user preferences. This separation allowed independent optimization of each stage.

2014 was undeniably the engineering year.

Product Year: 2015

In 2014 the tactical focus on feature engineering yielded significant metric improvements, many exceeding 50%. However, we realized that operational and user needs are intertwined; optimizing one often impacts the other.

We adopted a “candidate generation → ranking → post‑processing” pipeline, similar to search and advertising systems. The candidate generation stage retrieves a set of potential matches based on mutual criteria or algorithms such as kNN or AR. The ranking stage maximizes product goals, which include both operational KPIs and user satisfaction.
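The three stages can be sketched as three small functions chained together. Everything specific here is an assumption for illustration: the profile fields, the mutual age criterion used for candidate generation, the age‑difference scoring function, and the daily exposure cap used as a post‑processing rule.

```python
def generate_candidates(user, profiles):
    """Keep profiles where each side falls inside the other's stated age range."""
    return [
        p for p in profiles
        if p["id"] != user["id"]
        and user["min_age"] <= p["age"] <= user["max_age"]
        and p["min_age"] <= user["age"] <= p["max_age"]
    ]

def rank(user, candidates, score):
    """Order candidates by a scoring function, highest first (stable for ties)."""
    return sorted(candidates, key=lambda p: score(user, p), reverse=True)

def post_process(ranked, exposure_today, max_exposure, top_n):
    """Rule-based step: drop profiles that already hit today's exposure cap."""
    kept = [p for p in ranked if exposure_today.get(p["id"], 0) < max_exposure]
    return kept[:top_n]

def recommend(user, profiles, score, exposure_today, max_exposure=100, top_n=2):
    candidates = generate_candidates(user, profiles)
    ranked = rank(user, candidates, score)
    return post_process(ranked, exposure_today, max_exposure, top_n)

user = {"id": "u", "age": 30, "min_age": 25, "max_age": 32}
profiles = [
    {"id": "a", "age": 28, "min_age": 28, "max_age": 35},
    {"id": "b", "age": 26, "min_age": 31, "max_age": 40},  # user is too young for b
    {"id": "c", "age": 31, "min_age": 25, "max_age": 33},
    {"id": "d", "age": 29, "min_age": 27, "max_age": 31},
]

def score(u, p):
    # Placeholder for a trained ranking model: prefer a small age difference.
    return -abs(u["age"] - p["age"])

result = recommend(user, profiles, score, exposure_today={"c": 100})
print([p["id"] for p in result])  # ['d', 'a']
```

Keeping the stages as separate functions lets rules, candidate sources, and the ranking model change independently, which matches the way search and advertising pipelines are usually organized.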

For 2015 we adjusted our strategy:

Separate operational requirements that have low coupling with user needs and control them with rules, because Jiayuan’s business logic is too complex for pure algorithmic solutions.

Optimize coupled operational and user requirements jointly, so that the ranking system considers both at once and the candidate generation stage reflects this integration.

Thus 2015 became the “product year” of the recommendation system, with the explicit goal of optimizing product objectives rather than merely serving users.

Key Takeaways

Technology serves the product, not directly the user.

Data quality is the foundation; ensuring high quality is challenging.

Defining correct optimization metrics is difficult.

Business understanding > engineering implementation.

Data > system > algorithm.

Rapid experimentation and iteration.

Technical Experiments

Dirichlet Process and Dirichlet Process Mixture Model: Non‑parametric Bayesian clustering that can automatically infer the number of clusters, though it has many practical drawbacks.

Latent Dirichlet Allocation (LDA): Used for text clustering; applied to message data to discover a clear‑cut cluster of users who send messages to divorced partners, illustrating that LDA may be better for preprocessing (e.g., dimensionality reduction) than direct clustering.

Alternating Direction Method of Multipliers (ADMM): An optimization framework that decomposes large problems into distributed sub‑problems; powerful but with many caveats.
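The decomposition idea behind ADMM can be seen in its global consensus form on a deliberately tiny problem: each "worker" i holds one number a_i and a local objective 0.5 * (x - a_i)^2, and all workers must agree on a shared value z. The toy objective and the function name are choices made for this sketch; the minimizer is simply the mean of the a_i, which makes the result easy to check.

```python
def consensus_admm(a, rho=1.0, iterations=100):
    """Global-consensus ADMM for minimizing sum_i 0.5 * (x - a_i)^2.

    x[i] is worker i's local copy, z the shared consensus variable,
    u[i] the scaled dual variable that penalizes disagreement with z.
    """
    n = len(a)
    x = [0.0] * n
    u = [0.0] * n
    z = 0.0
    for _ in range(iterations):
        # Local step (parallelizable): each worker solves its own small problem.
        x = [(a_i + rho * (z - u_i)) / (1.0 + rho) for a_i, u_i in zip(a, u)]
        # Global step: average the local solutions plus duals.
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # Dual step: accumulate each worker's remaining disagreement.
        u = [u_i + x_i - z for u_i, x_i in zip(u, x)]
    return z, x

z, x = consensus_admm([1.0, 2.0, 6.0])
print(round(z, 6))  # 3.0
```

Only the averaging step needs communication between workers, which is what makes the method attractive for distributed training; the choice of rho and the number of iterations are among the practical caveats.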

Deep Learning: Tested common DL models on user avatar data to predict gender, achieving 87% accuracy on a small training set, but the high cost of hyper‑parameter tuning makes it impractical without strong compute resources.

GBDT‑based Feature Construction: Followed Facebook’s approach of using Gradient Boosted Decision Trees to generate new features; proved reliable when computational efficiency is acceptable.

Feature Hashing: A technique for handling massive numbers of sparse features; not yet adopted in our pipeline.
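A minimal sketch of the hashing trick, assuming string feature names and a fixed number of buckets: each name is hashed to a vector index, and a second hash bit picks a sign so that colliding features tend to cancel rather than pile up. The feature names, the bucket count, and the use of MD5 as the hash are illustrative choices.

```python
import hashlib

def hash_features(features, n_buckets=16):
    """Map a dict of {feature_name: value} to a fixed-length dense vector."""
    vec = [0.0] * n_buckets
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        # First 8 bytes choose the bucket; a separate byte chooses the sign.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        sign = 1.0 if digest[8] & 1 == 0 else -1.0
        vec[index] += sign * value
    return vec

features = {"city=beijing": 1.0, "age_bucket=25-30": 1.0, "has_photo": 1.0}
vec = hash_features(features, n_buckets=8)
single = hash_features({"city=beijing": 1.0}, n_buckets=8)
```

The vector length stays fixed no matter how many distinct feature names appear, so no feature dictionary has to be stored or kept in sync; the price is occasional collisions between unrelated features.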

Imbalanced Data Sampling: Applied sampling methods from literature to address severe class imbalance; showed modest benefit when data volume is too large to train directly.
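One widely used sampling method for this situation is negative downsampling: keep every positive example, keep each negative with probability `keep_rate`, train on the smaller set, then correct the predicted probabilities for the sampling. The sketch below assumes that method; the article does not name the specific techniques tried, and the data is synthetic.

```python
import random

def downsample_negatives(rows, keep_rate, seed=0):
    """Keep all positives (label 1) and a random keep_rate share of negatives."""
    rng = random.Random(seed)
    return [(x, y) for x, y in rows if y == 1 or rng.random() < keep_rate]

def recalibrate(p, keep_rate):
    """Map a probability predicted on downsampled data back to the original scale.

    Downsampling negatives inflates the odds by 1 / keep_rate, so the
    corrected probability is p / (p + (1 - p) / keep_rate).
    """
    return p / (p + (1.0 - p) / keep_rate)

# Synthetic data: 10 positives among 1000 rows (1% positive rate).
rows = [(i, 1 if i < 10 else 0) for i in range(1000)]
sampled = downsample_negatives(rows, keep_rate=0.1)

n_pos = sum(1 for _, y in sampled if y == 1)
n_neg = sum(1 for _, y in sampled if y == 0)
corrected = recalibrate(0.5, keep_rate=0.1)  # 0.5 on sampled data is about 0.09 originally
```

Without the recalibration step, a model trained on the sampled data systematically overestimates the positive rate, which matters whenever the scores are compared against thresholds or combined with other probabilities.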

End.
