
Overview of Sogou Information Feed Recommendation Algorithms

This article summarizes Sogou's information‑feed recommendation system, covering the architecture from data collection and NLP processing to recall, ranking, and feedback, and detailing the classification, tagging, keyword extraction, and various recall and ranking models such as FastText, TextCNN, collaborative filtering, and wide‑and‑deep learning.

DataFunTalk

DataFun Community – a platform for big‑data and algorithm learning.

This article is edited from Wang Dong’s talk at the DataFunTalk algorithm salon on June 9, where he presented Sogou’s information‑feed recommendation algorithm.

The recommendation system follows a classic architecture: data sources (crawled, partner, or self‑generated media) provide text and video content, which undergo NLP processing to extract abstract topics; a recall layer narrows billions of articles to a few thousand based on user interests; a personalized ranking layer orders the candidates; finally the results are displayed to the client, and user clicks feed back to update the user profile.

In Sogou's system, article NLP is the foundation. Articles are organized along three hierarchical dimensions: a broad classification (e.g., sports, entertainment), finer-grained tags within each classification (e.g., NBA under sports), and keywords, whose granularity sits between classifications and tags and which capture intermediate levels of user interest.

Classification is built with FastText multi-class models covering several hundred categories, including dozens of high-traffic ones, each trained on hundreds of thousands of articles. Concatenating title and content into a single input yields about 93% accuracy; separating title and content into distinct inputs, adding a fully-connected layer, and fusing multiple models raises accuracy to roughly 96%.
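The core of a FastText-style classifier is simple: average the word embeddings of a document, then score each class with a linear layer. The sketch below illustrates only that mechanism in plain Python; the embeddings, weights, and vocabulary are toy values, not Sogou's trained model.

```python
# FastText-style classification sketch: average word embeddings,
# then take the argmax over linear class scores.

def average_embedding(tokens, embeddings, dim=4):
    """Mean of the embeddings of the known tokens."""
    vec = [0.0] * dim
    n = 0
    for t in tokens:
        if t in embeddings:
            for i, v in enumerate(embeddings[t]):
                vec[i] += v
            n += 1
    return [v / n for v in vec] if n else vec

def classify(tokens, embeddings, class_weights):
    """Score each class as a dot product with the document vector."""
    doc = average_embedding(tokens, embeddings)
    scores = {
        label: sum(w * x for w, x in zip(weights, doc))
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get)

# Toy vocabulary: sports words point one way, entertainment the other.
emb = {
    "nba":   [1.0, 0.0, 0.0, 0.0],
    "dunk":  [0.9, 0.1, 0.0, 0.0],
    "movie": [0.0, 0.0, 1.0, 0.0],
    "actor": [0.0, 0.0, 0.9, 0.1],
}
weights = {
    "sports":        [1.0, 0.0, -1.0, 0.0],
    "entertainment": [-1.0, 0.0, 1.0, 0.0],
}

print(classify(["nba", "dunk"], emb, weights))    # sports
print(classify(["movie", "actor"], emb, weights)) # entertainment
```

The title/content separation described above would correspond to computing two such averaged vectors and concatenating them before the linear (or fully-connected) layer.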

Tag prediction uses a TextCNN model trained over tens of thousands of tags, each with tens of thousands of example articles. Unlike FastText classification, an article can receive multiple tags. The model stacks two convolutional layers followed by two fully-connected layers; the base model reaches about 88% tag accuracy, and data augmentation (splitting long articles into multiple samples) combined with model ensembling lifts it to roughly 90%.
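The essential TextCNN operation is a convolution filter sliding over the token-embedding sequence, max-pooled over positions, with an independent sigmoid output per tag for the multi-label case. The sketch below uses one convolutional layer and one output layer (the production model described above stacks two of each); all weights are illustrative toy values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv_max_pool(seq, filt):
    """Apply one filter of width len(filt) over the embedding
    sequence and max-pool the activations across positions."""
    width = len(filt)
    best = float("-inf")
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        act = sum(
            w * x
            for fw, tok in zip(filt, window)
            for w, x in zip(fw, tok)
        )
        best = max(best, act)
    return best

def predict_tags(seq, filters, tag_weights, threshold=0.5):
    """Multi-label prediction: each tag gets its own sigmoid score
    over the pooled features; keep tags above the threshold."""
    features = [conv_max_pool(seq, f) for f in filters]
    out = {}
    for tag, (w, b) in tag_weights.items():
        out[tag] = sigmoid(sum(wi * fi for wi, fi in zip(w, features)) + b)
    return {t for t, s in out.items() if s >= threshold}

# Toy sequence of 2-d token embeddings and one width-2 filter.
seq = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
filters = [[[1.0, 0.0], [1.0, 0.0]]]
tags = {"NBA": ([1.0], 0.0), "movie": ([-1.0], 0.0)}
print(predict_tags(seq, filters, tags))
```

The per-tag sigmoid (rather than a single softmax over all tags) is what lets one article carry several tags at once.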

Keyword extraction cannot be cast as classification because the keyword vocabulary is far too large. Two approaches are used instead: similarity models (TF-IDF, LDA, word2vec) that compute vector similarity between candidate words and the article, and a probabilistic model based on Skip-Gram with hierarchical softmax, which models word transition probabilities and, after Bayesian selection, achieves around 89% accuracy.
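For the similarity-model family, one common formulation is to represent the article as the average of its word vectors and rank candidate keywords by cosine similarity to that document vector. The sketch below assumes pre-trained word vectors (the toy 2-d values stand in for word2vec embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_keywords(doc_tokens, vectors):
    """Rank in-vocabulary tokens by similarity to the document vector
    (the average of all token vectors in the article)."""
    dims = len(next(iter(vectors.values())))
    known = [t for t in doc_tokens if t in vectors]
    doc = [0.0] * dims
    for t in known:
        for i, v in enumerate(vectors[t]):
            doc[i] += v
    doc = [v / len(known) for v in doc]
    scores = {t: cosine(vectors[t], doc) for t in set(known)}
    return sorted(scores, key=scores.get, reverse=True)

vecs = {
    "basketball": [0.9, 0.1],
    "playoffs":   [0.8, 0.2],
    "weather":    [0.1, 0.9],
}
tokens = ["basketball", "playoffs", "basketball", "weather"]
print(rank_keywords(tokens, vecs))  # off-topic "weather" ranks last
```

TF-IDF or LDA variants differ only in how the word and document representations are built; the ranking step is the same.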

Recall algorithms are divided into content‑based, user‑based, and collaborative‑filtering methods. Content‑based recall uses explicit classifications, tags, and keywords (or latent dimensions from LDA). User‑based recall leverages click history. Collaborative filtering employs Item‑based, LFM, NCF, and other techniques, and can handle cold‑start scenarios through rule‑based or model‑based strategies.

CB (content‑based) recall maintains a weighted interest set for each user and predicts CTR using a logistic‑regression (LR) model. Features include article basics (format, length, hot keywords, author level, ingestion time), relevance features (keyword position, frequency, vector similarity), and popularity features (clicks, shares, dislikes, recall‑word heat).
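The LR scoring step reduces to a sigmoid over a weighted feature sum. The sketch below shows that shape with hypothetical feature names drawn from the categories above (relevance, popularity, negative feedback); the weights are illustrative, not Sogou's trained values.

```python
import math

def predict_ctr(features, weights, bias=0.0):
    """Logistic-regression CTR estimate for one (user, article) pair."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights over the feature groups described above.
weights = {
    "keyword_in_title": 1.2,   # relevance: recall word appears in title
    "article_heat":     0.8,   # popularity: normalized click count
    "dislikes":        -1.5,   # negative feedback signal
}
features = {"keyword_in_title": 1.0, "article_heat": 0.6, "dislikes": 0.0}
ctr = predict_ctr(features, weights, bias=-2.0)
print(round(ctr, 3))
```

In recall, this score is used to order each user's candidate set drawn from their weighted interest set, before the heavier personalized ranking stage.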

Collaborative‑filtering recall builds a user‑item matrix; similarity between items (or between queries and items) drives recommendations. Matrix factorization transforms the matrix into user and item latent vectors, while NCF uses neural networks with shared embeddings to learn user‑item similarity.
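Matrix factorization can be sketched as plain SGD: learn user and item latent vectors so their dot product approximates the observed entries of the user-item matrix. Dimensions, learning rate, and the toy click data below are illustrative assumptions.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01,
              epochs=200, seed=0):
    """SGD matrix factorization over (user, item, rating) triples.
    Returns user latent vectors U and item latent vectors V."""
    rnd = random.Random(seed)
    U = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step
                V[i][f] += lr * (err * uf - reg * vf)  # with L2 reg
    return U, V

# Toy implicit-feedback matrix: (user, item, click strength).
data = [(0, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0)]
U, V = factorize(data, n_users=3, n_items=3)
pred = sum(U[0][f] * V[0][f] for f in range(2))
print(round(pred, 2))  # close to the observed 1.0
```

NCF replaces this fixed dot product with a neural network over the same shared user and item embeddings, letting the model learn a more flexible interaction function.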

The goal of recall is to shrink a billions‑scale pool to a manageable candidate set for personalized ranking. Ranking progresses from rule‑based methods to LR‑based CTR prediction, then to GBDT+LR, FTRL (online LR), and finally deep models such as wide‑and‑deep, which combine memorization (wide) and generalization (deep) capabilities.

Wide‑and‑deep models integrate logistic regression with deep neural networks; feature engineering covers article features (NLP, images, length, freshness), user features (interest, demographics), and cross features (e.g., user interest × content relevance). FM cross features share embeddings with the deep part, allowing both wide and deep representations without increasing parameters.
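A wide-and-deep forward pass can be sketched as two logits summed before a sigmoid: the wide side is a linear score over sparse cross features (memorization), the deep side a small MLP over dense embeddings (generalization). All weights, layer sizes, and the cross-feature name below are toy assumptions.

```python
import math

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    """One dense layer: W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def wide_deep_ctr(cross_feats, wide_w, dense, W1, b1, w2, b2):
    # Wide: memorization via a linear score over active cross features.
    wide_logit = sum(wide_w.get(f, 0.0) for f in cross_feats)
    # Deep: generalization via one hidden layer over dense inputs.
    hidden = relu(linear(dense, W1, b1))
    deep_logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit)))

# Illustrative cross feature (user interest x content category).
wide_w = {"user_likes_sports&article_is_nba": 1.5}
W1 = [[0.5, -0.2], [0.3, 0.8]]
b1 = [0.0, 0.1]
w2 = [0.6, -0.4]
b2 = -1.0
p = wide_deep_ctr({"user_likes_sports&article_is_nba"}, wide_w,
                  [0.9, 0.2], W1, b1, w2, b2)
print(round(p, 3))
```

With the cross feature active the predicted CTR rises above the deep-only baseline, which is exactly the memorization effect the wide part contributes.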

In summary, article NLP is the cornerstone of recommendation, feeding into user profiling and feature crossing. Multiple recall strategies and a blend of traditional and deep ranking models are employed; early‑stage products may use simpler methods, but as the system evolves, multi‑model fusion, diversity handling, cold‑start solutions, and quality control become essential.

—END

Tags: machine learning, recommendation, ranking, NLP, Sogou, information feed
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
