Semantic Text Understanding for NetEase News Feed Recommendation

NetEase improves its news feed recommendation with a multi-stage semantic text understanding pipeline covering lexical analysis, hierarchical content tagging, and quality filtering. The pipeline combines two-level classifiers, LDA-based topic modeling, multi-label concept tagging, entity extraction, and dense vector representations to capture user interests and improve personalization.

NetEase Media Technology Team

Jun 12, 2020

The article introduces NetEase's news feed recommendation scenario and explains how text semantic understanding is applied to improve personalized content delivery.

Business Background : NetEase's news client offers diverse content formats (articles, videos, live streams, Q&A) and uses a recommendation system to match content to users' reading interests and click behavior. Traditional recommendation approaches include collaborative filtering and content-based methods, both of which rely heavily on feature engineering.

Feature Types : Three major feature groups are described – content features (structured representations of text, images, video), user features (demographic and behavioral attributes), and context features (time and location). Accurate extraction of content features is emphasized as the foundation for building reliable user and item profiles.

Semantic Understanding Architecture : The solution consists of three modules:

Basic Services – lexical analysis (segmentation, POS tagging, entity recognition, synonym handling) and knowledge graph support.

Content Understanding – transforms raw text into structured representations such as categories, topics, semantic tags.

Quality Understanding – filters low‑quality or inappropriate news (duplicate detection, profanity, ad detection).

The focus of the article is on the Content Understanding module.

Hierarchical Interest Feature System : Features are organized by granularity: Category → Topic → Point‑of‑Interest (POI) → Regular Tag → Keyword. The system evolved from only categories and keywords to this multi‑level taxonomy to better capture user interests.
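As a rough illustration of how such a multi-level taxonomy could feed a user profile, the sketch below counts the tags of clicked articles at each granularity level. The `ArticleProfile` class, the `build_user_profile` function, the level names used as dictionary keys, and the sample data are all hypothetical and are not taken from NetEase's system.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

# Granularity levels named in the article, ordered coarse to fine.
LEVELS = ("category", "topic", "poi", "tag", "keyword")


@dataclass
class ArticleProfile:
    """Hypothetical container for one article's tags, keyed by level."""
    article_id: str
    tags: dict = field(default_factory=dict)  # level -> list of tag strings


def build_user_profile(clicked):
    """Count how often each tag occurs, per level, in a user's clicked articles."""
    profile = defaultdict(Counter)
    for article in clicked:
        for level in LEVELS:
            profile[level].update(article.tags.get(level, []))
    return profile


# Made-up click history for illustration only.
clicked = [
    ArticleProfile("a1", {"category": ["sports"], "topic": ["basketball"],
                          "keyword": ["playoffs", "finals"]}),
    ArticleProfile("a2", {"category": ["sports"], "topic": ["football"],
                          "keyword": ["finals"]}),
]
user = build_user_profile(clicked)
print(user["category"].most_common(1))  # [('sports', 2)]
```

Keeping one counter per level lets a recommender fall back to a coarser level (for example, category) when a user has too few signals at a finer one (for example, keyword).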

Implementation Details :

Text Classification : Two‑level classifiers. Level‑1 (coarse categories) uses complex models (initially XGBoost, later TextCNN, fastText, BERT). Level‑2 (fine categories) uses lightweight models (LightGBM, Naïve Bayes) due to the large number of classifiers.
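The two-level routing can be sketched as follows: a level-1 model picks the coarse category, and the article is then passed to that category's own level-2 model. This is a minimal sketch under stated assumptions. A tiny hand-written Naive Bayes stands in for every model (including level 1, where the article reports heavier models such as TextCNN or BERT), and the training data and category names are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter(labels)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented): (tokens, coarse category, fine category).
TRAIN = [
    (["nba", "playoffs", "dunk"], "sports", "basketball"),
    (["lakers", "nba", "rebound"], "sports", "basketball"),
    (["goal", "striker", "league"], "sports", "football"),
    (["penalty", "goal", "keeper"], "sports", "football"),
    (["chip", "smartphone", "launch"], "tech", "hardware"),
    (["gpu", "chip", "benchmark"], "tech", "hardware"),
    (["app", "update", "release"], "tech", "software"),
    (["browser", "app", "bug"], "tech", "software"),
]

# One level-1 model over all data; one level-2 model per coarse category.
level1 = NaiveBayes().fit([t for t, _, _ in TRAIN], [c for _, c, _ in TRAIN])
level2 = {}
for coarse in {c for _, c, _ in TRAIN}:
    rows = [(t, f) for t, c, f in TRAIN if c == coarse]
    level2[coarse] = NaiveBayes().fit([t for t, _ in rows], [f for _, f in rows])


def classify(tokens):
    """Route through level 1, then through the matching level-2 classifier."""
    coarse = level1.predict(tokens)
    return coarse, level2[coarse].predict(tokens)


print(classify(["nba", "dunk"]))  # ('sports', 'basketball')
```

Because each coarse category gets its own level-2 model, the number of fine-grained classifiers grows with the taxonomy, which is consistent with the article's choice of lightweight models at that level.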

Topic Modeling : LDA-based pipeline. Word2Vec is trained on massive corpora to obtain word vectors; topics are derived per top-level category and refined with K-Means clustering, forming a three-level topic tree. At inference time, each article vector is computed as a weighted average of its keyword vectors and matched against the topic vectors.
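The inference step can be illustrated with a small sketch: build the article vector as a weighted average of keyword vectors, then pick the topic whose vector is most similar. Several details here are assumptions rather than facts from the article: cosine similarity as the matching measure, keyword weights that could come from a scorer such as TF-IDF, and the two-dimensional toy vectors.

```python
import math


def weighted_average(keywords, word_vectors):
    """keywords: list of (word, weight). Words without a vector are skipped."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in keywords:
        vec = word_vectors.get(word)
        if vec is None:
            continue
        for i, x in enumerate(vec):
            total[i] += weight * x
        weight_sum += weight
    if weight_sum == 0:
        return total  # no known keywords: return the zero vector
    return [x / weight_sum for x in total]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_topic(article_vec, topic_vectors):
    """Return the topic name whose vector has the highest cosine similarity."""
    return max(topic_vectors, key=lambda t: cosine(article_vec, topic_vectors[t]))


# Made-up 2-D vectors; real Word2Vec vectors have hundreds of dimensions.
word_vectors = {"nba": [0.9, 0.1], "dunk": [0.8, 0.2], "chip": [0.1, 0.9]}
topic_vectors = {"basketball": [1.0, 0.0], "semiconductors": [0.0, 1.0]}
keywords = [("nba", 0.6), ("dunk", 0.3), ("chip", 0.1)]

article_vec = weighted_average(keywords, word_vectors)
print(match_topic(article_vec, topic_vectors))  # basketball
```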

Concept Tag Extraction : Treated as a multi‑label classification problem. Models similar to text classification (XGBoost, TextCNN, BERT) are used, with the final layer switched from Softmax to Sigmoid and thresholds applied to obtain multiple tags per article.
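The effect of swapping Softmax for Sigmoid can be seen in a few lines. Softmax makes the labels compete for a total probability of 1, whereas Sigmoid scores each label independently so that several can clear the threshold at once. The logits, tag names, and the 0.5 default threshold below are illustrative assumptions, not values from the article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def predict_tags(logits, labels, threshold=0.5):
    """Keep every label whose independent sigmoid score reaches the threshold."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]


# Hypothetical final-layer outputs for one article.
labels = ["electric vehicles", "stock market", "celebrity gossip"]
logits = [2.1, 0.4, -3.0]

print(predict_tags(logits, labels))                 # ['electric vehicles', 'stock market']
print(predict_tags(logits, labels, threshold=0.7))  # ['electric vehicles']
```

Raising the threshold trades recall for precision, so in practice it would likely be tuned on held-out data, possibly per tag.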

Entity Tag Extraction : Named entities are recognized with a supervised BERT+BiLSTM+CRF model. Ordinary entities are extracted without supervision through a pipeline of TF-IDF + TextRank for keyword extraction, Word2Vec for vectorization, dictionary matching, and rule-based filtering.
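A simplified version of the unsupervised path can be sketched as TF-IDF keyword scoring followed by dictionary matching and a stopword rule. This is only a partial sketch: the TextRank and Word2Vec steps are omitted, the smoothed IDF formula is one common variant chosen for illustration, and the corpus, entity dictionary, and stopword list are invented.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=4):
    """Rank the words of one document by TF-IDF against a small corpus."""
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # document frequency: count each word once per doc
    tf = Counter(doc_tokens)
    scores = {}
    for word, count in tf.items():
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1  # smoothed IDF
        scores[word] = (count / len(doc_tokens)) * idf
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def filter_entities(candidates, entity_dict, stopwords):
    """Dictionary matching plus a simple rule: drop stopwords."""
    return [w for w in candidates if w in entity_dict and w not in stopwords]


# Invented toy corpus, dictionary, and stopword list.
corpus = [
    ["the", "lakers", "beat", "the", "celtics"],
    ["the", "market", "fell", "today"],
    ["the", "lakers", "signed", "a", "guard"],
]
entity_dict = {"lakers", "celtics"}
stopwords = {"the", "a"}

candidates = tfidf_keywords(corpus[0], corpus)
print(candidates)                                           # ['the', 'beat', 'celtics', 'lakers']
print(filter_entities(candidates, entity_dict, stopwords))  # ['celtics', 'lakers']
```

In this toy corpus the frequent word "the" still ranks first on raw TF-IDF, which shows why the dictionary and rule-based stages are needed after keyword scoring.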

Vector Representation : Explicit tags have clear semantics but are costly to maintain by hand. Implicit vector representations, which progressed from sparse one-hot encodings to dense Word2Vec and then BERT/ERNIE embeddings, are used for similarity measurement and as inputs to deep learning recommendation models.
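A short comparison shows why dense vectors suit similarity measurement better than one-hot encodings. Any two distinct one-hot vectors have a dot product of zero, so related words look no closer than unrelated ones. The dense values below are made up for illustration and do not come from a trained model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


vocab = ["basketball", "nba", "stock"]

# One-hot: each word gets its own axis, so distinct words never overlap.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Dense: made-up 2-D embeddings where related words point in similar directions.
dense = {"basketball": [0.9, 0.1], "nba": [0.8, 0.2], "stock": [0.1, 0.9]}

print(dot(one_hot["basketball"], one_hot["nba"]))  # 0.0
print(dot(dense["basketball"], dense["nba"]) > dot(dense["basketball"], dense["stock"]))  # True
```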

Conclusion : The semantic tagging system and its algorithms play a crucial role in modeling user interests and content attributes, delivering strong performance in online experiments. Future work will combine user behavior signals and knowledge graphs to further enhance content understanding.