How Content Features Power Modern Recommendation Systems
Content features transform unstructured entities like articles, images, and videos into structured descriptors—such as categories, tags, and keywords—enabling precise search recall, personalized recommendations, and effective labeling through methods like classification, convergent tags, keyword extraction, and both manual and automated annotation.
In the previous article we introduced recommendation systems from a global perspective; this article details an important module—content features.
What Are Content Features
Content features describe real-world entities using structured information, turning complex content into simple descriptors. For example, an article’s content features include its topic, keywords, and category; an image’s content features include the objects it depicts, its brightness, and its tone.
The Significance of Content Features
Content features abstract structured data from semi-structured and unstructured entities, providing precise descriptions.
As summaries of posts or videos, content features improve search recall and ranking: instead of merely matching keywords and sorting by time, results can be ordered by relevance, using the importance of the search term within each item's content features.
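A minimal sketch of that idea, assuming each post already carries a dictionary of weighted content features. The index, the post IDs, and the `rank_by_feature` helper are all hypothetical names chosen for illustration, not part of any system described here.

```python
from typing import Dict, List, Tuple

# Hypothetical index: post ID -> {content feature: weight}.
POST_FEATURES: Dict[str, Dict[str, float]] = {
    "post_a": {"car": 0.63, "white": 0.10},
    "post_b": {"car": 0.15, "rental": 0.70},
    "post_c": {"furniture": 0.80},
}


def rank_by_feature(
    term: str, index: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Recall posts whose features contain `term`, ordered by its weight."""
    hits = [(post_id, feats[term]) for post_id, feats in index.items() if term in feats]
    # Highest weight first: the term matters most to these posts.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


results = rank_by_feature("car", POST_FEATURES)
print(results)
```

Posts without the feature are not recalled at all, and the rest are ordered by how central the term is to each post rather than by posting time.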
Content features also serve as a precursor to user tags for recommendation. By aggregating content features of posts a user interacts with, we can infer user interests, such as labeling a user as “car enthusiast” if many of their viewed posts have a “car” content feature.
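One way to sketch this aggregation, assuming each viewed post is reduced to a list of content features. The `min_share` threshold and the `infer_user_tags` helper are illustrative assumptions; a production system would likely also weight by interaction type and recency.

```python
from collections import Counter
from typing import Dict, List


def infer_user_tags(
    viewed_post_features: List[List[str]], min_share: float = 0.5
) -> Dict[str, float]:
    """Return features that appear in at least `min_share` of the viewed posts."""
    if not viewed_post_features:
        return {}
    # Count each feature at most once per post.
    counts = Counter(f for feats in viewed_post_features for f in set(feats))
    total = len(viewed_post_features)
    return {f: c / total for f, c in counts.items() if c / total >= min_share}


# Hypothetical viewing history: three of the four posts carry the "car" feature.
viewed = [["car", "white"], ["car", "used"], ["car"], ["furniture"]]
user_tags = infer_user_tags(viewed)
print(user_tags)
```

Here only "car" clears the threshold, so the user would be a candidate for a "car enthusiast" label.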
Dimensions and Computation Methods of Content Features
Classification
Classification uses categorical data to represent entity features.
Classified-listing sites naturally have hierarchical taxonomies, from broad categories like lifestyle services down to fine‑grained subcategories, which provide ready‑made content features.
News sites classify content into entertainment, finance, military, etc., and also by region, all of which serve as content features.
Tags
Tags are generic descriptors that can be keywords, categories, or key characteristics. An unbounded tag vocabulary becomes hard to manage and dilutes recall, so a limited set of convergent tags is defined based on business needs, such as top tags or custom tags for specific scenarios.
Convergent Tags
Convergent tags represent content features through a multi‑step process, illustrated with an example from Baixing.com:
Step 1: Define the tags needed for the business.
Step 2: Identify articles containing these tags, extract keywords, and compute normalized weights using TF‑IDF or Word2Vec. Example: For posts containing the “car” tag, we obtain:
car: {"Japanese make": 0.12, "bearing": 0.13, "white": 0.09, "chassis": 0.21}
Step 3: Tokenize a specific post (ID 91367) and count term frequencies:
91367: {"chassis": 1, "Japanese make": 2, "90% new": 1, "white": 2}
Step 4: Match the post's term frequencies against the tag's keywords and sum the weighted scores to obtain the final tag relevance. For the example:
0.12×2 + 0.09×2 + 0.21×1 = 0.63 → car: 0.63
The overall process is illustrated below:
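Steps 3 and 4 can be sketched in a few lines of Python. The keyword weights and the post's tokens are taken from the example above, with the terms written in English; the `tag_relevance` function and the variable names are hypothetical, and the Step 2 weights are assumed to have been computed already.

```python
from collections import Counter
from typing import Dict, Iterable

# Step 2 output (from the example): keyword weights for the "car" tag.
TAG_KEYWORDS: Dict[str, Dict[str, float]] = {
    "car": {"Japanese make": 0.12, "bearing": 0.13, "white": 0.09, "chassis": 0.21},
}


def tag_relevance(
    tokens: Iterable[str], tag_keywords: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Score each tag as the sum of keyword weight times term frequency."""
    term_freq = Counter(tokens)  # Step 3: count term frequencies
    scores: Dict[str, float] = {}
    for tag, weights in tag_keywords.items():
        # Step 4: keywords missing from the post contribute zero.
        score = sum(weight * term_freq[kw] for kw, weight in weights.items())
        if score > 0:
            scores[tag] = round(score, 4)
    return scores


# Tokens of post 91367, matching the term frequencies in the example.
post_91367 = ["chassis", "Japanese make", "Japanese make", "90% new", "white", "white"]
print(tag_relevance(post_91367, TAG_KEYWORDS))
```

The token "90% new" is not a keyword of the "car" tag, so it is ignored, and "bearing" does not occur in the post, so it adds nothing; the remaining three terms reproduce the 0.63 computed by hand.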
Keywords
Keywords summarize a text segment by extracting representative words, using algorithms such as TF‑IDF, TextRank, or simple term frequency.
An example of keyword extraction is shown below:
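Separately from the article's own illustration, here is a minimal TF-IDF sketch in plain Python. It assumes the text is already tokenized and uses a smoothed IDF, log((1 + N) / (1 + df)), which is one common variant chosen here as an assumption; the toy corpus and the `tfidf_keywords` helper are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_keywords(
    doc: List[str], corpus: List[List[str]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Score each term of `doc` by TF-IDF against `corpus`; return the top_k."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in corpus for term in set(d))
    term_freq = Counter(doc)
    scores = {
        term: (count / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical tokenized posts.
corpus = [
    ["used", "car", "white", "chassis", "car"],
    ["used", "sofa", "white"],
    ["used", "bike", "red"],
]
keywords = tfidf_keywords(corpus[0], corpus)
print(keywords)
```

The term "used" occurs in every post, so its IDF is zero and it drops out of the keyword list, while "car" ranks first because it is both frequent in the post and rare in the corpus.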
Annotation of Content Features
Three Methods
Manual Annotation: Domain experts read articles or watch videos and provide abstract summaries. This yields high accuracy but is costly and not scalable; it is often used for initial corpus creation and as positive samples for models.
User Annotation: Users optionally select or input tags when posting content. Accuracy varies with user diligence.
Automatic Annotation: Automatic annotation leverages keyword extraction for text and, for videos, combines image frame recognition to generate tags.
Comparison of the Three Methods
In terms of accuracy: Manual > Automatic > User. In terms of controllability: Automatic > Manual > User. Due to cost considerations, automatic annotation is typically preferred, with manual and user annotations serving as validation references.
Evaluation Methods for Content Features
Two primary evaluation approaches are used:
Manual Sampling: Human reviewers assess the correctness of generated tags. This yields accurate judgments but may suffer from sampling bias and low efficiency.
User Evaluation: After posting, the author reviews the assigned content features to confirm their accuracy.
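Manual sampling can be turned into a single number by estimating tag precision on a random sample. The sketch below is an illustration under stated assumptions: reviewer judgments are supplied as a callback, and `sampled_precision`, `predicted`, and `reviewer_ok` are hypothetical names.

```python
import random
from typing import Callable, Dict, List


def sampled_precision(
    predicted: Dict[str, List[str]],
    is_correct: Callable[[str, str], bool],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate tag precision from a random sample of (post, tag) pairs."""
    pairs = [(post_id, tag) for post_id, tags in predicted.items() for tag in tags]
    if not pairs:
        return 0.0
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    correct = sum(1 for post_id, tag in sample if is_correct(post_id, tag))
    return correct / len(sample)


# Hypothetical generated tags and the reviewer's verdicts on them.
predicted = {"p1": ["car", "white"], "p2": ["car"], "p3": ["sofa"]}
reviewer_ok = {("p1", "car"), ("p2", "car"), ("p3", "sofa")}
precision = sampled_precision(
    predicted, lambda post_id, tag: (post_id, tag) in reviewer_ok, sample_size=10
)
print(precision)
```

Sampling uniformly over (post, tag) pairs reduces one source of bias, but a small sample still gives a noisy estimate, which is the efficiency trade-off noted above.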
Content Feature Extraction Solutions for Different Entity Types
News
Extraction focuses on:
Classification: e.g., entertainment, sports, finance.
Tags: derived using KeyGraph, TF‑IDF, etc.
Keywords: extracted with the same techniques as tags.
Images
Image Classification: using deep learning models such as VGG19, ResNet50.
Image Tags and Keywords: extracted from image titles and surrounding textual descriptions.
Videos
Extraction involves:
Text: titles and descriptions for keywords and tags.
Audio: speech‑to‑text conversion for keyword extraction.
Video Frames: image recognition on key frames to obtain classification tags, accepting some information loss because only the key frames are analyzed.
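The three sources then have to be combined into one tag set per video. The article does not say how this is done; the sketch below assumes a simple weighted sum, and the source weights, the scores, and the `merge_video_tags` helper are all hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical trust weights per source; these are assumptions, not given values.
SOURCE_WEIGHTS: Dict[str, float] = {"text": 0.5, "audio": 0.3, "frames": 0.2}


def merge_video_tags(
    per_source: Dict[str, Dict[str, float]],
    source_weights: Dict[str, float] = SOURCE_WEIGHTS,
) -> Dict[str, float]:
    """Combine per-source tag scores into one ranking via a weighted sum."""
    merged: Dict[str, float] = defaultdict(float)
    for source, tags in per_source.items():
        weight = source_weights.get(source, 0.0)  # unknown sources are ignored
        for tag, score in tags.items():
            merged[tag] += weight * score
    # Return a plain dict ordered from the strongest tag to the weakest.
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical tag scores extracted from each source of one video.
video_tags = merge_video_tags(
    {
        "text": {"car": 1.0, "review": 0.5},
        "audio": {"car": 0.8, "engine": 0.6},
        "frames": {"car": 0.9, "road": 0.4},
    }
)
print(video_tags)
```

A tag confirmed by several sources, such as "car" here, rises to the top, which partly compensates for the information lost in any single source.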
Summary
Content features give structure to unstructured entities, enriching search results and enabling precise user profiling. Implementing them is challenging because real‑world entities are so diverse, and extracting features from images and videos incurs higher computational costs than extracting them from text.
Baixing.com Technical Team