How Content Features Power Modern Recommendation Systems

Content features turn unstructured entities such as articles, images, and videos into structured descriptors such as categories, tags, and keywords. These descriptors support search recall and personalized recommendation, and they are produced through classification, convergent tags, keyword extraction, and manual or automatic annotation.

Baixing.com Technical Team

Nov 29, 2017

In the previous article we introduced recommendation systems from a global perspective; this article details an important module—content features.

What Are Content Features

Content features describe real-world entities using structured information, turning complex content into simple descriptors. For example, an article’s content features include its topic, keywords, and category; an image’s content features include the objects it depicts, its brightness, and its tone.

The Significance of Content Features

Content features abstract structured data from semi-structured and unstructured entities, providing precise descriptions.

As summaries of posts or videos, content features improve search recall and ranking: instead of merely matching keywords and sorting by time, search can order results by relevance, using the weight of the search term within each item’s content features.
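As a minimal sketch of relevance-based ordering, assume each post already carries a mapping from content feature to weight. The post IDs, the weights, and the function name rank_by_feature are hypothetical and only illustrate the idea; a production search engine would combine this signal with many others.

```python
from typing import Dict, List, Tuple

# Hypothetical data: post_id -> {content feature: weight}
POST_FEATURES: Dict[str, Dict[str, float]] = {
    "post_a": {"car": 0.63, "white": 0.10},
    "post_b": {"car": 0.15, "rental": 0.70},
    "post_c": {"apartment": 0.80},
}


def rank_by_feature(
    term: str, post_features: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Recall posts whose content features contain `term`,
    ordered by the weight of `term` (highest first)."""
    hits = [
        (post_id, features[term])
        for post_id, features in post_features.items()
        if term in features
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(rank_by_feature("car", POST_FEATURES))
# [('post_a', 0.63), ('post_b', 0.15)]
```

Posts without the feature are not recalled at all, and the remaining posts are ordered by how strongly the feature describes them rather than by posting time.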

Content features also serve as a precursor to user tags for recommendation. By aggregating content features of posts a user interacts with, we can infer user interests, such as labeling a user as “car enthusiast” if many of their viewed posts have a “car” content feature.
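The aggregation from content features to user tags could look roughly like the sketch below. The threshold min_share, the sample data, and the function name infer_user_tags are assumptions made for illustration; the article does not specify how "many" viewed posts are needed.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical data: post_id -> content features of that post
POST_FEATURES: Dict[str, List[str]] = {
    "p1": ["car", "white"],
    "p2": ["car"],
    "p3": ["apartment"],
    "p4": ["car", "rental"],
}


def infer_user_tags(
    viewed_posts: Iterable[str],
    post_features: Dict[str, List[str]],
    min_share: float = 0.5,
) -> List[str]:
    """Return features that appear in at least `min_share` of the viewed posts."""
    viewed = list(viewed_posts)
    if not viewed:
        return []
    counts: Counter = Counter()
    for post_id in viewed:
        # Count each feature at most once per post.
        counts.update(set(post_features.get(post_id, [])))
    return sorted(f for f, c in counts.items() if c / len(viewed) >= min_share)


print(infer_user_tags(["p1", "p2", "p3", "p4"], POST_FEATURES))
# ['car']
```

Here "car" appears in three of the four viewed posts, so it becomes a user tag, while features seen only once do not.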

Dimensions and Computation Methods of Content Features

Classification

Classification uses categorical data to represent entity features.

Classified-ad sites naturally have hierarchical taxonomies, from broad categories like lifestyle services down to fine‑grained subcategories, which provide ready‑made content features.

News sites classify content into entertainment, finance, military, etc., and also by region, all of which serve as content features.

Convergent Tags

Convergent tags map content onto a fixed set of business‑defined tags through a multi‑step process, illustrated here with an example from Baixing.com:

Step 1: Define the tags needed for the business.

Step 2: Find the posts that carry each tag, extract their keywords, and compute normalized keyword weights using TF‑IDF or Word2Vec. For example, posts carrying the “car” tag (汽车) yield the following weights for 日系 (Japanese make), 轴承 (bearing), 白色 (white), and 底盘 (chassis):

汽车: {"日系": 0.12, "轴承": 0.13, "白色": 0.09, "底盘": 0.21}

Step 3: Tokenize a specific post (here, ID 91367) and count term frequencies (九成新 means “90% new”, i.e. in near-new condition):

91367: {"底盘": 1, "日系": 2, "九成新": 1, "白色": 2}

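Step 4: Match the post’s terms against each tag’s keywords, multiply each matching term’s frequency by its keyword weight, and sum the products to get the tag relevance. Here 日系, 白色, and 底盘 match, while 九成新 is not a “car” keyword and 轴承 does not occur in the post, so neither contributes:

0.12×2 + 0.09×2 + 0.21×1 = 0.63 → 汽车: 0.63

(The original article illustrates the overall process with a diagram here.)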
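Steps 3 and 4 can be sketched in a few lines of Python. The data is taken from the example above; the function name tag_scores and the rounding to two decimals are assumptions for illustration, not the Baixing.com implementation.

```python
from typing import Dict

# Step 2 output: tag -> {keyword: normalized weight} (values from the example)
TAG_KEYWORDS: Dict[str, Dict[str, float]] = {
    "汽车": {"日系": 0.12, "轴承": 0.13, "白色": 0.09, "底盘": 0.21},
}

# Step 3 output: term frequencies of post 91367 (values from the example)
POST_TERMS: Dict[str, int] = {"底盘": 1, "日系": 2, "九成新": 1, "白色": 2}


def tag_scores(
    term_freq: Dict[str, int], tag_keywords: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Score each tag as the sum of (keyword weight x term frequency)
    over the terms that the post and the tag have in common."""
    scores: Dict[str, float] = {}
    for tag, weights in tag_keywords.items():
        score = sum(
            weights[term] * count
            for term, count in term_freq.items()
            if term in weights
        )
        if score > 0:
            scores[tag] = round(score, 2)  # rounding is an illustrative choice
    return scores


print(tag_scores(POST_TERMS, TAG_KEYWORDS))
# {'汽车': 0.63}
```

Terms that are absent from a tag's keyword list, such as 九成新, are simply ignored, so a post only scores on tags whose vocabulary it actually uses.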

Keywords

Keywords summarize a text segment by extracting representative words, using algorithms such as TF‑IDF, TextRank, or simple term frequency.
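As one illustration, TF‑IDF keyword extraction can be sketched in plain Python. The sketch assumes the documents are already tokenized and uses one common variant of the formula (relative term frequency times log(N / document frequency)); the sample corpus and the function name tfidf_keywords are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[List[str]], doc_index: int, top_k: int = 3) -> List[str]:
    """Return the `top_k` highest-scoring TF-IDF terms of docs[doc_index]."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    tokens = docs[doc_index]
    term_freq = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


corpus = [
    ["used", "car", "white", "car", "chassis"],
    ["used", "apartment", "rent"],
    ["used", "phone", "white"],
]
print(tfidf_keywords(corpus, doc_index=0, top_k=2))
# ['car', 'chassis']
```

The term "used" occurs in every document, so its inverse document frequency is zero and it never becomes a keyword, whereas "car" is both frequent in the post and rare in the corpus.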

(The original article shows a keyword extraction example as a figure here.)

Annotation of Content Features

Three Methods

Manual Annotation: Domain experts read articles or watch videos and provide abstract summaries. This yields high accuracy but is costly and does not scale; it is often used to build the initial corpus and to supply positive samples for models.

User Annotation: Users optionally select or enter tags when posting content. Accuracy varies with user diligence.

Automatic Annotation: The system generates tags itself, using keyword extraction for text and, for videos, adding recognition on image frames.

Comparison of the Three Methods

In terms of accuracy: Manual > Automatic > User. In terms of controllability: Automatic > Manual > User. Due to cost considerations, automatic annotation is typically preferred, with manual and user annotations serving as validation references.

Evaluation Methods for Content Features

Two primary evaluation approaches are used:

Manual Sampling: Human reviewers assess the correctness of generated tags. This yields accurate judgments but may suffer from sampling bias and low efficiency.

User Evaluation: After posting, the author reviews the assigned content features to confirm their accuracy.
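For manual sampling, a rough sketch of the bookkeeping is shown below: draw a random sample of posts for review, then estimate tag precision from the reviewers' verdicts. The function names and the normal-approximation confidence interval are assumptions added for illustration; the article does not describe how the sample is drawn or summarized.

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_sample(post_ids: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Pick up to `k` posts uniformly at random for human review."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(list(post_ids), min(k, len(post_ids)))


def precision_with_interval(
    judgments: Sequence[bool], z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (precision, lower, upper) using a normal-approximation
    interval; z = 1.96 corresponds to roughly 95% confidence."""
    n = len(judgments)
    if n == 0:
        raise ValueError("at least one judgment is required")
    p = sum(judgments) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical review result: 80 of 100 sampled tags judged correct.
verdicts = [True] * 80 + [False] * 20
precision, low, high = precision_with_interval(verdicts)
print(f"precision={precision:.2f}, interval=({low:.3f}, {high:.3f})")
```

Drawing the sample uniformly at random is one way to limit the sampling bias mentioned above, and the interval width shows how much a small sample limits the conclusion.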

Content Feature Extraction Solutions for Different Entity Types

News

Extraction focuses on:

Classification: e.g., entertainment, sports, finance.

Tags: derived using KeyGraph, TF‑IDF, etc.

Keywords: extracted with the same techniques as tags.

Images

Image Classification: using deep learning models such as VGG19, ResNet50.

Image Tags and Keywords: extracted from image titles and surrounding textual descriptions.

Videos

Extraction involves:

Text: titles and descriptions for keywords and tags.

Audio: speech‑to‑text conversion for keyword extraction.

Video Frames: image recognition on key frames to obtain classification tags, accepting that sampling only key frames loses some information.
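One way to turn per-frame recognition results into video-level tags is a simple vote over key frames, sketched below. It assumes an image classifier has already produced one label per key frame; the labels, the min_share threshold, and the function name aggregate_frame_labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def aggregate_frame_labels(
    frame_labels: List[str], min_share: float = 0.3
) -> Dict[str, float]:
    """Keep labels that occur in at least `min_share` of the key frames,
    mapped to the share of frames in which they occur."""
    if not frame_labels:
        return {}
    total = len(frame_labels)
    counts = Counter(frame_labels)
    return {
        label: count / total
        for label, count in counts.most_common()
        if count / total >= min_share
    }


# Hypothetical classifier output, one label per key frame.
labels = ["car", "car", "road", "car", "person", "road"]
print(aggregate_frame_labels(labels))
```

In this example "car" and "road" pass the threshold while "person", seen in a single frame, is dropped, which filters out labels that are incidental to the video.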

Summary

Content features give structure to unstructured entities, enriching search results and enabling precise user profiling. Implementing them is challenging because real‑world entities are so diverse, and extracting features from images and videos incurs higher computational costs than extracting them from text.
