
A 2‑Channel CNN Method for Automatic Game Asset Tag Generation and Similarity Recommendation

The paper introduces an improved two‑channel CNN, built on a shared VGG16 backbone and a hinge‑loss metric, to automatically generate numeric tags for game advertising assets by learning content and style similarity, achieving over 97% test accuracy and enabling efficient ad placement and asset management.

37 Interactive Technology Team

Business background: Manual labeling of game advertising assets often leads to insufficient or subjective tags, causing duplicate tags, missing tags, and inaccurate recommendations. These issues affect both ad delivery and asset management.

To address these problems, a new automatic tagging method is proposed. Considering that asset attractiveness depends on content and style, a deep‑learning solution based on an improved 2‑channel network is adopted to generate numeric tags for assets.

Algorithm principle: Tags are one‑hot encoded into matrices, and the goal is for similar assets to have tags that are close in Euclidean distance. The CNN receives a pair of asset images and outputs a similarity distance; pairs from the same asset are positive samples, and pairs from different assets are negative samples.
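The encoding-and-distance idea can be sketched in a few lines of numpy. The six‑tag vocabulary and the tag assignments below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def one_hot_tag(tag_ids, num_tags):
    """Encode a list of tag indices as a one-hot (multi-hot) vector."""
    vec = np.zeros(num_tags, dtype=np.float32)
    vec[tag_ids] = 1.0
    return vec

# Hypothetical vocabulary of 6 tags; assets a and b share 2 of 3 tags,
# while asset c shares none with a.
a = one_hot_tag([0, 2, 4], num_tags=6)
b = one_hot_tag([0, 2, 5], num_tags=6)
c = one_hot_tag([1, 3, 5], num_tags=6)

dist_ab = np.linalg.norm(a - b)  # small distance -> similar assets
dist_ac = np.linalg.norm(a - c)  # larger distance -> dissimilar assets
assert dist_ab < dist_ac
```

The same Euclidean‑distance comparison is what the trained network later produces directly from images, without hand‑assigned tags.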

The 2‑channel network is an evolution of the siamese network. Instead of passing each patch through a separate shared‑weight branch, the two input patches (X₁ and X₂) are stacked into a single dual‑channel tensor (2 × 64 × 64) and processed by one backbone. VGG16’s 13 convolutional layers are reused as the feature extractor, while the fully‑connected layers are redesigned to output the asset tag.
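The stacking step is simple to show with numpy; the random patches stand in for two preprocessed grayscale frames:

```python
import numpy as np

# Hypothetical grayscale patches X1, X2 of size 64x64 (values in [0, 1]).
x1 = np.random.rand(64, 64).astype(np.float32)
x2 = np.random.rand(64, 64).astype(np.float32)

# Stack along the channel axis: the pair becomes one 2-channel input,
# so a single backbone sees both patches at once (no twin branches).
pair = np.stack([x1, x2], axis=0)   # shape (2, 64, 64), channels-first
assert pair.shape == (2, 64, 64)
```

Feeding one combined tensor lets the convolutional filters compare the two patches from the very first layer, which is the main advantage of the 2‑channel design over a siamese pair of branches.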

Input‑output design: Positive samples are two frames from the same asset (Y=1); negative samples are frames from different assets (Y=‑1). The network outputs two feature vectors G(x₁) and G(x₂); their Euclidean distance D(G(x₁), G(x₂)) serves as the similarity metric.

Loss function: The loss must be differentiable and encourage small distances for positive pairs and large distances for negative pairs. A hinge‑loss‑based formulation is used, penalizing distances that do not satisfy the margin m.
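The paper does not spell out the exact formula, but a contrastive‑style hinge loss matching the description above can be sketched as follows: positive pairs (Y=1) are penalized by their squared distance, and negative pairs (Y=‑1) are penalized only when their distance falls inside the margin m:

```python
import numpy as np

def pair_hinge_loss(d, y, margin=1.0):
    """Margin-based pair loss: pull positive pairs (y=+1) together,
    push negative pairs (y=-1) apart until distance exceeds the margin."""
    pos = 0.5 * d ** 2                            # positives: shrink distance
    neg = 0.5 * np.maximum(0.0, margin - d) ** 2  # negatives: enforce margin m
    return np.where(y == 1, pos, neg)

# A closer positive pair costs less than a distant one...
assert pair_hinge_loss(np.float32(0.1), 1) < pair_hinge_loss(np.float32(0.9), 1)
# ...and a negative pair beyond the margin incurs zero loss.
assert pair_hinge_loss(np.float32(1.5), -1) == 0.0
```

Both branches are differentiable almost everywhere, so the loss can be minimized with standard gradient descent.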

Style quantification: While CNNs excel at content extraction, style is captured by computing the Gram matrix of feature maps from a chosen VGG16 layer. The Gram matrix encodes the correlation between feature maps, effectively representing image style.
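The Gram‑matrix computation is standard; a minimal numpy version, with a hypothetical 8‑channel activation standing in for a real VGG16 feature map:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map: features has shape (C, H, W);
    entry (i, j) is the correlation between channels i and j."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)   # normalize by spatial size

# Hypothetical VGG16 activation: 8 channels on a 16x16 grid.
feat = np.random.rand(8, 16, 16).astype(np.float32)
g = gram_matrix(feat)
assert g.shape == (8, 8)
assert np.allclose(g, g.T)    # Gram matrices are symmetric
```

Because the Gram matrix discards spatial positions and keeps only channel correlations, two images with different content but similar textures and color statistics produce similar Gram matrices.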

Algorithm implementation – system pipeline: Use FFmpeg to extract frames from video assets, producing JPG files. Transfer frames via rsync to a deep‑learning server and preprocess them for VGG16 input. Implement the 2‑channel network with TensorFlow/Keras to generate asset tags. Compute and store pairwise Euclidean distances between tags.
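The frame‑extraction step can be sketched by building the FFmpeg command; the paths, output pattern, and 1‑fps sampling rate below are illustrative assumptions, not values from the paper:

```python
# Sketch of the frame-extraction step of the pipeline.
def ffmpeg_frame_cmd(video_path, out_dir, fps=1):
    """Build the ffmpeg command that samples frames from a video as JPGs."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",            # sample `fps` frames per second
        f"{out_dir}/frame_%05d.jpg",    # numbered JPG outputs
    ]

cmd = ffmpeg_frame_cmd("asset_001.mp4", "frames/asset_001")
print(" ".join(cmd))
```

In production the command list would be passed to `subprocess.run`, and the resulting JPG directory synced to the training server with rsync.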

Data preparation includes frame extraction, VGG16‑compatible preprocessing, and construction of positive/negative sample pairs. Training uses the redesigned network with three new fully‑connected layers (ReLU activation) on top of the frozen VGG16 convolutional base.
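A Keras sketch of that redesigned network follows. The dense‑layer widths (512, 256), the 128‑dimensional tag output, and the 3‑channel 64×64 input shape are illustrative assumptions; the paper’s setup would load pretrained ImageNet weights (`weights="imagenet"`), omitted here to keep the sketch self‑contained:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Frozen VGG16 convolutional base plus three new ReLU fully-connected layers.
base = tf.keras.applications.VGG16(
    include_top=False, weights=None, input_shape=(64, 64, 3))
base.trainable = False  # freeze the 13 convolutional layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),  # output used as the numeric tag
])
```

Freezing the convolutional base means only the three new dense layers are trained on the positive/negative pairs, which keeps training fast and reuses VGG16’s generic visual features.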

Training results: After 20 epochs, the model reaches 99.60% accuracy on the training set and 97.47% on the test set. The trained parameters are saved for online inference.

Model deployment: An online TensorFlow/Keras environment loads the saved weights; the final fully‑connected layer output serves as the asset tag, and Euclidean distances are used for similarity ranking.
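The similarity‑ranking step at inference time reduces to sorting stored tags by distance to the query tag. The asset names and 3‑dimensional tags below are illustrative:

```python
import numpy as np

def rank_by_similarity(query_tag, tag_store):
    """Return asset ids sorted by Euclidean distance to the query tag."""
    dists = {aid: float(np.linalg.norm(query_tag - t))
             for aid, t in tag_store.items()}
    return sorted(dists, key=dists.get)

store = {
    "asset_a": np.array([1.0, 0.0, 1.0]),
    "asset_b": np.array([0.9, 0.1, 1.1]),
    "asset_c": np.array([0.0, 1.0, 0.0]),
}
ranking = rank_by_similarity(np.array([1.0, 0.0, 1.0]), store)
print(ranking)  # asset_a (the query itself) first, then its nearest neighbor
```

With precomputed pairwise distances stored alongside the tags, this lookup is a simple sorted retrieval rather than a per‑query model evaluation.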

Applications: Ad placement: Replacing manual tags with algorithmic tags improves A/B test performance (68% of plans with algorithmic tags outperform manual tags). Asset management: Enables similarity‑based asset retrieval, bulk upload with automatic classification, and rapid tag loading via serialization.

Current limitations: (1) Video assets are only represented by static frames, lacking temporal semantics. (2) Rapid style changes within a single asset can degrade tag accuracy. (3) No handling of IP‑specific constraints.

Future outlook: (1) Extend the network to capture temporal information (e.g., 3‑D convolutions over multiple frames). (2) Incorporate more frames and use LSTM to model style evolution. (3) Integrate IP databases for constrained recommendation.

CNN · Deep Learning · Similarity Detection · 2-channel network · game advertising · image tagging
Written by 37 Interactive Technology Team, 37 Interactive Technology Center