How YOLOv3 Boosts Video Content Advertising on Youku: A Real‑World Case Study

By integrating YOLOv3 video object detection into Youku's ad platform, the team replaced traditional subtitle‑based and scene‑based placements with precise object‑level targeting. The result was higher relevance, expanded inventory, and a 20% higher click‑through rate even with 3.5× the exposure.

Alibaba Cloud Developer

Jun 12, 2019

Background

Traditional video site ads rely on pre‑roll, mid‑roll, and post‑roll clips, which cannot be skipped by non‑members and degrade user experience. Content‑based ads that blend with video scenes avoid interrupting viewing and are becoming the mainstream.

Current Youku Content‑Ad Situation

Youku obtains ad slots from subtitle keywords (via OCR + NLP) and scene recognition. Subtitle‑based ads suffer from mismatched semantics and irrelevant placement, while scene recognition is costly, subjective, and hard to extend.

Why Video Object Detection?

Objects lie between subtitles and scenes: they appear even without related subtitles and constitute the basic elements of a scene. Object detection therefore solves subtitle‑text mismatch, improves continuity, simplifies modeling compared to scene recognition, and enables rapid expansion of detectable categories.

Detection Technology Overview

Object detection has evolved from DPM to deep‑learning models. Two main families exist: two‑stage detectors (e.g., Fast/Faster R‑CNN) offering higher accuracy, and one‑stage detectors (e.g., YOLO, SSD) offering higher speed. Recent advances include FPN, RetinaNet, and techniques for small or occluded objects.

Algorithm Selection

Because the business needs both high precision and fast inference, the team chose YOLOv3, which balances accuracy and speed. The resulting model detects 274 commercially relevant object categories.

Model Overview and Optimization

YOLOv3 Architecture

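Backbone: Darknet‑53 (standard convolutions plus 23 residual units). Detection: three branches handle large, medium, and small objects on progressively finer grids; high‑level features are upsampled and merged with lower‑level feature maps before each finer‑scale prediction. Output dimension per branch: [3 × (4 + 1 + 274)] × N × N, where 3 is the number of anchor boxes per grid cell, 4 the box coordinates, 1 the objectness score, 274 the class scores, and N the grid size of that branch.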
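As a quick check of the output dimension above, the following sketch computes the tensor shape of each detection branch. It assumes the standard YOLOv3 strides of 32, 16, and 8 and an input size of 416; the function and constant names are illustrative and do not come from the Youku implementation.

```python
# Illustrative helper; names are hypothetical, not taken from the article.
NUM_ANCHORS_PER_SCALE = 3
NUM_CLASSES = 274
STRIDES = (32, 16, 8)  # standard YOLOv3 strides: large, medium, small objects


def head_shapes(input_size, num_classes=NUM_CLASSES):
    """Return (channels, N, N) for each of the three detection branches."""
    if input_size % 32 != 0:
        raise ValueError("input_size must be a multiple of 32")
    # 4 box coordinates + 1 objectness score + one score per class, per anchor
    channels = NUM_ANCHORS_PER_SCALE * (4 + 1 + num_classes)
    return [(channels, input_size // s, input_size // s) for s in STRIDES]


if __name__ == "__main__":
    for shape in head_shapes(416):
        print(shape)
```

With 274 classes each branch carries 3 × 279 = 837 channels, on grids of 13 × 13, 26 × 26, and 52 × 52 for a 416‑pixel input.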

Loss Function

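The loss comprises three terms: a bounding‑box coordinate loss, an objectness loss, and a classification loss. The classification term uses an independent logistic classifier per class rather than a softmax, so one box can carry several labels (multi‑label classification).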
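The classification term can be sketched as a sum of per‑class binary cross‑entropy values, which is how YOLOv3 is usually described. This is a minimal pure‑Python illustration for a single box, not the team's training code; the function names and the clipping constant `eps` are assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_bce(logits, targets, eps=1e-7):
    """Sum of independent per-class binary cross-entropy terms for one box.

    logits  : raw class scores, one per class
    targets : 0/1 labels, one per class (several may be 1 at once)
    """
    total = 0.0
    for z, t in zip(logits, targets):
        # Clip the probability so log() never sees exactly 0 or 1.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


if __name__ == "__main__":
    # Two positive labels on the same box are allowed.
    print(multilabel_bce([2.0, 1.5, -3.0], [1.0, 1.0, 0.0]))
```

Because each class is scored independently, adding a second positive label changes only that class's term, which a softmax would not allow.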

Data Optimization

Training data from OpenImages (600 classes) was expanded with ImageNet and filtered to 274 commercial classes. Class imbalance was addressed by oversampling rare classes and applying augmentations (rotation, Gaussian noise, stitching).
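One simple way to oversample rare classes is to repeat each sample in proportion to how under‑represented its class is. The sketch below assumes every sample has a single dominant class label, which is a simplification for illustration; the article does not describe the exact sampling scheme, and all names here are hypothetical.

```python
import math
from collections import Counter


def repeat_factors(labels):
    """Map each class to how many times its samples should be repeated."""
    counts = Counter(labels)
    if not counts:
        return {}
    largest = max(counts.values())
    # Rare classes are repeated until they roughly match the largest class.
    return {cls: math.ceil(largest / n) for cls, n in counts.items()}


def oversample(samples):
    """samples: list of (image_id, class_name) pairs (assumed format)."""
    factors = repeat_factors([cls for _, cls in samples])
    balanced = []
    for image_id, cls in samples:
        balanced.extend([(image_id, cls)] * factors[cls])
    return balanced


if __name__ == "__main__":
    data = [("img%d" % i, "cup") for i in range(6)]
    data += [("p0", "phone"), ("p1", "phone")]
    print(Counter(cls for _, cls in oversample(data)))
```

In practice the repeated samples would be passed through the augmentations mentioned above so the copies are not pixel‑identical.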

Model Optimizations

Adopted deformable convolutions to better model non‑rigid objects.

Adjusted loss weights to prioritize classification accuracy over precise localization.

Multi‑scale training (randomly selecting an input size of 320, 416, or 608 pixels).

K‑means clustering to generate 9 anchor boxes tailored to business‑relevant objects.

Label smoothing to reduce over‑confidence.
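The anchor clustering step in the list above can be sketched as k‑means over box widths and heights, using 1 − IoU as the distance as in the YOLO papers. This is an illustrative pure‑Python version; the function names, iteration limit, and seeding are assumptions, not the team's implementation.

```python
import random


def iou_wh(box, anchor):
    """IoU of two boxes given only as (width, height), aligned at a corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # requires len(boxes) >= k
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is equivalent to smallest (1 - IoU) distance.
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
                continue
            w = sum(b[0] for b in members) / len(members)
            h = sum(b[1] for b in members) / len(members)
            new_anchors.append((w, h))
        if new_anchors == anchors:
            break  # assignments have stabilized
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


if __name__ == "__main__":
    rng = random.Random(1)
    boxes = [(rng.uniform(10, 300), rng.uniform(10, 300)) for _ in range(200)]
    for w, h in kmeans_anchors(boxes):
        print(round(w), round(h))
```

The nine anchors, sorted by area, are conventionally split three per branch, with the smallest anchors going to the finest grid.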

Post‑Processing

Soft‑NMS (linear and Gaussian kernels) was used to handle overlapping detections more gracefully than standard NMS.
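A minimal sketch of Soft‑NMS with both kernels follows. Instead of discarding every box that overlaps the current best detection, it decays the neighbor's score as a function of the overlap. The parameter defaults (`iou_thresh`, `sigma`, `score_thresh`) are illustrative assumptions, not values reported in the article.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, method="gaussian", iou_thresh=0.3,
             sigma=0.5, score_thresh=0.001):
    """Return [(box, decayed_score), ...] in the order boxes were selected."""
    if method not in ("linear", "gaussian"):
        raise ValueError("method must be 'linear' or 'gaussian'")
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Pick the highest-scoring remaining box.
        best_idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_idx)
        kept.append((best_box, best_score))
        rescored = []
        for box, score in candidates:
            overlap = iou(best_box, box)
            if method == "linear":
                weight = 1.0 - overlap if overlap > iou_thresh else 1.0
            else:
                weight = math.exp(-(overlap ** 2) / sigma)
            score *= weight
            if score >= score_thresh:  # drop boxes whose score has decayed away
                rescored.append((box, score))
        candidates = rescored
    return kept


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    for box, score in soft_nms(demo_boxes, [0.9, 0.8, 0.7]):
        print(box, round(score, 3))
```

In the demo, the second box overlaps the first heavily, so it is kept with a reduced score rather than removed outright, which is the behavior that helps when two real objects sit close together in a frame.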

Deployment Platform

The trained model was containerized with Docker, exposed via an API, and integrated into an online scheduling platform. A single machine processes ~600 full‑length videos per day; a cluster handles >10,000 videos daily. Detection results are stored in a database, filtered, and fed to a point‑tagging system for ad placement.

Results

For both the 274‑class and 93‑class models, mAP improved significantly after optimization (see figures). The 93‑class model reached an mAP of 0.596.

Business Impact

Processing over 50,000 videos generated more than 90 million frame‑level detections. The 274 commercially valuable object classes were grouped by brand and ad creative, enabling targeted ads in dining, automotive, and mobile scenes (Coca‑Cola, Kangshifu, Ford, Samsung). Compared with subtitle‑based placements, object‑based slots achieved 3.5× the exposure while still delivering a 20% higher click‑through rate.
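To see what the two figures imply together, the short calculation below assumes that clicks equal exposures multiplied by click‑through rate and that the 20% figure is a relative uplift in that rate. Both are interpretations of the reported numbers, not statements from the article, and the function name is hypothetical.

```python
def relative_clicks(exposure_ratio, ctr_uplift):
    """Clicks relative to the baseline, assuming clicks = exposures * CTR."""
    return exposure_ratio * (1.0 + ctr_uplift)


if __name__ == "__main__":
    # 3.5x exposure with a 20% relative CTR uplift over subtitle-based slots
    print(relative_clicks(3.5, 0.20))
```

Under those assumptions, object‑based slots would yield roughly 4.2 times as many clicks as subtitle‑based placements.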

Conclusion and Outlook

Object detection has matured enough to be deployed in large‑scale video advertising. YOLOv3, fine‑tuned for high‑value objects, delivers both speed and accuracy, dramatically expanding ad inventory and improving user experience. Future work includes optimizing frame selection, incorporating LSTM and optical flow for temporal context, and extending detection to other domains such as fashion in video e‑commerce.
