Boosting Video Moderation with Multimodal CLIP and Efficient Vector Search
This article describes how a video review system combines a multimodal CLIP model, image-text feature alignment, and optimized vector-search databases such as RedisSearch and Elasticsearch to detect prohibited content in real time and to recall violations from historical footage at scale. Along the way it addresses three recurring challenges: generalization, storage cost, and inference speed.
Background
In video moderation, certain prohibited content—such as political defamation, gambling ads, or malicious propaganda—must be strictly controlled and removed. The system needs to (1) detect violations in newly uploaded videos and (2) recall historical videos when new standards emerge.
Solution Exploration
Initially, a simple image‑matching approach using a database of prohibited images was considered. However, this faced two major issues:
Low generalization: matching operated on whole images, so partial or localized violations were missed.
High storage and computation cost: a 512-dimensional float32 vector occupies 2 KB, so extracting features for millions of frames produces several gigabytes of data that must be kept in memory for fast similarity search.
To address these, two strategies were proposed:
Enhance generalization by adding textual descriptions to images and fine‑tuning a multimodal model so it can directly predict violations.
Separate the requirements for real‑time detection (small in‑memory index) and historical recall (large disk‑based store) to reduce memory usage.
Multimodal Violation Detection
A CLIP‑based model (Chinese‑CLIP) was adopted to align image and text features. The model consists of a RoBERTa encoder for text (12 layers, 102M parameters) and a Vision Transformer encoder for images (12 layers, 86M parameters). Both encoders produce embeddings in the same space, enabling cosine similarity calculations between any image‑image, image‑text, or text‑text pair.
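As a concrete illustration of the shared embedding space, the sketch below scores one keyframe against two candidate captions using the cn_clip package. The loading and tokenization calls follow the Chinese-CLIP README; the image path and captions are placeholders:

```python
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device)
model.eval()

image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)  # placeholder keyframe
# Captions: "URL/QQ black-market ad" vs. "ordinary landscape photo"
texts = clip.tokenize(["网址QQ黑产广告", "正常风景照片"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    # L2-normalize so that dot products are cosine similarities
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = img_feat @ txt_feat.T  # shape (1, 2): image-text cosine similarities
```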
Model Overview
During fine‑tuning, the image encoder is initially frozen while the text encoder is trained, then both are jointly optimized using a contrastive loss that pushes paired image‑text embeddings together and pulls mismatched pairs apart.
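The objective is the standard CLIP-style symmetric InfoNCE loss. A minimal sketch, assuming both batches of embeddings are already L2-normalized; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim), L2-normalized; row i of each is a matched pair
    logits = img_emb @ txt_emb.T / temperature          # pairwise similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # pull each image toward its caption
    loss_t2i = F.cross_entropy(logits.T, targets)       # and each caption toward its image
    return (loss_i2t + loss_t2i) / 2
```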
System Integration
The workflow includes:
Building a multimodal sample library of prohibited images and texts.
Real‑time detection: extract features from video keyframes, perform K‑nearest‑neighbor (KNN) search against the sample library, and flag frames that exceed a similarity threshold.
Historical recall: store all frame features in a persistent vector store and run batch KNN queries when new violation criteria appear.
Thresholds are periodically adjusted based on precision and recall metrics from human reviews.
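A brute-force sketch of the per-frame decision: the KNN step is shown as an in-memory matrix product for clarity, whereas the deployed system runs it against the vector database. The helper name and threshold value are illustrative:

```python
import numpy as np

SIM_THRESHOLD = 0.75  # illustrative; re-tuned from human-review precision/recall

def flag_frame(frame_emb, library, labels, k=10):
    """Return (label, similarity) pairs for library hits above the threshold.

    frame_emb: (dim,) L2-normalized keyframe embedding
    library:   (n, dim) L2-normalized embeddings of the prohibited-sample library
    labels:    n violation labels aligned with the library rows
    """
    sims = library @ frame_emb              # cosine similarity to every sample
    top = np.argsort(-sims)[:k]             # K nearest neighbors
    return [(labels[i], float(sims[i])) for i in top if sims[i] >= SIM_THRESHOLD]
```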
Fine‑Tuning for Black‑Market Ads
To improve detection of black‑market advertising (URLs, phone numbers, etc.), a curated dataset of ~3K images was prepared, augmented to ~12K samples, and split 8:1:1 for training, validation, and testing. The data were stored in TSV and JSONL formats:
TSV (one image per line: image ID, then the base64-encoded image):

```
1000002	/9j/4AAQSkZJ...YQj7314oA//2Q==
```

JSONL (one caption per line: text ID, caption, and the IDs of matching images; the caption here means "URL/QQ black-market ad"):

```
{"text_id": 8428, "text": "网址QQ黑产广告", "image_ids": [1076345, 517602]}
```

Fine-tuning was executed with a single-GPU script:
```
python -m torch.distributed.launch --use_env --nproc_per_node=1 --nnodes=1 \
    --node_rank=0 --master_addr=localhost --master_port=8514 \
    cn_clip/training/main.py \
    --train-data=datasets/AD/lmdb/train \
    --val-data=datasets/AD/lmdb/valid \
    --resume=pretrained_weights/clip_cn_vit-b-16.pt \
    --reset-data-offset --reset-optimizer \
    --logs=experiments/ \
    --name=ad_finetune_vit-b-16_roberta-base_bs128_1gpu_2 \
    --save-step-frequency=999999 --save-epoch-frequency=30 \
    --log-interval=1 --report-training-batch-acc \
    --context-length=52 --warmup=100 \
    --batch-size=96 --valid-batch-size=96 \
    --valid-step-interval=150 --valid-epoch-interval=1 \
    --accum-freq=1 --lr=2e-6 --wd=0.001 --max-epochs=150 \
    --vision-model=ViT-B-16 --use-augment \
    --text-model=RoBERTa-wwm-ext-base-chinese
```

Evaluation showed substantial improvements:
Image-to-text recall increased from 67.79% to 98.88% (+31.09 percentage points).
Text-to-image recall increased from 37.68% to 88.53% (+50.85 percentage points).
Real‑time monitoring uses image‑to‑text retrieval, while historical recall uses text‑to‑image retrieval.
Vector Search – Fast Retrieval for Massive Data
Storing 512-dimensional vectors for millions of frames adds up quickly: each float32 vector occupies 2 KB, which comes to roughly 3 GB per day, so the database must be chosen carefully. Four options were evaluated:
RedisSearch – in‑memory, supports HNSW, low operational cost.
Elasticsearch – hybrid memory/disk, supports HNSW, low operational cost.
Milvus – rich index types, high operational cost.
Faiss – library only, no storage, low operational cost.
RedisSearch was chosen for real-time detection because the prohibited-sample library is small enough to hold entirely in memory and deployment is simple, while Elasticsearch was selected for historical recall because it stores vectors on disk and can filter candidates with Lucene indexes before computing vector similarity.
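A sketch of what the real-time side might look like with redis-py's vector-search API. The index name, key prefix, field names, and HNSW parameters are assumptions; the FT.CREATE and KNN query shapes are RedisSearch's documented interface:

```python
import numpy as np
import redis
from redis.commands.search.field import TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# HNSW index over 512-d float32 vectors with cosine distance (names illustrative)
r.ft("prohibited").create_index(
    (
        TagField("label"),
        VectorField("embedding", "HNSW", {
            "TYPE": "FLOAT32", "DIM": 512, "DISTANCE_METRIC": "COSINE",
        }),
    ),
    definition=IndexDefinition(prefix=["sample:"], index_type=IndexType.HASH),
)

# A random vector stands in for a real CLIP embedding in this sketch
rng = np.random.default_rng(0)
vec = rng.standard_normal(512).astype(np.float32)

# Insert one prohibited sample, then fetch a frame's 10 nearest neighbors
r.hset("sample:1", mapping={"label": "black_market_ad", "embedding": vec.tobytes()})
q = (Query("*=>[KNN 10 @embedding $vec AS score]")
     .sort_by("score")
     .return_fields("label", "score")
     .dialect(2))
hits = r.ft("prohibited").search(q, query_params={"vec": vec.tobytes()})
```

On the historical side, Elasticsearch 8.x exposes approximate kNN over dense_vector fields through its standard search API, so Lucene term filters (for example, by upload date or channel) can narrow the candidate set before any vector arithmetic runs.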
Inference Acceleration
Initial PyTorch inference took ~369 ms per image on an Intel Xeon Gold 5218R CPU. Converting to ONNX FP16 actually slowed inference, but running through OpenVINO yielded significant speedups:
OpenVINO ONNX FP32: 127 ms (≈190% faster).
OpenVINO IR FP16: 113 ms (≈227% faster).
The FP16 conversion also halved the model size, and storing frame features at half precision cuts daily vector storage from ~3 GB to ~1.5 GB.
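A minimal sketch of serving the converted encoder through OpenVINO's Python runtime; the model file name is a placeholder, and the input shape assumes ViT-B/16 preprocessing of 224×224 images:

```python
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("clip_image_encoder_fp16.xml")   # placeholder IR path (.xml + .bin)
compiled = core.compile_model(model, device_name="CPU")

frame = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stands in for a preprocessed keyframe
embedding = compiled([frame])[compiled.output(0)]          # 512-d image feature
```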
Future Outlook
Planned enhancements include:
Deploying GPU‑accelerated TensorRT inference for further speed gains.
Leveraging Milvus’s disk‑based and GPU‑based indexes for historical and real‑time search respectively.
Applying higher‑ratio knowledge distillation to the fine‑tuned Chinese‑CLIP model for additional compression.