
OCR-Based Video Review System: Technology Selection, Optimization, and Model Fine-Tuning

An OCR-based video review system built on PaddleOCR's DB detector and SVTR recognizer combines multi-level frame deduplication, message-queue task decoupling, Redis prioritization, and dynamic thread-pool scheduling. Deduplication cuts the daily frame volume from 794 million to 3.6 million, and the models were fine-tuned on about 5,000 samples. The system automatically detects over 230 abnormal videos per day, a workload equivalent to that of three manual reviewers; future plans include GPU acceleration and cross-instance gRPC dispatch.

Sohu Tech Products

Dec 27, 2023

In traditional video review scenarios, manual inspection of text in video frames is time-consuming and inconsistent due to human variability. To address this, an OCR-based automated solution is proposed.

The solution involves: (1) using OCR to recognize text in video frames; (2) matching recognized text against a keyword library; (3) blocking videos that match keywords and saving screenshots for annotation; (4) referencing OCR results during manual review to produce final outcomes.
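Steps (2) and (3) can be illustrated with a minimal sketch. It assumes the OCR stage has already produced one text string per frame; the names `ReviewResult`, `normalize`, and `review_video` are hypothetical and not taken from the original system, and plain substring matching stands in for whatever matching logic the production keyword library uses.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewResult:
    video_id: str
    blocked: bool
    # (frame_index, matched_keyword) pairs kept as evidence for annotators
    hits: list = field(default_factory=list)


def normalize(text):
    """Lowercase and remove all whitespace so spacing differences do not hide a match."""
    return "".join(text.lower().split())


def review_video(video_id, frame_texts, keywords):
    """Match per-frame OCR text against a keyword list.

    frame_texts: list of OCR strings, one per frame (assumed input shape).
    keywords: iterable of blocked phrases (the "keyword library").
    """
    needles = [(kw, normalize(kw)) for kw in keywords]
    hits = []
    for idx, text in enumerate(frame_texts):
        haystack = normalize(text)
        for kw, needle in needles:
            # Skip empty keywords, which would match every frame
            if needle and needle in haystack:
                hits.append((idx, kw))
    # A video is blocked as soon as any frame contains any keyword
    return ReviewResult(video_id=video_id, blocked=bool(hits), hits=hits)
```

Calling `review_video("v1", ["Hello world", "Win FREE money now"], ["free money"])` would flag the video and record frame 1 as the evidence frame, which is the screenshot a reviewer would then inspect in step (4).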

Challenges identified include the massive volume of video frame data (approx. 794 million frames per day), the high computational cost and latency of OCR algorithms, and the need for continual model upgrades to improve accuracy.

Technology selection compared end-to-end approaches (e.g., PGNet) with two-stage OCR toolkits (EasyOCR, ChineseOCR, PaddleOCR). PaddleOCR was chosen for its higher accuracy, support for custom training, ONNX compatibility, and suitability for Java-based deployment via the DJL framework.

Further analysis of detection and recognition models favored DB for text detection (better performance in complex scenes) and SVTR for text recognition (superior accuracy, especially on small text or irregular character spacing).

To reduce frame processing volume, three deduplication strategies were applied: video-level MD5 hashing (≈28% reduction), keyframe extraction via frame differencing (≈99.1% reduction), and keyframe feature deduplication using Chinese-CLIP and Milvus (≈30% reduction), bringing daily processing from ~794 million to ~3.6 million frames.
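The first two stages and the overall funnel can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: frames are represented as flat lists of grayscale pixel values, the difference metric (mean absolute difference against the last kept frame) and the `threshold` value are assumed, and the third stage (Chinese-CLIP embeddings searched in Milvus) is omitted. The `funnel` helper simply checks that the three reported reduction rates compound to the reported daily total.

```python
import hashlib


def dedupe_videos(videos):
    """Stage 1: keep one video per MD5 digest of its bytes.

    videos: iterable of (name, raw_bytes) pairs.
    """
    seen, unique = set(), []
    for name, data in videos:
        digest = hashlib.md5(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


def select_keyframes(frames, threshold=10.0):
    """Stage 2: keep a frame only when it differs enough from the last kept frame.

    frames: list of equally sized lists of grayscale pixel values.
    threshold: assumed mean-absolute-difference cutoff.
    """
    kept, last = [], None
    for idx, frame in enumerate(frames):
        if last is None:
            kept.append(idx)  # always keep the first frame
            last = frame
            continue
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff > threshold:
            kept.append(idx)
            last = frame
    return kept


def funnel(total, reductions):
    """Apply successive fractional reductions to a starting count."""
    remaining = total
    for r in reductions:
        remaining *= 1.0 - r
    return remaining


# Reported rates: 28% (MD5), 99.1% (keyframes), 30% (feature dedup)
daily_frames = funnel(794e6, [0.28, 0.991, 0.30])
```

With the reported rates, `daily_frames` comes out at roughly 3.6 million, so the three percentages and the before/after totals in the text are mutually consistent.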

To accelerate OCR response, three engineering measures were implemented: a message queue (MQ) decouples task submission from processing; a Redis ZSet prioritizes the most recent tasks; and a thread pool in which every thread can perform both detection and recognition adjusts the workload dynamically to maximize CPU utilization and avoid bottlenecks.
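The scheduling idea can be sketched in a few lines. To stay self-contained, the example replaces Redis with an in-memory heap that mimics a ZSet scored by submission time (popping the highest score, as `ZPOPMAX` would), and `detect` and `recognize` are placeholders for the DB and SVTR models. All class and function names here are hypothetical; the production system's actual queue layout and balancing logic are not described in enough detail to reproduce.

```python
import heapq
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class RecentFirstQueue:
    """In-memory stand-in for a Redis ZSet scored by submission time."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()
        self._counter = itertools.count()  # tie-breaker for equal scores

    def add(self, task_id, submitted_at):
        with self._lock:
            # Negate the score so the newest task pops first
            heapq.heappush(self._heap, (-submitted_at, next(self._counter), task_id))

    def pop_most_recent(self):
        with self._lock:
            if not self._heap:
                return None
            return heapq.heappop(self._heap)[2]


def detect(frame):
    """Placeholder for text detection: returns fake text regions."""
    return [f"{frame}:box0", f"{frame}:box1"]


def recognize(region):
    """Placeholder for text recognition on one region."""
    return f"text({region})"


def worker(queue, results, lock):
    """Each thread runs both stages, so no thread idles waiting on the other stage."""
    while True:
        task = queue.pop_most_recent()
        if task is None:
            return
        texts = [recognize(region) for region in detect(task)]
        with lock:
            results[task] = texts


def run_pool(queue, num_threads=4):
    """Drain the queue with a fixed pool and return {task_id: recognized_texts}."""
    results, lock = {}, threading.Lock()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(worker, queue, results, lock) for _ in range(num_threads)]
        for future in futures:
            future.result()  # re-raise any worker exception
    return results
```

Letting every worker handle detection and recognition avoids a fixed split of threads between the two stages, which is one plausible reading of how the dynamic workload adjustment keeps the CPU busy.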

Model fine-tuning was conducted on ~5000 annotated samples for both detection and recognition, using PaddleOCR's pre-trained models with Cosine/Piecewise learning rate schedules, L2 regularization, and a 6:2:2 train/validation/test split. Fine-tuning improved detection and recognition accuracy on domain-specific data.
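As a small illustration of the 6:2:2 split, the sketch below shuffles a sample list and cuts it into train, validation, and test partitions. The function name `split_622` and the fixed `seed` are assumptions made for reproducibility of the example; the original article does not say how the split was drawn.

```python
import random


def split_622(samples, seed=42):
    """Shuffle and split samples into 60% train, 20% validation, 20% test."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # local RNG, does not touch global state
    n = len(items)
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no sample is lost to rounding
    return train, val, test


train, val, test = split_622(range(5000))
```

For 5,000 samples this yields partitions of 3,000, 1,000, and 1,000 items.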

Results show that by September 2023 the system had processed over 900,000 videos (904,606), handling more than 7,000 videos per day and flagging over 230 abnormal videos daily, a workload equivalent to that of three manual reviewers.

Future work includes GPU acceleration of OCR, upgrading scheduling to cross-instance gRPC-based dispatch with fast/slow queues, and further optimizing latency and resource utilization.

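Excerpt of the fine-tuning configuration (PaddleOCR YAML):

Global:
  pretrained_model: XXX.pdparams   # path to the pre-trained model
Optimizer:
  lr:
    name: Cosine
    learning_rate: 0.001   # base learning rate
    warmup_epoch: 2   # number of warm-up epochs
  regularizer:
    name: 'L2'   # L2 regularization to limit overfitting
    factor: 0   # regularization coefficient; a value of 0 applies no L2 penalty
Train:
  loader:
    shuffle: True
    drop_last: False   # whether to drop the last incomplete mini-batch when the dataset size is not divisible by batch_size
    batch_size_per_card: 4   # batch size per card
    num_workers: 4   # number of subprocesses for data loading; 0 loads data in the main process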
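To show what a `Cosine` schedule with `learning_rate: 0.001` and `warmup_epoch: 2` looks like in practice, here is a generic sketch of linear warm-up followed by cosine decay. It works per epoch for readability, whereas training frameworks typically update per step, and `total_epochs=100` is an assumed value that does not appear in the configuration excerpt. PaddleOCR's own scheduler may differ in detail.

```python
import math


def lr_at_epoch(epoch, base_lr=0.001, warmup_epoch=2, total_epochs=100):
    """Learning rate at the start of a 0-based epoch.

    total_epochs is an assumption; the configuration excerpt does not state it.
    """
    if epoch < warmup_epoch:
        # Linear warm-up from 0 up to base_lr
        return base_lr * epoch / warmup_epoch
    # Cosine decay from base_lr down to 0 over the remaining epochs
    progress = (epoch - warmup_epoch) / (total_epochs - warmup_epoch)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(101)]
```

The rate climbs to the base value over the first two epochs and then falls smoothly toward zero, which is why a small base rate with a short warm-up is a common choice when fine-tuning from pre-trained weights.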

Fine-tune the detection model from the pre-trained weights:

python tools/train.py -c configs/det/xxx.yml -o Global.pretrained_model="./pretrain_models/"

Evaluate the best checkpoint:

python tools/eval.py -c configs/det/xxx.yml -o Global.checkpoints="./output/det/best_accuracy"

Export the fine-tuned model for inference:

python tools/export_model.py -c ./output/det/xxx.yml -o Global.pretrained_model="./output/det/best_accuracy" Global.save_inference_dir="./output/det_inference/"
