How a New Bilingual Video Text Dataset and Transformer Spotter Advance Video OCR
This article reviews the NeurIPS 2021 paper introducing BOVText, a large-scale bilingual video-text dataset with over 2,000 videos and 1.75 million frames, together with the paper's transformer-based end-to-end video text spotter, which integrates EAST-style angle encoding into DETR. It covers dataset collection, annotation, model architecture, and experimental results.
Background
Text reading comprehension is a widely studied problem in computer vision. While image OCR has achieved high accuracy thanks to deep learning, video OCR remains challenging due to limited datasets, diverse scenarios, and the need for simultaneous detection, tracking, and recognition.
Dataset: BOVText
The paper proposes BOVText, a bilingual (Chinese‑English) video‑text dataset containing more than 2,000 videos and 1,750,000 video frames collected from KuaiShou and YouTube. The dataset covers 32 open‑domain scene categories (e.g., Vlog, games, sports) and provides four annotation types for each text instance:
Rotated bounding box (position)
Instance ID (temporal tracking)
Text transcription
Text category (caption, title, scene text)
The data were split roughly 8:2 into 1,541 training videos and 480 test videos.
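To make the four annotation types concrete, a hypothetical record for a single text instance might look like the following. The field names and structure are illustrative assumptions for this article, not BOVText's actual schema:

```python
# Hypothetical sketch of one per-frame text-instance annotation.
# Field names are illustrative, not the dataset's actual schema.
annotation = {
    "frame_id": 120,
    "instance_id": 7,             # stays constant across frames (temporal tracking)
    "rotated_box": {              # position as a rotated bounding box
        "cx": 412.0, "cy": 188.5,
        "width": 96.0, "height": 24.0,
        "angle_deg": -12.3,
    },
    "transcription": "欢迎观看",    # bilingual: Chinese or English text
    "category": "caption",        # one of: caption, title, scene text
}

def is_same_trajectory(a, b):
    """Two records belong to the same text trajectory iff they share an instance ID."""
    return a["instance_id"] == b["instance_id"]
```

The instance ID is what turns per-frame boxes into trajectories: grouping records by it yields one track per text instance.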
Algorithm Overview
The authors introduce a transformer‑based video text spotter that merges EAST‑style angle encoding into a DETR‑like architecture. The model consists of two main parts:
Video text tracking: Inspired by TransTrack, the network has a detection branch that produces detection boxes for the current frame and a tracking branch that queries object features from the previous frame to predict track boxes. An angle prediction head (from EAST) is added to the decoder to handle rotated boxes.
Text recognition: An attention-based recognizer follows the design of recent scene-text recognition works.
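The two-branch tracking design can be sketched in PyTorch as follows. Dimensions, head counts, and layer choices are illustrative assumptions, not the paper's actual configuration; the point is the structure: a shared decoder driven by learned object queries for detection and by the previous frame's features for tracking, each with a box head and an EAST-style angle head.

```python
import torch
import torch.nn as nn

class TwoBranchDecoder(nn.Module):
    """Minimal sketch of a TransTrack-style two-branch decoder with an added
    angle head. Sizes are illustrative, not the paper's configuration."""

    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # One decoder, shared by the detection and tracking branches.
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.object_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h)
        self.angle_head = nn.Linear(d_model, 1)  # EAST-style rotation angle

    def forward(self, memory, prev_features=None):
        b = memory.size(0)
        # Detection branch: learned object queries for the current frame.
        det = self.decoder(self.object_queries.unsqueeze(0).expand(b, -1, -1), memory)
        out = {"det_boxes": self.box_head(det), "det_angles": self.angle_head(det)}
        # Tracking branch: queries come from the previous frame's features.
        if prev_features is not None:
            trk = self.decoder(prev_features, memory)
            out["track_boxes"] = self.box_head(trk)
            out["track_angles"] = self.angle_head(trk)
        return out
```

On the first frame there are no previous features, so only detection boxes are produced; from the second frame on, both sets come out of the same forward pass.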
Architecture Details
The backbone and encoder are similar to deformable DETR. The decoder is split into two shared‑weight branches: a track decoder that receives queries from the previous frame’s detection features, and an object decoder that receives learned object queries for the current frame. After decoding, the model outputs a set of detection boxes and a set of track boxes, which are matched via IoU to obtain track IDs.
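The final IoU-matching step, which links detection boxes to track boxes to carry IDs forward, can be illustrated with a minimal greedy sketch. The threshold and the greedy rule here are assumptions for illustration, not the paper's exact procedure:

```python
def iou(a, b):
    """IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def assign_track_ids(track_boxes, det_boxes, next_id, iou_thresh=0.5):
    """Greedy sketch: each detection inherits the ID of its best-overlapping
    track box; detections with no match above the threshold start new tracks.

    track_boxes: {track_id: box} from the tracking branch
    det_boxes:   list of boxes from the detection branch
    """
    ids, used = [], set()
    for det in det_boxes:
        best_id, best_iou = None, iou_thresh
        for tid, trk in track_boxes.items():
            if tid in used:
                continue
            score = iou(det, trk)
            if score >= best_iou:
                best_id, best_iou = tid, score
        if best_id is None:
            best_id, next_id = next_id, next_id + 1  # unmatched: new track
        else:
            used.add(best_id)
        ids.append(best_id)
    return ids, next_id
```

For example, given tracks `{3: (0, 0, 10, 10), 5: (50, 50, 60, 60)}` and detections `[(1, 1, 11, 11), (100, 100, 110, 110)]`, the first detection inherits ID 3 and the second starts a new track.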
Bipartite Multi‑orient Box Matching
For each predicted box, a Hungarian algorithm solves a bipartite matching problem against ground‑truth boxes. The matching cost combines classification loss, L1 box loss, GIoU loss, and an additional angle loss from EAST, enabling optimal pairing of rotated boxes with objects.
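A toy sketch of that combined matching cost follows, using axis-aligned GIoU and brute-force enumeration over assignments in place of a true Hungarian solver. The cost weights and the prediction/ground-truth field names are illustrative assumptions; the paper's coefficients may differ:

```python
from itertools import permutations

def giou(a, b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box, for the GIoU penalty term.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose

def match_cost(pred, gt, w_cls=1.0, w_l1=5.0, w_giou=2.0, w_angle=1.0):
    """Combined cost: classification + L1 box + GIoU + EAST-style angle term."""
    cls_cost = -pred["prob"]  # confident predictions are cheaper to match
    l1_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    giou_cost = 1.0 - giou(pred["box"], gt["box"])
    angle_cost = abs(pred["angle"] - gt["angle"])
    return w_cls * cls_cost + w_l1 * l1_cost + w_giou * giou_cost + w_angle * angle_cost

def optimal_matching(preds, gts):
    """Exhaustive bipartite matching (fine for toy sizes); the paper uses the
    Hungarian algorithm, which finds the same optimum in polynomial time."""
    best, best_total = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if total < best_total:
            best, best_total = list(perm), total
    return best, best_total
```

The angle term is what distinguishes this from plain DETR matching: two rotated boxes with identical centers and sizes but different orientations are no longer interchangeable.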
Experiments
Two experimental settings are reported:
Benchmark: The proposed method is evaluated on BOVText and compared with previous video-text datasets (ICDAR-2015 video, YVT, RoadText-1K). Results show large performance variance across scenarios; e.g., the highest tracking accuracy (88.4 %) is achieved on the Fishery scenario, while the lowest (46.7 %) occurs on Sports due to dense, low-contrast scene text.
Method analysis: Additional experiments on three external datasets demonstrate modest advantages of the proposed pipeline over a simple detect-then-match baseline. The authors note that many easy scenarios (captions, clear text) still dominate the benchmark, leaving room for improvement on challenging scenes.
Conclusion
The paper presents BOVText, a large‑scale bilingual video‑text benchmark supporting four tasks: detection, recognition, tracking, and end‑to‑end spotting. It also introduces a transformer‑based tracking‑recognition algorithm that incorporates EAST angle encoding into DETR. Experiments confirm the dataset’s usefulness for evaluating video OCR methods and highlight the need for stronger models to handle diverse, real‑world video scenarios.
Resources: Paper – https://arxiv.org/pdf/2112.04888.pdf; BOVText Benchmark – https://github.com/weijiawu/BOVText-Benchmark; TransVTSpotter code – https://github.com/weijiawu/TransVTSpotter.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.