ICDAR 2024 Historical Map Text Recognition Competition: DNTextSpotter Methodology and Results
The ICDAR 2024 Historical Map Text Recognition competition was won by Bilibili's DNTextSpotter, a Transformer‑based text spotter built on DeepSolo with a ViTAE‑v2 backbone. The model combines a deformable self‑attention encoder, a dual‑query decoder, and denoising training with a mixed‑vocabulary fine‑tuning strategy, achieving state‑of‑the‑art detection and recognition of dense, rotated, and arbitrarily shaped text on historical maps under the competition's strict PDQ/PWQ/PCQ metrics, and it transfers well to real‑world multimedia scenarios.
The ICDAR competition (https://rrc.cvc.uab.es/) is an internationally recognized benchmark for scene‑text detection and recognition. Each year it hosts several tracks, and the 2024 Historical Map Text Recognition track (https://rrc.cvc.uab.es/?ch=28) focuses on extracting dense, rotated, and arbitrarily shaped text from rasterized historical maps.
The competition consists of four tasks: (1) dense word detection, (2) short‑phrase detection, (3) word detection + recognition, and (4) short‑phrase detection + recognition. Bilibili AI Platform took first place in Tasks 1, 3, and 4 and second place in Task 2, with margins over the runner‑up ranging from 1.1% to 7.73%.
Task 3 (word detection + recognition) is described in detail. The team introduced a self‑developed model called DNTextSpotter (paper accepted at ACM MM 2024). The architecture builds on DeepSolo and uses ViTAE‑v2 as the backbone, a deformable self‑attention encoder, and a dual‑query decoder with two query types: matching queries and denoising queries. Denoising queries are used only during training.
Key technical contributions include:
Adapting the Transformer‑based DETR paradigm for scene‑text by addressing bipartite matching instability with denoising training.
Designing a new denoising training scheme tailored to arbitrary‑shaped text.
Introducing a mixed‑training strategy that first fine‑tunes a small, case‑insensitive vocabulary on public data, then expands to a larger vocabulary for the historical‑map dataset.
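The denoising idea above can be illustrated conceptually: ground‑truth Bézier control points are jittered and fed to the decoder as extra queries that bypass bipartite matching, since each noised query is supervised directly by the ground truth it was generated from. A minimal sketch (the noise scale and group count are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def make_denoising_queries(gt_ctrl_pts, noise_scale=0.02, groups=2, rng=None):
    """Build denoising queries by jittering ground-truth Bezier control points.

    Unlike matching queries, each denoising query has a known target instance,
    so no bipartite matching is needed for its supervision.
    Conceptual sketch only; hyperparameters here are assumptions.
    """
    rng = rng or np.random.default_rng(0)
    gt = np.asarray(gt_ctrl_pts, dtype=float)        # (num_inst, 4, 2), normalized coords
    noised = np.repeat(gt[None], groups, axis=0)     # one jittered copy per denoising group
    noised += rng.normal(0.0, noise_scale, noised.shape)
    # Each query remembers which ground-truth instance supervises it.
    labels = np.tile(np.arange(len(gt)), (groups, 1))
    return noised.clip(0.0, 1.0), labels
```

Because the query-to-target assignment is fixed by construction, the denoising branch gives the decoder a stable reconstruction signal even when bipartite matching for the ordinary queries is still unstable early in training.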
Data preprocessing converts polygon annotations into Bézier control points, matching DNTextSpotter's expected input format. Data augmentation (random rotation within ±45°, multi‑scale resizing, and brightness/contrast/saturation jitter) compensates for the small training set (≈200 images).
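The polygon‑to‑Bézier conversion step can be sketched as a least‑squares fit of a cubic curve to each ordered boundary polyline. This is a simplified illustration (`fit_cubic_bezier` is a hypothetical helper, not the team's actual preprocessing code):

```python
import numpy as np

def fit_cubic_bezier(points):
    """Least-squares fit of a cubic Bezier curve to ordered boundary points.

    Returns 4 control points; the endpoints are pinned to the first and last
    input points, and the two interior control points are solved for.
    Illustrative sketch, not the competition preprocessing pipeline.
    """
    points = np.asarray(points, dtype=float)
    # Chord-length parameterization of the polyline in [0, 1].
    d = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(d)]) / d.sum()
    # Bernstein basis matrix for a cubic Bezier curve.
    B = np.stack([(1 - t) ** 3,
                  3 * t * (1 - t) ** 2,
                  3 * t ** 2 * (1 - t),
                  t ** 3], axis=1)
    p0, p3 = points[0], points[-1]
    # Move the pinned endpoint terms to the right-hand side,
    # then solve only for the two interior control points.
    rhs = points - np.outer(B[:, 0], p0) - np.outer(B[:, 3], p3)
    interior, *_ = np.linalg.lstsq(B[:, 1:3], rhs, rcond=None)
    return np.vstack([p0, interior, p3])
```

Each side of a word polygon (top and bottom boundary) would be fitted separately, yielding the 8 control points per instance that Bézier‑based spotters typically consume.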
Loss functions combine Focal Loss for binary text/background classification, CTC Loss for sequence transcription, and L1 Loss for coordinate regression.
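A hedged sketch of how such a combined objective might be wired together in PyTorch. The loss weights, tensor shapes, and helper names here are assumptions for illustration, not the competition's actual training code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss for text/background classification."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def spotting_loss(cls_logits, cls_targets,
                  rec_log_probs, rec_targets, input_lens, target_lens,
                  pred_pts, gt_pts,
                  w_cls=1.0, w_rec=1.0, w_pts=1.0):
    """Weighted sum of the three terms described above (weights are assumptions).

    - focal loss on text/background classification
    - CTC loss on the transcription sequence (rec_log_probs: (T, N, C) log-probs)
    - L1 loss on predicted control-point coordinates
    """
    l_cls = focal_loss(cls_logits, cls_targets)
    l_rec = F.ctc_loss(rec_log_probs, rec_targets, input_lens, target_lens)
    l_pts = F.l1_loss(pred_pts, gt_pts)
    return w_cls * l_cls + w_rec * l_rec + w_pts * l_pts
```

In practice each term would be computed per matched query (or per denoising query) and averaged over the batch; the sketch shows only how the three components compose.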
Evaluation metrics are stricter than typical OCR benchmarks. Detection quality is measured by PDQ (Panoptic Detection Quality), which multiplies tightness (average IoU of true positives) and F‑score. For joint detection‑recognition, PWQ (Panoptic Word Quality) incorporates PDQ‑style tightness, polygon existence, and word‑level transcription accuracy. PCQ (Panoptic Character Quality) further adds a character‑level normalized edit distance (NED) component to assess fine‑grained recognition.
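The tightness‑times‑F‑score structure of PDQ can be made concrete with a toy scorer. This simplified sketch assumes one‑to‑one prediction/ground‑truth matching has already been computed and is not the official competition evaluator:

```python
def pdq(matched_ious, num_pred, num_gt, iou_thresh=0.5):
    """Simplified Panoptic Detection Quality: tightness x F-score.

    matched_ious: IoU of each one-to-one matched prediction/GT pair.
    A pair counts as a true positive only if its IoU clears the threshold;
    tightness is the mean IoU over true positives, so loose boxes are
    penalized even when detection recall/precision are perfect.
    Toy illustration only, not the official scorer.
    """
    tp_ious = [iou for iou in matched_ious if iou >= iou_thresh]
    tp = len(tp_ious)
    if tp == 0:
        return 0.0
    tightness = sum(tp_ious) / tp
    precision = tp / num_pred
    recall = tp / num_gt
    f_score = 2 * precision * recall / (precision + recall)
    return tightness * f_score
```

For example, a detector that finds every word but with boxes of IoU 0.8 scores 0.8 rather than 1.0, which is what makes these metrics stricter than plain F‑score; PWQ and PCQ layer transcription accuracy and character‑level NED on top of this same structure.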
Experimental results show DNTextSpotter achieving state‑of‑the‑art performance across multiple public benchmarks and securing first place in several ICDAR 2024 tasks. Visualizations demonstrate robust detection and recognition of extremely dense, rotated, and curved text instances.
Beyond the competition, the model is applied to various Bilibili scenarios: multi‑text recognition in video/live streams, large‑scale batch video text annotation, brand detection in complex scenes, and recognition of exotic fonts or handwritten text. These applications illustrate the method’s practicality for real‑world multimedia content analysis.
In conclusion, the competition validates the effectiveness of DNTextSpotter and the mixed‑training strategy. Future work will focus on improving long‑tail class recognition, further optimizing data augmentation, and exploring newer Transformer variants.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.