DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training

DNTextSpotter is an arbitrary-shaped scene text spotting model built on the DETR architecture. Its improved denoising training scheme adds noise to Bézier control points and employs Mask Sliding Character queries, yielding significant benchmark gains at no extra inference cost and enabling robust text recognition in challenging environments.

Bilibili Tech
This article introduces DNTextSpotter, an arbitrary-shaped scene text spotting model jointly developed by the Bilibili AI Platform and Soochow University and published at ACM MM. The work builds on the DETR architecture and proposes an improved denoising training strategy to address the instability of bipartite graph matching.

The motivation is that denoising training has shown strong performance in generic object detection, but scene text spotting faces additional challenges: text shapes are arbitrary, and the model must also perform recognition. By designing a denoising task that respects the geometric properties of text curves, the authors achieve an 11.3% gain on the Inverse-Text dataset without changing the data or augmentations.

Method Overview

The overall architecture follows the classic DETR pipeline (backbone, Transformer encoder, decoder). The decoder input is split into a Matching Part (supervised through bipartite matching) and a Denoising Part (whose loss is computed directly against ground truth). The key contributions lie in the Denoising Part:

Noised Positional Queries Generation – The process extracts Bézier control points of the text instance’s center curve, adds noise to these points, uniformly samples T points (T = max text length), and passes them through a positional‑encoding layer and an MLP to obtain the final queries. Adding noise at the control‑point level preserves a smooth positional prior, which improves training stability.
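The control-point step above can be sketched as follows. This is a rough NumPy illustration, not the paper's implementation: the cubic (4-point) Bézier form, the uniform noise range, and the parameter names (`noise_scale`, `T`) are assumptions, and in the real model the sampled points would then pass through the sinusoidal positional-encoding layer and MLP.

```python
import numpy as np

def cubic_bezier(ctrl, t):
    """Evaluate a cubic Bézier curve (4 control points, shape [4, 2])
    at parameter values t (shape [T])."""
    b0 = (1 - t) ** 3
    b1 = 3 * (1 - t) ** 2 * t
    b2 = 3 * (1 - t) * t ** 2
    b3 = t ** 3
    return (b0[:, None] * ctrl[0] + b1[:, None] * ctrl[1]
            + b2[:, None] * ctrl[2] + b3[:, None] * ctrl[3])

def noised_positional_points(ctrl, T=25, noise_scale=0.02, rng=None):
    """Perturb the center-curve control points, then uniformly sample
    T points along the perturbed curve (T = max text length).
    The [T, 2] output would be fed to positional encoding + MLP."""
    rng = rng or np.random.default_rng()
    # noise is added at the control-point level, so the sampled
    # positions still lie on a smooth curve
    noisy_ctrl = ctrl + rng.uniform(-noise_scale, noise_scale, ctrl.shape)
    t = np.linspace(0.0, 1.0, T)
    return cubic_bezier(noisy_ctrl, t)
```

Perturbing the four control points rather than the T sampled points independently is what preserves the smooth positional prior: the queries always describe a plausible text center line.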

Noised Content Queries Generation – Instead of using coarse class labels, the method uses actual character tokens. To align content with position, a Mask Sliding Character (MSC) technique is introduced: characters are duplicated across spatial locations, randomly masked, and optionally flipped, then embedded to form the content queries. This encourages the model to learn flexible alignments and mitigates bias.
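A minimal sketch of the MSC idea at the token level is given below. The vocabulary, mask token, and parameter names (`mask_ratio`, `flip_prob`) are hypothetical; the actual model works with learned character embeddings rather than raw token ids.

```python
import numpy as np

VOCAB = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
MASK_ID = len(VOCAB)  # hypothetical [MASK] token id

def msc_content_tokens(text, T=25, mask_ratio=0.3, flip_prob=0.5, rng=None):
    """Mask Sliding Character sketch: tile the text's characters across
    the T query slots, optionally flip the order, and randomly mask a
    fraction. The token ids would then be embedded as content queries."""
    rng = rng or np.random.default_rng()
    ids = [VOCAB[c] for c in text]
    if rng.random() < flip_prob:
        ids = ids[::-1]                                # optional flip
    tiled = [ids[i % len(ids)] for i in range(T)]      # slide/duplicate across slots
    tokens = np.array(tiled)
    tokens[rng.random(T) < mask_ratio] = MASK_ID       # random masking
    return tokens
```

Because each character appears at several (randomly masked) slots, the model cannot rely on a fixed character-to-position correspondence, which is what encourages the flexible alignment described above.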

Decoder – Matching Part queries and Denoising Part queries are concatenated before entering the decoder, whose output feeds multiple task-specific heads inspired by DeepSolo and DINO. During inference, the Denoising Part is removed, so the scheme incurs no extra computational cost.
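One detail worth noting when the two parts share a decoder: in DN-DETR-style training, a self-attention mask keeps the Matching Part from seeing the denoising queries (which carry ground-truth information) and keeps denoising groups isolated from each other. A sketch of such a mask, under the assumption that DNTextSpotter follows the DN-DETR/DINO convention:

```python
import numpy as np

def denoising_attn_mask(n_match, n_dn, n_groups):
    """Build a self-attention mask (True = attention blocked) over
    [matching queries | group 0 | group 1 | ...], in the DN-DETR style."""
    n = n_match + n_dn * n_groups
    mask = np.zeros((n, n), dtype=bool)
    # matching queries must never attend to denoising queries,
    # otherwise ground-truth information leaks into matching
    mask[:n_match, n_match:] = True
    for g in range(n_groups):
        s = n_match + g * n_dn
        # each denoising group may see only itself (and the matching part)
        mask[s:s + n_dn, n_match:] = True
        mask[s:s + n_dn, s:s + n_dn] = False
    return mask
```

At inference time the denoising columns and rows simply do not exist, which is why removing the Denoising Part leaves the model unchanged.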

Experiments

On public benchmarks (Total‑Text and CTW1500), DNTextSpotter outperforms the baseline DeepSolo by 2.0% and 2.8% respectively on the “None” metric, and achieves an 8.9% F1 improvement on Inverse‑Text without extra rotation data. Ablation studies confirm the benefits of adding noise to Bézier control points, the MSC alignment, and the additional background loss.

Visualization

Qualitative results on Inverse‑Text and other datasets show DNTextSpotter handling difficult cases better than previous SOTA models (ESTextSpotter, DeepSolo).

Practical Applications

The denoising training incurs no inference overhead, making it suitable for low‑quality video/image processing, harsh‑environment text recognition, and other Bilibili platform scenarios where robust text spotting is needed.

Conclusion & Outlook

The paper presents a novel denoising training scheme that can be extended to other DETR‑based recognizers and possibly to tasks beyond text spotting. It also suggests that denoising may help alleviate long‑tail recognition problems.

References

[1] Carion et al. End-to-End Object Detection with Transformers. ECCV 2020.
[2] Ye et al. DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting. CVPR 2023.
[3] Ye et al. DPText-DETR: Towards Better Scene Text Detection with Dynamic Points. AAAI 2023.
[4] Li et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. CVPR 2022.
[5] Graves et al. Connectionist Temporal Classification. ICML 2006.
[6] Zhang et al. DINO: DETR with Improved DeNoising Anchor Boxes. arXiv 2022.
[7] Xie et al. DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training. arXiv 2024.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
