How ContourNet and CenterNet Revolutionize Text Detection

This article explains the challenges of scene text detection and introduces two state‑of‑the‑art models, ContourNet and CenterNet, detailing their architectural innovations, loss functions, and how they overcome issues like extreme aspect ratios and anchor‑based inefficiencies.

TiPaiPai Technical Team

In a recent technical reading session we revisited the origins and difficulties of scene text detection and then focused on two advanced deep‑learning models: ContourNet and CenterNet.

ContourNet

Presented at CVPR 2020, ContourNet addresses two major problems: interference from similar textures that cause oversized bounding boxes, and the extreme width‑to‑height ratios of text instances. It introduces two key innovations:

Adaptive‑RPN: an extension of the traditional Region Proposal Network that better adapts to the varied aspect ratios of text.

Local Orthogonal Texture‑aware Module (LOTM): a dual‑branch module that extracts horizontal and vertical texture features and refines them with a Point Re‑scoring Algorithm, ensuring only points with strong responses in both directions are kept.
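As a rough illustration of the re‑scoring idea (not the paper's exact formula): assuming the two LOTM branches emit sigmoid response maps, a point survives only if the product of its horizontal and vertical responses is strong. A minimal NumPy sketch with hypothetical inputs:

```python
import numpy as np

def rescore_points(h_map, v_map, thresh=0.5):
    """Keep only points that respond strongly in BOTH orthogonal
    directions, suppressing false positives from one-directional
    texture stripes. h_map, v_map: (H, W) responses in [0, 1]."""
    score = h_map * v_map  # joint score: high only where both branches agree
    ys, xs = np.where(score > thresh)
    return list(zip(ys.tolist(), xs.tolist())), score

# toy 3x3 maps: only the centre pixel is strong in both directions
h = np.array([[0.9, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.1]])
v = np.array([[0.1, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9]])
points, _ = rescore_points(h, v)
```

The pixel strong only horizontally (top‑left) and the one strong only vertically (bottom‑right) are both suppressed; only the centre, strong in both maps, survives.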

ContourNet also redefines the bounding box representation from a center point plus width/height to a center point plus eight boundary points, and replaces the Smooth L1 loss with an IoU‑based loss for greater scale invariance.
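The scale‑invariance argument can be made concrete: with an IoU‑based loss such as -log(IoU) (one common form; the paper's exact variant may differ), a 10% boundary error costs the same for a small box as for a large one, whereas Smooth L1 on raw coordinates grows with box size. A small sketch:

```python
import math

def iou_loss(pred, gt):
    """-log(IoU) loss for axis-aligned boxes (x1, y1, x2, y2).
    Unlike Smooth L1 on coordinates, the loss depends only on the
    overlap ratio, so it is invariant to box scale."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return -math.log(inter / (area_p + area_g - inter))

# the same 10% relative error yields the same loss at both scales
small = iou_loss((0, 0, 10, 10), (0, 0, 11, 11))
large = iou_loss((0, 0, 100, 100), (0, 0, 110, 110))
```

A Smooth L1 loss on the same two cases would be roughly 10× larger for the big box, which is exactly the bias an IoU loss removes.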

Figure 2 shows the overall network: an FPN backbone extracts multi‑scale features, which are fed into the Adaptive‑RPN, followed by the LOTM and regression heads.

CenterNet

CenterNet adopts an anchor‑free, keypoint‑based detection paradigm. Instead of enumerating anchor boxes, it predicts a heatmap of object centers and directly regresses offset, width, and height for each center.
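A minimal sketch of how such anchor‑free outputs might be decoded into boxes, assuming a single‑class heatmap plus per‑pixel size and offset maps (the shapes and the 3×3 peak picking that stands in for NMS are illustrative, not the official implementation):

```python
import numpy as np

def decode_centers(heatmap, wh, offset, k=1, stride=4):
    """Turn a centre heatmap plus per-pixel (w, h) and sub-pixel offset
    predictions into boxes -- no anchors involved.
    heatmap: (H, W); wh, offset: (H, W, 2)."""
    H, W = heatmap.shape
    # 3x3 peak picking stands in for NMS: keep local maxima only
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    local_max = np.max(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=0)
    peaks = np.where(heatmap == local_max, heatmap, 0.0)
    order = np.argsort(peaks.ravel())[::-1][:k]
    boxes = []
    for i in order:
        y, x = divmod(int(i), W)
        cx = (x + offset[y, x, 0]) * stride  # refine to image coordinates
        cy = (y + offset[y, x, 1]) * stride
        w, h = wh[y, x]
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# toy example: one peak at cell (y=1, x=2) on a 4x4 stride-4 map
hm = np.zeros((4, 4)); hm[1, 2] = 0.9
wh = np.zeros((4, 4, 2)); wh[1, 2] = (8, 4)
off = np.zeros((4, 4, 2)); off[1, 2] = (0.5, 0.5)
boxes = decode_centers(hm, wh, off, k=1)
```

Because detections are read straight off heatmap peaks, there is no anchor enumeration and no anchor/ground‑truth matching step at all.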

The heatmap is generated by mapping object centers onto a low‑resolution feature map (down‑sample factor 4) and applying a Gaussian kernel, yielding a value of 1 at the exact center and decreasing values outward.
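The ground‑truth splatting can be sketched as follows; note that the real implementation derives the Gaussian radius from the object's size, whereas sigma is fixed here for illustration:

```python
import numpy as np

def draw_gaussian(heatmap, cx, cy, sigma):
    """Splat a Gaussian peak at feature-map cell (cx, cy): value 1.0 at
    the centre, decreasing outward; overlapping objects keep the max."""
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)
    return heatmap

# an object centre at image pixel (200, 120) lands at cell (50, 30)
# on the stride-4 (down-sample factor 4) feature map
hm = np.zeros((128, 128))
draw_gaussian(hm, cx=200 // 4, cy=120 // 4, sigma=2.0)
```

Taking the element‑wise max (rather than summing) keeps every value in [0, 1] even when two objects' Gaussians overlap.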

CenterNet’s loss combines a focal loss variant for the heatmap, an offset loss, and a width‑height loss, enabling balanced learning of easy and hard samples.
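The heatmap term is a penalty‑reduced focal loss (alpha = 2, beta = 4 in the paper): confident positives are down‑weighted, and negatives near a centre are softened by the Gaussian ground truth. A NumPy sketch:

```python
import numpy as np

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss on the centre heatmap.
    pred, gt: (H, W); gt is 1 exactly at centres and a Gaussian
    elsewhere, so (1 - gt)**beta reduces the penalty near centres."""
    pred = np.clip(pred, eps, 1 - eps)
    pos = gt == 1
    pos_loss = ((1 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_loss = ((1 - gt) ** beta * pred ** alpha * np.log(1 - pred))[~pos].sum()
    n_pos = max(pos.sum(), 1)  # normalise by the number of objects
    return -(pos_loss + neg_loss) / n_pos

# a prediction that fires on the true centre scores far better
gt = np.zeros((4, 4)); gt[1, 1] = 1.0
good = np.full((4, 4), 0.01); good[1, 1] = 0.99
bad = np.full((4, 4), 0.01)          # misses the centre entirely
loss_good = heatmap_focal_loss(good, gt)
loss_bad = heatmap_focal_loss(bad, gt)
```

The `(1 - pred) ** alpha` factor is what balances easy and hard samples: pixels the network already gets right contribute almost nothing to the gradient.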

Beyond 2‑D object detection, the same architecture can be adapted for 3‑D vision and pose estimation by swapping the backbone.

Figure 9 illustrates the full pipeline: a 512×512×3 image is processed by a ResNet backbone, up‑sampled to a 128×128 feature map, and then split into three branches for heatmap, center offset, and size predictions.
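Under the figure's numbers, the three branch output shapes can be sanity‑checked (the channel layout and class count are assumptions for illustration):

```python
import numpy as np

C = 80                            # number of object classes (assumed)
H = W = 512 // 4                  # stride-4 feature map: 128 x 128
heatmap = np.zeros((C, H, W))     # one centre-heatmap channel per class
offset = np.zeros((2, H, W))      # sub-pixel (dx, dy) centre correction
size = np.zeros((2, H, W))        # per-centre (w, h) regression
```

Each cell of the 128×128 map thus carries C + 4 predictions, and every detection is read from a single cell.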

Tags: computer vision, deep learning, object detection, text detection, CenterNet, ContourNet
Written by TiPaiPai Technical Team

At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.
