How ContourNet and CenterNet Revolutionize Text Detection
This article explains the challenges of scene text detection and introduces two state‑of‑the‑art models, ContourNet and CenterNet, detailing their architectural innovations, loss functions, and how they overcome issues like extreme aspect ratios and anchor‑based inefficiencies.
In a recent technical reading session we revisited the origins and difficulties of scene text detection and then focused on two advanced deep‑learning models: ContourNet and CenterNet.
ContourNet
Presented at CVPR 2020, ContourNet addresses two major problems: interference from similar textures that cause oversized bounding boxes, and the extreme width‑to‑height ratios of text instances. It introduces two key innovations:
Adaptive‑RPN: an extension of the traditional Region Proposal Network that better adapts to the varied aspect ratios of text.
Local Orthogonal Texture‑aware Module (LOTM): a dual‑branch module that extracts horizontal and vertical texture features and refines them with a Point Re‑scoring Algorithm, keeping only points with strong responses in both directions (a minimal sketch of this step follows the list).
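To make the re‑scoring idea concrete, here is a minimal sketch in PyTorch. It assumes the two LOTM branches output (H, W) response maps in [0, 1] and keeps a point only if both maps respond strongly; the paper's exact algorithm also applies direction‑wise NMS, which is omitted here for brevity.

```python
import torch

def rescore_points(h_map: torch.Tensor, v_map: torch.Tensor, thresh: float = 0.5):
    """Keep only points that respond strongly in BOTH texture directions.

    h_map, v_map: (H, W) response maps from the two LOTM branches.
    Returns the (row, col) coordinates of the surviving contour points.
    Simplified sketch: the paper additionally applies direction-wise NMS
    before thresholding.
    """
    keep = (h_map > thresh) & (v_map > thresh)   # strong in both directions
    return keep.nonzero(as_tuple=False)          # (N, 2) point coordinates
```

A response caused by a text‑like background texture usually fires strongly in only one direction, so requiring both responses suppresses exactly the false positives described above.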
ContourNet also redefines the bounding box representation from a center point plus width/height to a center point plus eight boundary points, and replaces the Smooth L1 loss with an IoU‑based loss for greater scale invariance.
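The motivation for the IoU‑based loss is easiest to see in code. The sketch below uses the common 1 − IoU formulation for axis‑aligned boxes; ContourNet applies the same idea to its proposal regression, and the paper's exact formulation may differ.

```python
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """IoU-based box regression loss: 1 - IoU.

    pred, target: (N, 4) boxes as (x1, y1, x2, y2).
    Unlike Smooth L1 on raw coordinates, the loss depends only on overlap,
    so long text lines and tiny words contribute on the same scale.
    """
    # Intersection rectangle
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter

    return (1.0 - inter / (union + eps)).mean()
```

Because the loss depends only on relative overlap, a 10‑pixel error on a very long text line is penalized far less than the same error on a tiny word, a distinction Smooth L1 on raw coordinates cannot make.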
Figure 2 shows the overall network: an FPN backbone extracts multi‑scale features, which are fed into the Adaptive‑RPN, followed by the LOTM and regression heads.
CenterNet
CenterNet adopts an anchor‑free, keypoint‑based detection paradigm. Instead of enumerating anchor boxes, it predicts a heatmap of object centers and directly regresses offset, width, and height for each center.
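At inference time, detections are read off the heatmap by treating local maxima as keypoints. The pooling‑based NMS below is the trick used in common CenterNet implementations; the shapes and top‑k value are illustrative.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, k: int = 100):
    """Extract the top-k center keypoints from a (B, C, H, W) class heatmap.

    A location survives only if it is the maximum of its 3x3 neighbourhood
    (a cheap pooling-based NMS); the k highest-scoring survivors are kept.
    Returns scores, class ids, and (x, y) grid coordinates, each (B, k).
    """
    b, c, h, w = heatmap.shape
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()   # zero out non-maxima
    scores, idx = peaks.view(b, -1).topk(k)         # over all classes and cells
    cls = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w
    return scores, cls, xs, ys
```

No anchor enumeration and no IoU-based NMS are needed: the max‑pool comparison does the suppression in a single pass.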
The heatmap is generated by mapping object centers onto a low‑resolution feature map (down‑sample factor 4) and applying a Gaussian kernel, yielding a value of 1 at the exact center and decreasing values outward.
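Building that ground‑truth heatmap takes only a few lines of NumPy. The fixed sigma below is an assumption for brevity; in practice it is derived from the object's size.

```python
import numpy as np

def draw_center(heatmap: np.ndarray, cx: int, cy: int, sigma: float):
    """Splat one object center onto an (H, W) heatmap with a Gaussian.

    cx, cy are center coordinates already divided by the down-sample
    factor (4 here). Overlapping objects keep the element-wise maximum.
    """
    h, w = heatmap.shape
    ys, xs = np.ogrid[:h, :w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)   # value is exactly 1 at the center

# Example: one object centered at pixel (210, 120) of a 512x512 image.
hm = np.zeros((128, 128), dtype=np.float32)
draw_center(hm, 210 // 4, 120 // 4, sigma=2.0)
```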
CenterNet’s loss combines a focal loss variant for the heatmap, an offset loss, and a width‑height loss, enabling balanced learning of easy and hard samples.
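A sketch of the heatmap term, the penalty‑reduced focal loss with the paper's defaults α = 2 and β = 4. Pixels near (but not exactly at) a center sit under the Gaussian, so the (1 − gt)^β factor softens their penalty; that is what balances easy and hard samples:

```python
import torch

def center_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                      alpha: int = 2, beta: int = 4):
    """Penalty-reduced focal loss on the center heatmap.

    pred: predicted heatmap after sigmoid; gt: Gaussian ground truth.
    Pixels near (but not at) a center are down-weighted by (1 - gt)**beta,
    so the model is not punished hard for firing close to a true center.
    """
    pred = pred.clamp(1e-6, 1 - 1e-6)          # numerical safety for log
    pos = gt.eq(1).float()                     # exact centers
    neg = 1.0 - pos

    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_loss = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred)

    num_pos = pos.sum().clamp(min=1)           # avoid divide-by-zero
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```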
Beyond 2‑D object detection, the same architecture can be adapted for 3‑D vision and pose estimation by adding or swapping the prediction heads while keeping the same backbone and center‑point formulation.
Figure 9 illustrates the full pipeline: a 512×512×3 image is processed by a ResNet backbone, up‑sampled to a 128×128 feature map, and then split into three branches for heatmap, center offset, and size predictions.
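A minimal sketch of those three branches, assuming a 64‑channel feature map; the intermediate channel count and the 3×3/1×1 conv pattern are assumptions, not a verbatim copy of the reference implementation.

```python
import torch
import torch.nn as nn

class CenterHeads(nn.Module):
    """Three parallel prediction heads over the 128x128 feature map.

    Channel counts follow the description above: num_classes heatmaps,
    a 2-channel center offset, and a 2-channel width/height map.
    """
    def __init__(self, in_ch: int = 64, num_classes: int = 1):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, out_ch, 1))
        self.heatmap = head(num_classes)   # where the centers are
        self.offset = head(2)              # sub-pixel correction for downsampling
        self.size = head(2)                # width and height at each center

    def forward(self, feat):
        return torch.sigmoid(self.heatmap(feat)), self.offset(feat), self.size(feat)

# feat: backbone output up-sampled to (B, 64, 128, 128)
feat = torch.randn(1, 64, 128, 128)
hm, off, wh = CenterHeads()(feat)   # (1,1,128,128), (1,2,128,128), (1,2,128,128)
```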
TiPaiPai Technical Team
At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.