How Graph Neural Networks Revolutionize Arbitrary‑Shaped Text Detection

This article reviews two recent computer‑vision approaches—DRRG and STKM—that combine CNN backbones with graph‑based relational reasoning and self‑attention to achieve state‑of‑the‑art detection of arbitrarily shaped text in images.

TiPaiPai Technical Team

Part 1: DRRG – Deep Relational Reasoning Graph Network

The paper proposes an end‑to‑end trainable framework for detecting text of arbitrary shapes by integrating a CNN‑based text proposal network with a Graph Convolutional Network (GCN) for relational reasoning.

Key contributions include:

First use of a GCN to perform deep relational reasoning among text components.

State‑of‑the‑art performance on both polygon‑annotated and quadrilateral‑annotated benchmarks.

Text detection methods are categorized into regression‑based, segmentation‑based, and connected‑component (CC)‑based approaches. CC‑based methods require modeling relationships between isolated character components, which prior works handle with predefined rules or embedding mappings that are not robust for non‑Euclidean data.

DRRG addresses this by converting text components into graph nodes and applying relational reasoning:

Network architecture: shared convolution (VGG‑16 + FPN), text component prediction, local graph construction, relational reasoning, and component merging.

DRRG overall architecture

The shared convolution extracts multi‑scale features; the text component predictor generates text region (TR) and text center region (TCR) proposals. Losses consist of classification (cross‑entropy) and regression (smooth L1) terms, with separate terms for TR, TCR‑positive, and TCR‑negative pixels.

Loss components
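The loss structure described above can be sketched as follows. This is a minimal NumPy illustration of the two ingredient losses (binary cross‑entropy for the TR/TCR classification maps, smooth L1 for geometry regression) combined into a weighted sum; the maps, shapes, and the weight of 1.0 are placeholders, not the paper's actual values.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss, used for the geometry regression terms."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def cross_entropy(prob_pos, label):
    """Binary cross-entropy, used for the TR / TCR classification maps."""
    eps = 1e-7
    p = np.clip(prob_pos, eps, 1 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()

# Hypothetical per-pixel maps for a tiny 4x4 image.
rng = np.random.default_rng(0)
tr_prob  = rng.uniform(0.01, 0.99, (4, 4))           # predicted text-region probability
tr_label = (rng.uniform(size=(4, 4)) > 0.5).astype(float)
geom_pred, geom_gt = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

# A weighted sum mirrors the paper's structure; the weight here is a placeholder.
total = cross_entropy(tr_prob, tr_label) + 1.0 * smooth_l1(geom_pred, geom_gt)
print(round(float(total), 4))
```

In the actual network, separate cross‑entropy terms are computed over the TR, TCR‑positive, and TCR‑negative pixel sets before being summed.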

Local graphs are built only for neighboring components to keep computation efficient. For each pivot node, a subgraph (Instance Pivot Subgraph, IPS) is constructed using k‑hop neighbors (k1=10 for 1‑hop, k2=2 for 2‑hop). Edge weights are computed from Euclidean distances normalized by image dimensions.

IPS construction
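A hedged sketch of the IPS construction: each text component is reduced to a center point, the pivot's k1 nearest components form the 1‑hop set, and each 1‑hop node contributes its own k2 nearest components as 2‑hop nodes. The paper normalizes edge weights by image dimensions; the image diagonal below is a stand‑in for that normalization, and all names are illustrative.

```python
import numpy as np

def knn(dist_row, k, exclude):
    """Indices of the k nearest nodes by distance, skipping `exclude`."""
    order = [i for i in np.argsort(dist_row) if i not in exclude]
    return order[:k]

def build_ips(coords, pivot, k1=10, k2=2, img_diag=1.0):
    """Build an Instance Pivot Subgraph: pivot + 1-hop + 2-hop neighbors.

    Edge weights are Euclidean distances normalized by the image diagonal
    (a stand-in for the paper's normalization by image dimensions)."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    one_hop = knn(d[pivot], k1, {pivot})
    two_hop = set()
    for n in one_hop:
        two_hop.update(knn(d[n], k2, {pivot, n}))
    nodes = [pivot] + one_hop + sorted(two_hop - set(one_hop))
    # Edge weights for edges incident to the pivot (1 - normalized distance).
    weights = {(pivot, n): 1.0 - d[pivot, n] / img_diag for n in one_hop}
    return nodes, weights

# Toy component centers in a unit-square image.
rng = np.random.default_rng(1)
coords = rng.uniform(size=(30, 2))
nodes, weights = build_ips(coords, pivot=0, k1=10, k2=2, img_diag=np.sqrt(2))
print(len(nodes), len(weights))
```

Building one such subgraph per pivot keeps the adjacency matrices small, which is what makes the subsequent graph convolutions cheap.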

The relational reasoning network takes the feature matrix X (combining RROI‑Align features and geometric features) and adjacency matrix A (connecting each node to its u=3 nearest neighbors) and applies four stacked GCN layers with ReLU activation:

GCN layer formula
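As a minimal sketch of what one such layer computes, here is the widely used symmetric‑normalized graph‑convolution step, H' = ReLU(D̂^(-1/2)(A + I)D̂^(-1/2) H W), applied twice on toy data. DRRG's exact layer definition follows the paper and may differ in its normalization and feature aggregation details; the shapes and weights below are illustrative only.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

rng = np.random.default_rng(2)
n, f_in, f_out = 5, 8, 4
X = rng.normal(size=(n, f_in))               # node features (RROI + geometry)
A = (rng.uniform(size=(n, n)) > 0.6).astype(float)
A = np.maximum(A, A.T)                       # make adjacency symmetric
np.fill_diagonal(A, 0.0)                     # self-loops are added inside the layer
H = X
for W in [rng.normal(size=(f_in, f_in)), rng.normal(size=(f_in, f_out))]:
    H = gcn_layer(H, A, W)                   # DRRG stacks four such layers
print(H.shape)
```

The output features for each pair of nodes are then classified as linked or not, which is what the merging step consumes.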

After reasoning, a BFS clustering merges components into final text instances. Experiments on multiple benchmarks show accurate detection of arbitrary‑shaped text and competitive performance compared with SOTA methods.

Experimental results
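The BFS merging step amounts to finding connected components over the links the GCN predicts. A self‑contained sketch, where the `links` set stands in for the GCN's positive link classifications:

```python
from collections import deque

def bfs_merge(n, links):
    """Group n text components into instances by BFS over predicted links.

    `links` is a set of (i, j) pairs classified as belonging to the
    same text instance (a stand-in for the GCN's output)."""
    adj = [[] for _ in range(n)]
    for i, j in links:
        adj[i].append(j)
        adj[j].append(i)
    seen, instances = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        instances.append(sorted(comp))
    return instances

# Toy example: 6 components, links yield two multi-component instances
# plus one isolated component.
print(bfs_merge(6, {(0, 1), (1, 2), (3, 4)}))
# → [[0, 1, 2], [3, 4], [5]]
```

Each resulting group of components is then fitted with a polygon to produce the final arbitrary‑shaped text instance.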

Part 2: STKM – Self‑Attention based Text Knowledge Mining

STKM introduces the first generic pre‑training model for text detection. While ImageNet pre‑training captures visual semantics, it lacks textual knowledge; STKM injects text‑specific priors via a self‑attention decoder.

The architecture consists of a CNN encoder (ResNet + FPN) and a self‑attention decoder with four identical layers, each containing multi‑head self‑attention, multi‑head cross‑attention, and a feed‑forward network.
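The core operation inside both the self‑attention and cross‑attention blocks is scaled dot‑product attention, softmax(QKᵀ/√d)V. A single‑head NumPy sketch of the distinction (the real decoder adds learned projections, multiple heads, residual connections, and layer normalization; the shapes below are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(3)
enc = rng.normal(size=(12, 16))   # flattened encoder features (HW x C)
dec = rng.normal(size=(5, 16))    # decoder query states

# Self-attention: Q, K, V all come from the decoder states.
self_out = attention(dec, dec, dec)
# Cross-attention: queries from the decoder, keys/values from the encoder.
cross_out = attention(self_out, enc, enc)
print(self_out.shape, cross_out.shape)
```

Cross‑attention is what lets the decoder's queries attend to text‑relevant locations in the encoder's feature map, which is how textual knowledge gets injected back into the backbone during pre‑training.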

To preserve spatial information lost after flattening the encoder output, an Adaptive Spatial Position Encoding Module (ASPM) learns positional encodings through convolutions and adds them to the features.

ASPM module

Experimental results on ICDAR2015 demonstrate that replacing the backbone of the EAST detector with the STKM‑pre‑trained model significantly improves detection quality, as shown by heat‑map visualizations and side‑by‑side comparisons.

STKM detection results

Both DRRG and STKM illustrate how graph‑based relational reasoning and self‑attention mechanisms can be leveraged to advance arbitrary‑shaped text detection, offering valuable design patterns for future computer‑vision research.

Tags: CNN, computer vision, deep learning, graph neural network, GCN, text detection, self-attention
Written by TiPaiPai Technical Team

At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.