How Graph Neural Networks Revolutionize Arbitrary‑Shaped Text Detection

This article reviews two recent computer‑vision approaches—DRRG and STKM—that combine CNN backbones with graph‑based relational reasoning and self‑attention to achieve state‑of‑the‑art detection of arbitrarily shaped text in images.

TiPaiPai Technical Team

Part 1: DRRG – Deep Relational Reasoning Graph Network

The paper proposes an end‑to‑end trainable framework for detecting text of arbitrary shapes by integrating a CNN‑based text proposal network with a Graph Convolutional Network (GCN) for relational reasoning.

Key contributions include:

First use of a GCN to perform deep relational reasoning among text components.

State‑of‑the‑art performance on both polygon‑annotated and quadrilateral‑annotated benchmarks.

Text detection methods are categorized into regression‑based, segmentation‑based, and connected‑component (CC)‑based approaches. CC‑based methods require modeling relationships between isolated character components, which prior works handle with predefined rules or embedding mappings that are not robust for non‑Euclidean data.

DRRG addresses this by converting text components into graph nodes and applying relational reasoning:

Network architecture: shared convolution (VGG‑16 + FPN), text component prediction, local graph construction, relational reasoning, and component merging.

DRRG overall architecture

The shared convolution extracts multi‑scale features; the text component predictor generates text region (TR) and text center region (TCR) proposals. Losses consist of classification (cross‑entropy) and regression (smooth L1) terms, with separate terms for TR, TCR‑positive, and TCR‑negative pixels.

Loss components
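The loss structure described above can be sketched as follows. This is a minimal NumPy illustration of the two ingredient losses (binary cross‑entropy for the TR/TCR classification maps, smooth L1 for geometry regression) combined into a weighted sum; the maps, shapes, and the weight of 1.0 are placeholders, not the paper's actual values.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss, used for the geometry regression terms."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def cross_entropy(prob_pos, label):
    """Binary cross-entropy, used for the TR / TCR classification maps."""
    eps = 1e-7
    p = np.clip(prob_pos, eps, 1 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()

# Hypothetical per-pixel maps for a tiny 4x4 image.
rng = np.random.default_rng(0)
tr_prob  = rng.uniform(0.01, 0.99, (4, 4))           # predicted text-region probability
tr_label = (rng.uniform(size=(4, 4)) > 0.5).astype(float)
geom_pred, geom_gt = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

# A weighted sum mirrors the paper's structure; the weight here is a placeholder.
total = cross_entropy(tr_prob, tr_label) + 1.0 * smooth_l1(geom_pred, geom_gt)
print(round(float(total), 4))
```

In the actual network, separate cross‑entropy terms are computed over the TR, TCR‑positive, and TCR‑negative pixel sets before being summed.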

Local graphs are built only for neighboring components to keep computation efficient. For each pivot node, a subgraph (Instance Pivot Subgraph, IPS) is constructed using k‑hop neighbors (k1=10 for 1‑hop, k2=2 for 2‑hop). Edge weights are computed from Euclidean distances normalized by image dimensions.

IPS construction
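A hedged sketch of the IPS construction: each text component is reduced to a center point, the pivot's k1 nearest components form the 1‑hop set, and each 1‑hop node contributes its own k2 nearest components as 2‑hop nodes. The paper normalizes edge weights by image dimensions; the image diagonal below is a stand‑in for that normalization, and all names are illustrative.

```python
import numpy as np

def knn(dist_row, k, exclude):
    """Indices of the k nearest nodes by distance, skipping `exclude`."""
    order = [i for i in np.argsort(dist_row) if i not in exclude]
    return order[:k]

def build_ips(coords, pivot, k1=10, k2=2, img_diag=1.0):
    """Build an Instance Pivot Subgraph: pivot + 1-hop + 2-hop neighbors.

    Edge weights are Euclidean distances normalized by the image diagonal
    (a stand-in for the paper's normalization by image dimensions)."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    one_hop = knn(d[pivot], k1, {pivot})
    two_hop = set()
    for n in one_hop:
        two_hop.update(knn(d[n], k2, {pivot, n}))
    nodes = [pivot] + one_hop + sorted(two_hop - set(one_hop))
    # Edge weights for edges incident to the pivot (1 - normalized distance).
    weights = {(pivot, n): 1.0 - d[pivot, n] / img_diag for n in one_hop}
    return nodes, weights

# Toy component centers in a unit-square image.
rng = np.random.default_rng(1)
coords = rng.uniform(size=(30, 2))
nodes, weights = build_ips(coords, pivot=0, k1=10, k2=2, img_diag=np.sqrt(2))
print(len(nodes), len(weights))
```

Building one such subgraph per pivot keeps the adjacency matrices small, which is what makes the subsequent graph convolutions cheap.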

The relational reasoning network takes the feature matrix X (combining RROI‑Align features and geometric features) and adjacency matrix A (connecting each node to its u=3 nearest neighbors) and applies four stacked GCN layers with ReLU activation:

GCN layer formula
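As a minimal sketch of what one such layer computes, here is the widely used symmetric‑normalized graph‑convolution step, H' = ReLU(D̂^(-1/2)(A + I)D̂^(-1/2) H W), applied twice on toy data. DRRG's exact layer definition follows the paper and may differ in its normalization and feature aggregation details; the shapes and weights below are illustrative only.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

rng = np.random.default_rng(2)
n, f_in, f_out = 5, 8, 4
X = rng.normal(size=(n, f_in))               # node features (RROI + geometry)
A = (rng.uniform(size=(n, n)) > 0.6).astype(float)
A = np.maximum(A, A.T)                       # make adjacency symmetric
np.fill_diagonal(A, 0.0)                     # self-loops are added inside the layer
H = X
for W in [rng.normal(size=(f_in, f_in)), rng.normal(size=(f_in, f_out))]:
    H = gcn_layer(H, A, W)                   # DRRG stacks four such layers
print(H.shape)
```

The output features for each pair of nodes are then classified as linked or not, which is what the merging step consumes.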

After reasoning, a BFS clustering merges components into final text instances. Experiments on multiple benchmarks show accurate detection of arbitrary‑shaped text and competitive performance compared with SOTA methods.

Experimental results
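The BFS merging step amounts to finding connected components over the links the GCN predicts. A self‑contained sketch, where the `links` set stands in for the GCN's positive link classifications:

```python
from collections import deque

def bfs_merge(n, links):
    """Group n text components into instances by BFS over predicted links.

    `links` is a set of (i, j) pairs classified as belonging to the
    same text instance (a stand-in for the GCN's output)."""
    adj = [[] for _ in range(n)]
    for i, j in links:
        adj[i].append(j)
        adj[j].append(i)
    seen, instances = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        instances.append(sorted(comp))
    return instances

# Toy example: 6 components, links yield two multi-component instances
# plus one isolated component.
print(bfs_merge(6, {(0, 1), (1, 2), (3, 4)}))
# → [[0, 1, 2], [3, 4], [5]]
```

Each resulting group of components is then fitted with a polygon to produce the final arbitrary‑shaped text instance.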

Part 2: STKM – Self‑Attention based Text Knowledge Mining

STKM introduces the first generic pre‑training model for text detection. While ImageNet pre‑training captures visual semantics, it lacks textual knowledge; STKM injects text‑specific priors via a self‑attention decoder.

The architecture consists of a CNN encoder (ResNet + FPN) and a self‑attention decoder with four identical layers, each containing multi‑head self‑attention, multi‑head cross‑attention, and a feed‑forward network.
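The core operation inside both the self‑attention and cross‑attention blocks is scaled dot‑product attention, softmax(QKᵀ/√d)V. A single‑head NumPy sketch of the distinction (the real decoder adds learned projections, multiple heads, residual connections, and layer normalization; the shapes below are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(3)
enc = rng.normal(size=(12, 16))   # flattened encoder features (HW x C)
dec = rng.normal(size=(5, 16))    # decoder query states

# Self-attention: Q, K, V all come from the decoder states.
self_out = attention(dec, dec, dec)
# Cross-attention: queries from the decoder, keys/values from the encoder.
cross_out = attention(self_out, enc, enc)
print(self_out.shape, cross_out.shape)
```

Cross‑attention is what lets the decoder's queries attend to text‑relevant locations in the encoder's feature map, which is how textual knowledge gets injected back into the backbone during pre‑training.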

To preserve spatial information lost after flattening the encoder output, an Adaptive Spatial Position Encoding Module (ASPM) learns positional encodings through convolutions and adds them to the features.

ASPM module

Experimental results on ICDAR2015 demonstrate that replacing the backbone of the EAST detector with the STKM‑pre‑trained model significantly improves detection quality, as shown by heat‑map visualizations and side‑by‑side comparisons.

STKM detection results

Both DRRG and STKM illustrate how graph‑based relational reasoning and self‑attention mechanisms can be leveraged to advance arbitrary‑shaped text detection, offering valuable design patterns for future computer‑vision research.

Tags: CNN, computer vision, deep learning, graph neural network, GCN, text detection, self-attention
Written by TiPaiPai Technical Team

At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.