How Attention Boosts Text Recognition: From CNN‑Seq2Seq to Multi‑Scale Models

This article explains how attention mechanisms are applied to text recognition, covering the basic CNN‑Seq2Seq‑Attention architecture, multi‑scale attention extensions, and a 2D attentional irregular scene text recognizer with detailed network components, training loss, and experimental results.


Attention in Text Recognition

Presenter: Xie Jun (Junior Image Algorithm Engineer)

1. Attention for Text Recognition – Basic Model

(1) CNN+Seq2Seq+Attention Architecture

The model consists of an Encoder and a Decoder. The Encoder extracts features from a 3×32×280 input image with a CNN backbone, producing a 512×1×71 feature map, which is then passed through a BiLSTM to obtain a 256×71 sequence [e1, e2, …, eT].
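A minimal PyTorch-style sketch of such an encoder is shown below. The backbone here is a single placeholder convolution standing in for the real CNN, and the layer sizes are assumptions chosen only to reproduce the 3×32×280 → 512×1×71 → 256×71 shapes described above.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """CNN backbone + BiLSTM encoder (illustrative sketch, not the exact network)."""
    def __init__(self, hidden=128):
        super().__init__()
        # Placeholder for the real backbone: maps a 3x32x280 image to a 512x1x71 feature map.
        self.backbone = nn.Conv2d(3, 512, kernel_size=(32, 8), stride=(32, 4), padding=(0, 4))
        # Bidirectional LSTM over the width axis; 2 * hidden = 256-dimensional outputs.
        self.rnn = nn.LSTM(512, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                   # x: (B, 3, 32, 280)
        f = self.backbone(x)                # (B, 512, 1, 71)
        f = f.squeeze(2).permute(0, 2, 1)   # (B, 71, 512)
        e, _ = self.rnn(f)                  # (B, 71, 256) = [e1, e2, ..., eT]
        return e
```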

The Decoder iteratively decodes the sequence, generating one character at a time until termination. Attention is applied during decoding in three steps:

Compute the attention weights (weight) over the encoder outputs.

Use weight to compute a context vector c as the weighted sum of encoder outputs.

Combine c with the hidden state h and feed them into a GRU to produce the current prediction and update the hidden state.

The initial hidden state h0 is a learned initialization value.

The decoder operates in a loop whose length equals the text length: during training this is the maximum label length in the batch, while during inference a fixed maximum length is used.
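A minimal sketch of one such decoding step, assuming additive attention, a single GRU cell, and illustrative dimensions (the talk does not specify these exactly):

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """One attention + GRU decoding step (illustrative sketch)."""
    def __init__(self, enc_dim=256, hid_dim=256, emb_dim=128, vocab=6000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.score = nn.Linear(enc_dim + hid_dim, 1)   # additive attention scoring
        self.gru = nn.GRUCell(emb_dim + enc_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, prev_char, h, enc_outputs):
        # prev_char: (B,) previous character ids; h: (B, hid_dim); enc_outputs: (B, T, enc_dim)
        T = enc_outputs.size(1)
        # 1) attention weights over the encoder outputs
        h_exp = h.unsqueeze(1).expand(-1, T, -1)
        weight = torch.softmax(
            self.score(torch.cat([enc_outputs, h_exp], dim=-1)).squeeze(-1), dim=-1)
        # 2) context vector c = weighted sum of the encoder outputs
        c = torch.bmm(weight.unsqueeze(1), enc_outputs).squeeze(1)
        # 3) combine c with the previous character embedding, update the hidden state, predict
        h = self.gru(torch.cat([self.embed(prev_char), c], dim=-1), h)
        return self.out(h), h, weight
```

The loop then calls this step once per character position, starting from the learned h0 and feeding each step's prediction (or, during training, the ground-truth character) back in as prev_char.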

2. Multi‑Scale Attention Text Recognition

The model uses two feature maps extracted by DenseNet at resolutions H×W×C and 2H×2W×C to preserve details across scales.

During decoding, each scale’s feature map is attended to separately:

The input character and the hidden state are fed into a GRU to obtain pre. pre is then transformed and combined with each scale's encoder output via fully connected attention (fcatt) to produce two context vectors, one per scale.

The two context vectors are concatenated and, together with pre, are input to another GRU to generate the prediction and update the hidden state.
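A sketch of this two-scale decoding step under assumed shapes: fc_att stands in for the fully connected attention (fcatt) mentioned above, each feature map is assumed to be flattened to (B, H·W, C) before attention, and the exact wiring of pre into the second GRU is an interpretation of the description rather than the original code.

```python
import torch
import torch.nn as nn

class MultiScaleAttnStep(nn.Module):
    """One two-scale attention decoding step (illustrative sketch)."""
    def __init__(self, feat_dim=256, hid_dim=256, emb_dim=128, vocab=6000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.gru_pre = nn.GRUCell(emb_dim, hid_dim)        # input char + hidden state -> pre
        self.fc_att = nn.ModuleList(
            [nn.Linear(feat_dim + hid_dim, 1) for _ in range(2)])  # one attention per scale
        self.gru_out = nn.GRUCell(2 * feat_dim, hid_dim)   # contexts combined with pre
        self.out = nn.Linear(hid_dim, vocab)

    def attend(self, fc, feat, pre):
        # feat: (B, H*W, feat_dim) -- a flattened DenseNet feature map
        q = pre.unsqueeze(1).expand(-1, feat.size(1), -1)
        w = torch.softmax(fc(torch.cat([feat, q], dim=-1)).squeeze(-1), dim=-1)
        return torch.bmm(w.unsqueeze(1), feat).squeeze(1)  # context vector for this scale

    def forward(self, prev_char, h, feat_small, feat_large):
        pre = self.gru_pre(self.embed(prev_char), h)       # step 1: obtain pre
        c1 = self.attend(self.fc_att[0], feat_small, pre)  # attention on the HxW scale
        c2 = self.attend(self.fc_att[1], feat_large, pre)  # attention on the 2Hx2W scale
        h = self.gru_out(torch.cat([c1, c2], dim=-1), pre) # step 2: second GRU -> prediction state
        return self.out(h), h
```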

3. 2D Attentional Irregular Scene Text Recognizer

Background: Recognizing irregular scene text is challenging. Existing pipelines include image rectification, 2D spatial encoding, semantic segmentation with character masks, and encoder-decoder approaches. The presented method introduces a novel architecture.

Network Architecture: ResNet backbone + Relation Attention + Parallel Attention + Two-stage Decoder.

Relation Attention: Two Transformer layers that model global dependencies.
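In PyTorch terms, two standard Transformer encoder layers would suffice for this role; the width and head count below are assumptions, not values from the paper.

```python
import torch.nn as nn

# Relation Attention sketched as two standard Transformer encoder layers
# (d_model and nhead are illustrative assumptions).
relation_attention = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
# Applied to a (B, T, 512) sequence of features/glimpses to model global dependencies.
```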

Parallel Attention: Independent 2D attention for each time step, allowing the attention weights α_t to be computed in parallel without sequential dependence.

Two-stage Decoder: One branch decodes directly from the Parallel Attention output; the other first applies Relation Attention before decoding.
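One way to sketch Parallel Attention is with a learned query per output position, so the attention maps α_t for all steps are computed in a single pass over the flattened feature map; the names and shapes below are assumptions.

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """2D attention for all time steps computed in one pass (illustrative sketch)."""
    def __init__(self, feat_dim=512, max_len=25):
        super().__init__()
        # One learned query per output position; steps do not depend on each other.
        self.queries = nn.Parameter(torch.randn(max_len, feat_dim))

    def forward(self, feat):
        # feat: (B, H*W, feat_dim) -- the flattened 2D feature map from the backbone
        scores = torch.einsum('td,bnd->btn', self.queries, feat)  # (B, T, H*W)
        alpha = torch.softmax(scores, dim=-1)                     # all alpha_t in parallel
        glimpses = torch.bmm(alpha, feat)                         # (B, T, feat_dim)
        return glimpses, alpha
```

The resulting per-step glimpses can then feed the two decoder branches: one classifying them directly, the other passing them through Relation Attention first, as described above.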

Loss Function: Standard cross-entropy loss is applied to both decoder stages, and the two losses are summed during training.
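Under that description, the total training loss is just the sum of the two branches' cross-entropy terms; the variable names below are hypothetical.

```python
import torch.nn.functional as F

# logits_stage1 / logits_stage2: (B, T, vocab) outputs of the two decoder branches
# targets: (B, T) ground-truth character indices
loss = (F.cross_entropy(logits_stage1.flatten(0, 1), targets.flatten())
        + F.cross_entropy(logits_stage2.flatten(0, 1), targets.flatten()))
```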

4. Experimental Results

Evaluation on multiple benchmarks shows that the proposed method achieves higher accuracy than prior approaches, and a speed comparison shows notable inference acceleration from the parallel attention design.

Tags: CNN, computer vision, deep learning, Attention, Seq2Seq, text recognition, Multi-Scale
Written by

TiPaiPai Technical Team

At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.
