How Attention Boosts Text Recognition: From CNN‑Seq2Seq to Multi‑Scale Models
This article explains how attention mechanisms are applied to text recognition, covering the basic CNN‑Seq2Seq‑Attention architecture, multi‑scale attention extensions, and a 2D attentional irregular scene text recognizer with detailed network components, training loss, and experimental results.
Attention in Text Recognition
Presenter: Xie Jun (Junior Image Algorithm Engineer)
1. Attention for Text Recognition – Basic Model
(1) CNN+Seq2Seq+Attention Architecture
The model consists of an Encoder and a Decoder. The Encoder extracts features from a 3×32×280 input image using a CNN backbone, producing a 512×1×71 feature map, which is then processed by a BiLSTM to obtain a 256×71 sequence [e1, e2, …, eT].
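The shape pipeline above can be sketched as follows. The backbone here is a stand-in (any CNN that collapses the height to 1 and yields 71 time steps would do), and the BiLSTM hidden size of 128 is an assumption chosen so that the bidirectional output matches the 256-dimensional sequence in the text:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical encoder matching the shapes in the text:
    3x32x280 image -> 512x1x71 feature map -> 256x71 sequence."""
    def __init__(self):
        super().__init__()
        # Stand-in backbone; a real model would use a deeper CNN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 512, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 71)),  # collapse height, fix width to 71 steps
        )
        # Bidirectional LSTM: 2 * 128 = 256 features per time step.
        self.rnn = nn.LSTM(512, 128, bidirectional=True, batch_first=True)

    def forward(self, img):                  # img: (B, 3, 32, 280)
        f = self.backbone(img)               # (B, 512, 1, 71)
        seq = f.squeeze(2).permute(0, 2, 1)  # (B, 71, 512)
        out, _ = self.rnn(seq)               # (B, 71, 256) = [e1, ..., eT]
        return out
```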
The Decoder iteratively decodes the sequence, generating one character at a time until termination. Attention is applied during decoding in three steps:
Compute the attention weights from the current hidden state and the encoder outputs.
Use the weights to compute a context vector c as the weighted sum of the encoder outputs.
Combine c with the hidden state h and feed them into a GRU to produce the current character prediction and update the hidden state.
The initial hidden state h0 is a learned initialization value.
The decoder operates in a loop whose length equals the text length: during training this is the maximum label length in the batch, while at inference a fixed maximum length is used.
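One decoding step following the three steps above can be sketched as below. The additive-style scorer, dimensions, and vocabulary size are assumptions for illustration, not the exact layers of the presented model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One hypothetical step of the attention decoder:
    (1) score encoder steps against the hidden state to get weights,
    (2) form context c as the weighted sum of encoder outputs,
    (3) feed [embedding; c] into a GRU cell to predict and update h."""
    def __init__(self, enc_dim=256, hid_dim=256, emb_dim=128, vocab=37):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.score = nn.Linear(enc_dim + hid_dim, 1)        # assumed scorer form
        self.gru = nn.GRUCell(emb_dim + enc_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, prev_char, h, enc_out):               # enc_out: (B, T, enc_dim)
        B, T, _ = enc_out.shape
        h_exp = h.unsqueeze(1).expand(B, T, h.size(-1))
        # Step 1: attention weights over the T encoder steps.
        scores = self.score(torch.tanh(torch.cat([enc_out, h_exp], -1))).squeeze(-1)
        weight = F.softmax(scores, dim=1)                   # (B, T)
        # Step 2: context vector as the weighted sum of encoder outputs.
        c = (weight.unsqueeze(-1) * enc_out).sum(1)         # (B, enc_dim)
        # Step 3: GRU consumes the previous character embedding and c.
        h_new = self.gru(torch.cat([self.embed(prev_char), c], -1), h)
        return self.out(h_new), h_new, weight
```

Run in a loop over the text length, each step feeds its predicted (or ground-truth) character and updated hidden state into the next.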
2. Multi‑Scale Attention Text Recognition
The model uses two feature maps extracted by DenseNet at resolutions H×W×C and 2H×2W×C to preserve details across scales.
During decoding, each scale’s feature map is attended to separately:
The input character and hidden state are fed into a GRU to obtain pre. pre is then transformed and combined with each scale’s encoder output via a fully‑connected attention layer (fcatt) to produce two context vectors, one per scale.
The two context vectors are concatenated and, together with pre, are input to another GRU to generate the prediction and update the hidden state.
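The two-scale decoding step can be sketched as follows. The layer names (e.g. fc_att), shared scorer, and dimensions are assumptions; each 2D feature map is flattened to a sequence before attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttnStep(nn.Module):
    """Hypothetical two-scale attention step: a first GRU produces 'pre',
    pre attends to each scale separately, and the two context vectors plus
    pre drive a second GRU that emits the prediction."""
    def __init__(self, feat_dim=256, hid=256, emb=128, vocab=37):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.gru1 = nn.GRUCell(emb, hid)
        self.fc_att = nn.Linear(hid + feat_dim, 1)   # fully-connected attention scorer
        self.gru2 = nn.GRUCell(hid + 2 * feat_dim, hid)
        self.out = nn.Linear(hid, vocab)

    def attend(self, pre, feats):                    # feats: (B, H*W, feat_dim)
        p = pre.unsqueeze(1).expand(-1, feats.size(1), -1)
        a = F.softmax(self.fc_att(torch.tanh(torch.cat([feats, p], -1))).squeeze(-1), 1)
        return (a.unsqueeze(-1) * feats).sum(1)      # per-scale context vector

    def forward(self, prev_char, h, feat_small, feat_large):
        pre = self.gru1(self.embed(prev_char), h)    # "pre" in the text
        # Attend to each scale separately, then concatenate the contexts.
        c = torch.cat([self.attend(pre, feat_small), self.attend(pre, feat_large)], -1)
        h_new = self.gru2(torch.cat([c, pre], -1), pre)
        return self.out(h_new), h_new
```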
3. 2D Attentional Irregular Scene Text Recognizer
Background: Recognizing text in irregular scenes is challenging. Existing pipelines include image rectification, 2D spatial encoding, semantic segmentation with character masks, and encoder‑decoder approaches. The presented method introduces a novel architecture.
Network Architecture: ResNet backbone + Relation Attention + Parallel Attention + Two‑stage Decoder.
Relation Attention: Two Transformer layers that model global dependencies.
Parallel Attention: Independent 2D attention for each time step, allowing parallel computation of the attention weights αt without sequential dependence.
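A minimal sketch of this parallel 2D attention follows. Using one learned query per time step is an assumed realization; because the weights αt do not depend on previous steps, all T glimpses come out of a single batched matrix product:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelAttention(nn.Module):
    """Hypothetical parallel 2D attention: one learned query per time step
    attends to the flattened 2D feature map; all T steps are computed at
    once since alpha_t has no sequential dependence."""
    def __init__(self, feat_dim=512, max_len=25):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_len, feat_dim))  # one query per step

    def forward(self, feat2d):                       # feat2d: (B, C, H, W)
        B, C, H, W = feat2d.shape
        f = feat2d.flatten(2).transpose(1, 2)        # (B, H*W, C)
        scores = torch.einsum('tc,bnc->btn', self.queries, f)  # (B, T, H*W)
        alpha = F.softmax(scores, dim=-1)            # alpha_t over 2D positions
        return alpha @ f                             # (B, T, C): one glimpse per step
```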
Two‑stage Decoder: One branch decodes directly from Parallel Attention; the other first applies Relation Attention before decoding.
Loss Function: Standard cross‑entropy loss is applied to both decoder stages and summed during training.
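The summed training loss amounts to the following. The padding convention (ignoring padded positions via a pad index) is an assumption added for completeness:

```python
import torch
import torch.nn.functional as F

def two_stage_loss(logits1, logits2, target, pad_idx=0):
    """Cross-entropy on each decoder branch (direct and relation-attention),
    summed. logits*: (B, T, V); target: (B, T). pad_idx is an assumed convention."""
    l1 = F.cross_entropy(logits1.flatten(0, 1), target.flatten(), ignore_index=pad_idx)
    l2 = F.cross_entropy(logits2.flatten(0, 1), target.flatten(), ignore_index=pad_idx)
    return l1 + l2
```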
4. Experimental Results
Evaluation on multiple scene‑text benchmarks shows that the proposed method achieves higher accuracy than prior approaches. A speed comparison also demonstrates notably faster inference, attributed to the parallel attention design, which computes all attention weights at once rather than step by step.
TiPaiPai Technical Team
At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.