Mastering Text Recognition: Encoder & Decoder Strategies Explained

This article reviews modern text‑recognition systems, detailing how encoders such as CNN, CNN‑BiLSTM, and Transformer‑based models extract visual features, and how decoders like Position Attention, Transformer decoders, and RNN Seq2Seq align variable‑length text, while also discussing CTC loss and practical design choices.

TiPaiPai Technical Team

Overview

Text recognition aims to extract textual information from images: the system takes an image as input and outputs a text sequence. A typical Seq2Seq framework consists of an encoder that converts the image into a feature sequence and a decoder that generates the text from that sequence.

Text Recognition Encoder

The encoder extracts features from the input image and serializes them into a feature sequence. Four main encoder architectures are discussed:

CNN: a conventional feature extractor such as ResNet or VGG. Strides are adjusted to down‑sample less along the width dimension, yielding a longer feature sequence.

CNN + BiLSTM (CRNN): adds a bidirectional LSTM on top of the CNN features to capture contextual information along the sequence.

CNN + 1D Transformer Unit: applies a standard Transformer encoder to the flattened feature sequence, enhancing global context via multi‑head self‑attention.

CNN + 2D Transformer Unit: applies self‑attention directly on the 2D feature map, using positional encodings along both height and width to preserve spatial information.
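The stride trade‑off in the CNN bullet above comes down to shape bookkeeping: each conv stage with width stride s shrinks the sequence by a factor of s. A minimal sketch, with hypothetical stride schedules and assuming "same" padding:

```python
def feature_seq_len(width: int, width_strides: list[int]) -> int:
    """Feature-sequence length after applying each stage's width stride.

    Text recognizers often down-sample height aggressively but keep
    width strides small, so the sequence stays long enough to give
    every character at least one time step.
    """
    for s in width_strides:
        width = (width + s - 1) // s  # ceil division ("same" padding)
    return width

# A 128-px-wide crop: a standard 16x width reduction vs. a
# recognition-friendly 4x reduction (illustrative schedules).
standard = feature_seq_len(128, [2, 2, 2, 2])     # -> 8 time steps
recognition = feature_seq_len(128, [2, 2, 1, 1])  # -> 32 time steps
```

With only 8 time steps, a long word cannot be represented one character per step; keeping later width strides at 1 quadruples the sequence length at no extra height resolution.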

Some models (e.g., CRNN) predict directly from the encoder output, but this raises an alignment problem: the encoder emits one prediction per time step, and that length generally differs from the length of the target text. CTC addresses this by introducing a blank token and a loss that marginalizes over all valid alignments between the frame predictions and the text.
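CTC's collapsing rule (merge consecutive repeats, then drop blanks) can be illustrated with a minimal greedy decoder; the label values below are arbitrary:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame label sequence CTC-style:
    merge consecutive repeats, then remove blank tokens."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Eight frames collapse to a two-character text (blank = 0):
assert ctc_greedy_decode([0, 0, 3, 3, 0, 1, 1, 0]) == [3, 1]
# A doubled character survives only when a blank separates the copies:
assert ctc_greedy_decode([2, 2, 0, 2]) == [2, 2]
```

The second case is why the blank token exists: without it, genuinely repeated characters ("ll", "oo") would always be merged away.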

Text Recognition Decoder

Decoders align the variable‑length target text with the encoder's output sequence, whose length is set by the image rather than by the text. Common decoder designs include:

Position Attention: embeds position indices as queries that attend over the encoder features, converting the sequence to the target length.

Position Attention + Transformer Decoder / Unit: feeds the position‑aligned sequence into a Transformer decoder or additional Transformer units for progressive decoding.

RNN Seq2Seq Attention: a classic attention‑based RNN that decodes step by step and therefore cannot be parallelized.
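The position‑attention idea in the first bullet can be sketched in NumPy, assuming plain dot‑product attention with one query per output character slot (all dimensions here are illustrative, not from the article):

```python
import numpy as np

def position_attention(enc_feats, pos_queries):
    """Attend position queries over encoder features.

    enc_feats:   (T, D) variable-length encoder feature sequence
    pos_queries: (L, D) one learned query per output character slot
    Returns an (L, D) sequence: one attended glimpse per position,
    converting length T into the fixed decoding length L.
    """
    scores = pos_queries @ enc_feats.T / np.sqrt(enc_feats.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)  # softmax over the T time steps
    return w @ enc_feats

rng = np.random.default_rng(0)
enc = rng.standard_normal((40, 64))       # T=40 encoder time steps
queries = rng.standard_normal((25, 64))   # L=25 maximum text length
glimpses = position_attention(enc, queries)
assert glimpses.shape == (25, 64)
```

Each of the L glimpses is then classified into a character (or an end‑of‑sequence token), so texts shorter than L are handled naturally.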

Compared with Transformer‑based decoders, RNN decoders are slower due to sequential computation.
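The parallelism gap can be made concrete with a toy NumPy comparison: the RNN's hidden‑state recurrence forces a sequential loop, while causally masked self‑attention over the same inputs (here a single head with no projections, purely illustrative) is one batched matrix product:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 8, 4
x = rng.standard_normal((T, D))          # decoder inputs (teacher forcing)
W_h = rng.standard_normal((D, D)) * 0.1
W_x = rng.standard_normal((D, D)) * 0.1

# RNN: step t needs the hidden state from step t-1 -> inherently serial.
h = np.zeros(D)
states = []
for t in range(T):
    h = np.tanh(h @ W_h + x[t] @ W_x)
    states.append(h)

# Masked self-attention: the left-to-right dependency is expressed by a
# causal mask, so all T steps are computed in one shot.
scores = x @ x.T / np.sqrt(D)
scores[np.triu_indices(T, k=1)] = -np.inf   # block attention to the future
w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
attended = w @ x                            # one matrix multiply for all steps
```

During training with teacher forcing this difference dominates; at inference a Transformer decoder still generates autoregressively, but avoids recomputing a recurrent state chain.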

