Mastering Text Recognition: Encoder & Decoder Strategies Explained
This article reviews modern text‑recognition systems, detailing how encoders such as CNN, CNN‑BiLSTM, and Transformer‑based models extract visual features, and how decoders like Position Attention, Transformer decoders, and RNN Seq2Seq align variable‑length text, while also discussing CTC loss and practical design choices.
Overview
Text recognition aims to extract textual information from images. The system takes an image as input and outputs a text sequence. A Seq2Seq framework is used, consisting of an Encoder that converts the image into a feature sequence and a Decoder that generates the text.
Text Recognition Encoder
The encoder extracts features from the input image and serializes them into a feature sequence. Four main encoder architectures are discussed:
CNN: conventional feature extractors such as ResNet or VGG, with strides adjusted to reduce down‑sampling along the width dimension so the output sequence is long enough to cover the text.
CNN + BiLSTM (CRNN): adds a bidirectional LSTM on top of the CNN features to capture contextual information along the sequence.
CNN + 1D Transformer Unit: applies a standard Transformer encoder to the feature sequence, enhancing global context via multi‑head self‑attention.
CNN + 2D Transformer Unit: applies self‑attention directly on the 2D feature map, using positional encodings in both height and width to preserve spatial information.
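The serialization step shared by these encoders can be sketched as follows. This is a minimal pure‑Python illustration (the shapes and the function name `feature_map_to_sequence` are assumptions, not from the article): the height dimension of a CNN feature map is folded into the channel dimension, so each image column becomes one step of the feature sequence.

```python
def feature_map_to_sequence(fmap):
    """Flatten a (C, H, W) feature map into W time steps of C*H features.

    Each horizontal position (column) of the feature map becomes one
    element of the serialized feature sequence fed to the decoder.
    """
    C = len(fmap)
    H = len(fmap[0])
    W = len(fmap[0][0])
    # One feature vector per horizontal position (time step).
    return [[fmap[c][h][w] for c in range(C) for h in range(H)]
            for w in range(W)]

# A 2-channel, 3x4 feature map yields a sequence of length 4,
# each step a vector of 2*3 = 6 features.
fmap = [[[c * 100 + h * 10 + w for w in range(4)] for h in range(3)]
        for c in range(2)]
seq = feature_map_to_sequence(fmap)
```

This is why reducing width down‑sampling matters: the sequence length equals the width of the final feature map, and it must be at least as long as the longest text to be recognized.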
Some models predict directly from the encoder output (e.g., CRNN), but this raises an alignment problem: the encoder produces a fixed-length frame sequence while the target text length varies. CTC addresses this by introducing a blank token and training with CTC loss, which sums over all frame-level paths that collapse to the target text.
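The CTC collapse rule at the heart of this scheme is simple to state in code. A minimal sketch (assuming label `0` is the blank token): merge consecutive repeated labels, then drop blanks. The blank is what lets true repeated characters survive, since a blank between two identical labels prevents them from being merged.

```python
def ctc_collapse(path, blank=0):
    """Collapse a frame-level CTC path into the output label sequence:
    merge consecutive repeats, then drop blank tokens."""
    out = []
    prev = None
    for label in path:
        # Keep a label only if it differs from the previous frame
        # (repeat merging) and is not the blank token.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Nine frames collapse to four labels; the blanks (0) between the two
# 2s keep the genuine repeat from being merged away.
collapsed = ctc_collapse([1, 1, 0, 2, 2, 0, 0, 2, 3])
```

At training time, CTC loss marginalizes over every frame-level path whose collapse equals the target text, so no explicit per-frame alignment annotation is needed.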
Text Recognition Decoder
Decoders align the variable‑length text with the fixed‑length encoder output. Common decoder designs include:
Position Attention: embeds position indices as queries that attend over the encoder features, mapping the encoder sequence to a fixed maximum text length.
Position Attention + Transformer Decoder / Unit: feeds the position‑aligned sequence into a Transformer decoder or additional Transformer units for progressive decoding.
RNN Seq2Seq Attention: a classic attention‑based RNN that decodes step by step, so it cannot be parallelized across time steps.
Compared with Transformer‑based decoders, RNN decoders are slower due to sequential computation.
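The position-attention idea can be sketched in a few lines of pure Python (all names and shapes here are illustrative assumptions): each learned position query computes dot-product attention weights over the encoder features and outputs their weighted average. The output length equals the number of queries, decoupling it from the encoder sequence length, and every position can be computed in parallel.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def position_attention(queries, features):
    """Each position query attends over all encoder features.

    Returns one attended feature vector per query, so the output length
    is the (fixed) number of position queries, not the encoder length.
    """
    out = []
    for q in queries:
        # Dot-product score of this position query against each feature.
        scores = softmax([sum(qi * fi for qi, fi in zip(q, f))
                          for f in features])
        # Weighted average of the encoder features.
        out.append([sum(a * f[d] for a, f in zip(scores, features))
                    for d in range(len(features[0]))])
    return out

# 5 position queries over a 12-step encoder output -> 5 decoded steps.
feats = [[math.sin(t + d) for d in range(4)] for t in range(12)]
qs = [[math.cos(p + d) for d in range(4)] for p in range(5)]
decoded = position_attention(qs, feats)
```

In contrast, an RNN decoder must feed each step's output back in as the next step's input, which is the sequential dependency that prevents parallel decoding.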
References
Robust Scene Text Recognition with Automatic Rectification
Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
Towards Accurate Scene Text Recognition with Semantic Reasoning Networks
On Recognizing Texts of Arbitrary Shapes with 2D Self‑Attention
Rosetta: Large‑scale system for text detection and recognition in images
An End‑to‑End Trainable Neural Network for Image‑based Sequence Recognition and Its Application to Scene Text Recognition
Attention Is All You Need
TiPaiPai Technical Team
At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.
