How Multi‑Scale Attention and DenseNet Boost Handwritten Math Expression Recognition
This article reviews a CVPR 2018 paper that improves handwritten mathematical expression recognition with a densely connected convolutional encoder and a multi‑scale attention mechanism, covering the background, the network architecture, the GRU decoder, the loss function, and the experimental gains over previous methods.
Background
Handwritten expression recognition involves two main challenges: character recognition and structural analysis. Traditional sequential and global approaches struggle with hard‑to‑recognize characters, require prior knowledge of mathematical grammar, and grow in algorithmic complexity as the grammar becomes more intricate.
Scale variations in handwritten symbols cause loss of fine details in low‑resolution feature maps, making small elements like decimal points indistinguishable after pooling.
Innovations
Adoption of a densely connected convolutional network (DenseNet) to improve the CNN encoder.
Introduction of a multi‑scale attention mechanism to mitigate information loss caused by pooling.
Implementation Details
1. Dense Encoder
DenseNet connects each layer to all preceding layers via channel‑wise concatenation, enabling feature reuse and reducing the number of filters per layer while preserving rich representations.
To keep feature‑map sizes consistent, DenseNet employs a DenseBlock + Transition structure: DenseBlock contains multiple layers with identical spatial dimensions and dense connections; Transition modules reduce spatial size via pooling.
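As a concrete illustration, below is a minimal PyTorch sketch of this structure, including the 1×1 bottleneck before each 3×3 convolution (described further below); the growth rate, bottleneck width, and compression factor are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """1x1 bottleneck followed by a 3x3 conv; the output is concatenated to the input."""
    def __init__(self, in_channels, growth_rate=24, bottleneck=4):
        super().__init__()
        inter = bottleneck * growth_rate
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False),
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Channel-wise concatenation implements the dense connectivity.
        return torch.cat([x, self.net(x)], dim=1)

class DenseBlock(nn.Module):
    """Stack of dense layers sharing the same spatial resolution."""
    def __init__(self, in_channels, num_layers, growth_rate=24):
        super().__init__()
        layers = [DenseLayer(in_channels + i * growth_rate, growth_rate)
                  for i in range(num_layers)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class Transition(nn.Module):
    """1x1 conv to compress channels, then 2x2 average pooling to halve the size."""
    def __init__(self, in_channels, compression=0.5):
        super().__init__()
        out = int(in_channels * compression)
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.net(x)
```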
2. Decoding
The decoder is a Gated Recurrent Unit (GRU), a simplified variant of the LSTM. The GRU merges the LSTM's forget and input gates into a single update gate and dispenses with the separate cell state, reducing parameters while retaining comparable expressive power.
GRU gates:
The update gate controls how much of the previous hidden state is kept versus overwritten by new information.
The reset gate controls how much of the previous hidden state is used when computing the candidate state.
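In equations (one common convention; x_t is the input at step t, h_{t-1} the previous hidden state, and \odot denotes element‑wise multiplication):
z_t = \sigma(W_z x_t + U_z h_{t-1}) (update gate)
r_t = \sigma(W_r x_t + U_r h_{t-1}) (reset gate)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) (candidate state)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t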
3. Multi‑Scale Attention
The network uses three DenseBlocks in the main branch. Before the first block, a 7×7 convolution (stride 2) and a 2×2 max‑pool reduce spatial size. Each 3×3 convolution is preceded by a 1×1 bottleneck to lower parameter count.
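Continuing the sketch above (and reusing its DenseBlock and Transition classes), a minimal encoder could look as follows. Layer counts and channel widths are illustrative, and where the paper's high‑resolution branch adds its own dense processing, this sketch simply taps the features before the final transition's pooling.

```python
class DenseEncoder(nn.Module):
    """Stem + three DenseBlocks; returns the low-resolution map A
    and the high-resolution map B (illustrative sketch)."""
    def __init__(self, growth_rate=24, layers_per_block=16):
        super().__init__()
        self.stem = nn.Sequential(
            # 7x7 stride-2 convolution followed by 2x2 max-pooling, as in the text
            nn.Conv2d(1, 48, kernel_size=7, stride=2, padding=3, bias=False),  # grayscale input assumed
            nn.BatchNorm2d(48), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        ch = 48
        self.block1 = DenseBlock(ch, layers_per_block, growth_rate)
        ch += layers_per_block * growth_rate
        self.trans1 = Transition(ch)
        ch //= 2
        self.block2 = DenseBlock(ch, layers_per_block, growth_rate)
        ch += layers_per_block * growth_rate
        self.trans2 = Transition(ch)
        ch //= 2
        self.block3 = DenseBlock(ch, layers_per_block, growth_rate)

    def forward(self, x):
        x = self.trans1(self.block1(self.stem(x)))
        b = self.block2(x)                # high-resolution map B, tapped before the last pooling
        a = self.block3(self.trans2(b))   # low-resolution map A
        return a, b
```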
Two feature‑map resolutions are produced: a low‑resolution map (A) and a high‑resolution map (B). The attention module combines them as follows:
\hat{s}_t = \mathrm{GRU}(y_{t-1}, s_{t-1})
c^A_t = f^A_{att}(\hat{s}_t, A), \quad c^B_t = f^B_{att}(\hat{s}_t, B)
c_t = [c^A_t; c^B_t], \quad s_t = \mathrm{GRU}(c_t, \hat{s}_t)
Here, s_{t-1} is the previous decoder state, y_{t-1} the previously emitted symbol, \hat{s}_t the predicted current state, f^A_{att} and f^B_{att} the attention functions over the low‑ and high‑resolution maps, c^A_t and c^B_t the resulting context vectors, and their concatenation c_t is fed into the decoder to produce the state s_t at step t.
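The following PyTorch sketch shows one way to realize this two‑branch attention; the additive scoring function, the dimension names, and the omission of the paper's coverage term are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAttention(nn.Module):
    """Additive attention over one flattened feature map (illustrative sketch)."""
    def __init__(self, feat_channels, state_dim, attn_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_channels, attn_dim)
        self.proj_state = nn.Linear(state_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, s_hat):
        # feats: (batch, H*W, C) flattened feature map; s_hat: (batch, state_dim)
        e = self.score(torch.tanh(self.proj_feat(feats)
                                  + self.proj_state(s_hat).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)            # attention weights over locations
        return (alpha * feats).sum(dim=1)      # context vector (batch, C)

class MultiScaleAttention(nn.Module):
    """Attend over low-res map A and high-res map B, then concatenate."""
    def __init__(self, ch_a, ch_b, state_dim):
        super().__init__()
        self.att_a = ScaleAttention(ch_a, state_dim)
        self.att_b = ScaleAttention(ch_b, state_dim)

    def forward(self, A, B, s_hat):
        # A: (batch, N_A, ch_a), B: (batch, N_B, ch_b) flattened feature maps
        c_a = self.att_a(A, s_hat)             # low-resolution context c^A_t
        c_b = self.att_b(B, s_hat)             # high-resolution context c^B_t
        return torch.cat([c_a, c_b], dim=-1)   # c_t, fed to the decoder GRU
```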
Loss Function
The model is trained with the standard cross‑entropy loss.
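Concretely, for an input image X and a ground‑truth symbol sequence y = (y_1, \dots, y_T), this objective is the negative log‑likelihood of each target symbol given its predecessors:
L = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, X)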
Experiments
Replacing the VGG encoder with DenseNet (Dense) while keeping the other components unchanged raises ExpRate (the percentage of expressions recognized entirely correctly) by ~5.7% on CROHME 2014 and ~5.5% on CROHME 2016. Adding multi‑scale attention (Dense+MSA) further raises ExpRate to 52.8% (CROHME 2014) and 50.1% (CROHME 2016), demonstrating the effectiveness of both components.
Visualization shows that low‑resolution features lose small details (e.g., decimal points), while high‑resolution features preserve them, confirming the benefit of multi‑scale attention.
Comparison with other methods on the CROHME 2014 test set shows the proposed technique achieving state‑of‑the‑art performance at the time of publication.
TiPaiPai Technical Team
At TiPaiPai, we focus on building engineering teams and culture, cultivating technical insights and practice, and fostering sharing, growth, and connection.