How Multi‑Scale Attention and DenseNet Boost Handwritten Math Expression Recognition

This article reviews an ICPR 2018 paper that introduces a densely connected encoder and a multi‑scale attention mechanism to improve handwritten mathematical expression recognition, detailing the background, network architecture, GRU decoder, loss function, and experimental gains over previous methods.

TiPaiPai Technical Team
Background

Handwritten expression recognition involves two main challenges: character recognition and structural analysis. Traditional sequential and global approaches struggle with hard‑to‑recognize characters, require handcrafted mathematical grammars as prior knowledge, and grow in algorithmic complexity as the grammar becomes more intricate.

Scale variations in handwritten symbols cause loss of fine details in low‑resolution feature maps, making small elements like decimal points indistinguishable after pooling.

Innovations

Adoption of a densely connected convolutional network (DenseNet) to improve the CNN encoder.

Introduction of a multi‑scale attention mechanism to mitigate information loss caused by pooling.

Implementation Details

1. Dense Encoder

DenseNet connects each layer to all preceding layers via channel‑wise concatenation, enabling feature reuse and reducing the number of filters per layer while preserving rich representations.

To keep feature‑map sizes consistent, DenseNet employs a DenseBlock + Transition structure: DenseBlock contains multiple layers with identical spatial dimensions and dense connections; Transition modules reduce spatial size via pooling.
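The DenseBlock + Transition structure above can be sketched in a few lines. This is a minimal, dependency‑light NumPy illustration, not the paper's implementation: the composite layer is stood in for by a fixed random linear map over channels plus ReLU, and the Transition's 1×1 compression convolution is omitted. Only the dense‑connectivity pattern and the channel/spatial bookkeeping are the point.

```python
import numpy as np

def conv_layer(x, growth_rate, rng):
    """Stand-in for a composite layer (1x1 bottleneck + 3x3 conv): maps the
    concatenated input to `growth_rate` new feature channels.
    A random linear map over channels keeps the sketch self-contained."""
    c_in = x.shape[0]
    w = rng.standard_normal((growth_rate, c_in))
    return np.maximum(np.tensordot(w, x, axes=1), 0.0)  # ReLU

def dense_block(x, num_layers, growth_rate, rng):
    """Each layer sees the channel-wise concatenation of the block input
    and every preceding layer's output (dense connectivity)."""
    features = [x]
    for _ in range(num_layers):
        concatenated = np.concatenate(features, axis=0)
        features.append(conv_layer(concatenated, growth_rate, rng))
    return np.concatenate(features, axis=0)

def transition(x):
    """Transition: 2x2 pooling halves the spatial size so the next
    DenseBlock again has identical spatial dimensions throughout."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32, 32))                  # (channels, H, W)
out = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
print(out.shape)     # channels grow linearly: 16 + 4*12 = 64
pooled = transition(out)
print(pooled.shape)  # spatial size halves: (64, 16, 16)
```

Note how the channel count grows by exactly `growth_rate` per layer: because earlier features are reused via concatenation, each layer can afford far fewer filters than in a plain CNN.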

2. Decoding

The decoder is a Gated Recurrent Unit (GRU), a streamlined variant of the LSTM. The GRU merges the LSTM's forget and input gates into a single update gate, simplifying the architecture while retaining comparable expressive power.

GRU gates:

Update gate decides how much of the previous hidden state to keep versus how much of the new candidate state to write in (playing the combined role of LSTM's forget and input gates).

Reset gate controls how much of the previous hidden state contributes to computing the candidate state.

Unlike the LSTM, the GRU has no separate output gate or cell state; the hidden state is exposed directly.
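As a concrete reference for the two gates, here is a minimal NumPy sketch of a single GRU step (generic textbook equations, not the paper's exact parameterization; bias terms are dropped for brevity):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h_prev, params):
    """One GRU step: the update gate z plays the combined role of LSTM's
    forget and input gates; the reset gate r controls how much of the
    previous state feeds the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand        # blend old and new state

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
params = [rng.standard_normal(shape) * 0.1
          for shape in [(d_h, d_in), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for _ in range(5):   # run a few steps on random inputs
    h = gru_cell(rng.standard_normal(d_in), h, params)
print(h.shape)       # (16,)
```

The final line `(1 - z) * h_prev + z * h_cand` is where the forget/input merge happens: a single gate `z` simultaneously decides what to discard from the old state and what to admit from the new one.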

3. Multi‑Scale Attention

The network uses three DenseBlocks in the main branch. Before the first block, a 7×7 convolution (stride 2) and a 2×2 max‑pool reduce spatial size. Each 3×3 convolution is preceded by a 1×1 bottleneck to lower parameter count.

Two feature‑map resolutions are produced: a low‑resolution map (A) from the main branch and a high‑resolution map (B). The attention module combines them as follows:

\hat{s}_t = GRU(y_{t‑1}, s_{t‑1})
c^A_t = f_att(\hat{s}_t, A),  c^B_t = f_att(\hat{s}_t, B)
c_t = [c^A_t; c^B_t]
s_t = GRU(c_t, \hat{s}_t)

Here, s_{t‑1} is the previous decoder state, \hat{s}_t is the predicted current state computed from the previous output symbol y_{t‑1}, c^A_t and c^B_t are the low‑ and high‑resolution context vectors produced by the attention function f_att, and their concatenation c_t is fed into the decoder at step t to produce the new state s_t.
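The two‑resolution attention step can be sketched as follows. This is an illustrative NumPy implementation of generic additive (Bahdanau‑style) attention applied once per feature map, assuming made‑up dimensions and weight names (`W_q`, `W_f`, `v`); the paper's coverage terms and exact parameterization are omitted.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attend(query, feature_map, W_q, W_f, v):
    """Additive attention over one feature map: score every spatial
    location against the query state, then return the attention-weighted
    sum of location features as the context vector."""
    c, h, w = feature_map.shape
    feats = feature_map.reshape(c, h * w).T          # (locations, channels)
    scores = np.tanh(feats @ W_f + query @ W_q) @ v  # one score per location
    alpha = softmax(scores)                          # attention weights
    return alpha @ feats                             # context, shape (c,)

rng = np.random.default_rng(0)
d_s, d_att, c = 16, 32, 24
A = rng.standard_normal((c, 8, 8))      # low-resolution map
B = rng.standard_normal((c, 16, 16))    # high-resolution map
W_q = rng.standard_normal((d_s, d_att)) * 0.1
W_f = rng.standard_normal((c, d_att)) * 0.1
v = rng.standard_normal(d_att) * 0.1
s_hat = rng.standard_normal(d_s)        # predicted decoder state \hat{s}_t

c_A = attend(s_hat, A, W_q, W_f, v)     # coarse context: layout/structure
c_B = attend(s_hat, B, W_q, W_f, v)     # fine context: small symbols
c_t = np.concatenate([c_A, c_B])        # joint context fed to the decoder
print(c_t.shape)  # (48,)
```

Because B keeps the higher spatial resolution, its attention weights can still isolate tiny symbols (such as decimal points) that pooling has already blurred away in A.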

Loss Function

The model is trained with the standard cross‑entropy loss.
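Concretely, the loss is the mean negative log‑likelihood of the target token sequence under the decoder's per‑step softmax. A small NumPy sketch with a toy vocabulary of 3 tokens (the log‑sum‑exp shift is for numerical stability):

```python
import numpy as np

def sequence_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target tokens under the
    per-step softmax distributions.
    logits: (T, V) raw scores; targets: (T,) integer token ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: two decoding steps, vocabulary of size 3.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])
targets = np.array([0, 1])   # correct token favored at each step
loss = sequence_cross_entropy(logits, targets)
print(round(loss, 4))
```

The loss shrinks toward zero as the decoder puts more probability mass on the correct token at each time step.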

Experiments

Replacing the VGG encoder with DenseNet (Dense) while keeping other components unchanged yields an ExpRate increase of ~5.7% on CROHME 2014 and ~5.5% on CROHME 2016. Adding multi‑scale attention (Dense+MSA) further raises ExpRate to 52.8% (CROHME 2014) and 50.1% (CROHME 2016), demonstrating the effectiveness of the proposed approach.

Visualization shows that low‑resolution features lose small details (e.g., decimal points), while high‑resolution features preserve them, confirming the benefit of multi‑scale attention.

Comparison with other methods on the CROHME 2014 test set shows the proposed technique achieving state‑of‑the‑art performance at the time of publication.

Tags: computer vision, GRU, handwritten recognition, DenseNet, multi‑scale attention