A Seq2Seq Deep Learning Approach for Recognizing Mathematical Formulas in Images
This article presents a deep‑learning Seq2Seq model that converts images of mathematical formulas—including matrices, equations, fractions, and radicals—into LaTeX sequences with over 95% accuracy, detailing data preparation, LaTeX normalization, model architecture, training, inference, and post‑processing techniques.
OCR (Optical Character Recognition) transforms information on images (Chinese characters, letters, numbers, etc.) into editable electronic text. With the rapid development of artificial intelligence, deep‑learning‑based OCR has been widely applied in education for tasks such as intelligent grading and textbook entry.
While current deep‑learning OCR achieves high accuracy for simple one‑dimensional text, its performance on two‑dimensional structures like mathematical and chemical formulas remains limited. To address this gap, the article proposes a technique capable of recognizing two‑dimensional formula structures (e.g., matrices, equation systems, fractions, radicals) with an accuracy exceeding 95%.
2 Technical Route
Formula recognition is treated as an image‑to‑LaTeX translation problem using a Seq2Seq network architecture. The model input is a formula image, and the output is the corresponding LaTeX sequence.
Figure 1: Overview of the mathematical formula recognition model
2.1 Data Preparation
To build a robust model, data is collected through a three‑step strategy: (1) analyzing real‑world formula characteristics and synthesizing data accordingly; (2) applying data augmentation to increase diversity; (3) iteratively gathering hard cases based on recognition confidence to improve generalization.
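As an illustration of step (2), the sketch below shows the kind of image-level augmentation such a pipeline might apply; the specific transforms (random padding, Gaussian noise, brightness jitter) and the function name are assumptions, not the article's exact recipe:

```python
import numpy as np

def augment_formula_image(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Augment a grayscale formula image (H, W), pixel values in [0, 255]."""
    # Random padding simulates varying whitespace around the formula.
    top, bottom, left, right = rng.integers(0, 10, size=4)
    img = np.pad(img, ((top, bottom), (left, right)), constant_values=255)
    # Additive Gaussian noise simulates scanning/camera artifacts.
    noise = rng.normal(0, 5, size=img.shape)
    img = np.clip(img.astype(np.float64) + noise, 0, 255)
    # Brightness jitter simulates lighting variation.
    alpha = rng.uniform(0.9, 1.1)
    return np.clip(img * alpha, 0, 255).astype(np.uint8)
```

Each call produces a slightly different variant of the same formula, so one rendered LaTeX sample yields many training images.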
2.2 LaTeX Formula Normalization
Because a single formula can have multiple LaTeX representations, a normalization strategy is employed to ensure a one‑to‑one mapping between symbols and LaTeX strings, reducing training loss instability.
Figure 2: Non‑unique LaTeX representations
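A minimal sketch of such a normalization pass, assuming a rule set of synonymous commands and bracing conventions (the table below is an illustrative subset, not the article's actual rules):

```python
import re

# Synonymous commands mapped to one canonical spelling (illustrative subset).
CANONICAL = {
    r"\dfrac": r"\frac",
    r"\tfrac": r"\frac",
    r"\leq": r"\le",
    r"\geq": r"\ge",
}

def normalize_latex(s: str) -> str:
    # Replace command variants; the lookahead avoids clipping longer commands
    # (e.g. \leq inside \leqslant), and the lambda sidesteps backslash-escape
    # pitfalls in re.sub replacement templates.
    for variant, canon in CANONICAL.items():
        s = re.sub(re.escape(variant) + r"(?![a-zA-Z])", lambda m, c=canon: c, s)
    # Always brace single-token scripts: x^2 -> x^{2}, a_i -> a_{i}.
    s = re.sub(r"([_^])(\w)", r"\1{\2}", s)
    # Collapse redundant whitespace so equivalent spacings compare equal.
    return re.sub(r"\s+", " ", s).strip()
```

After normalization, every training label for the same rendered symbol string is identical, so the loss no longer penalizes the model for choosing a different but equivalent spelling.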
2.3 Seq2Seq Network Architecture
The Seq2Seq model, originally introduced for machine translation, consists of an encoder and a decoder, enabling effective learning of formula structural features such as superscripts, subscripts, and enclosing constructs.
2.3.1 Encoder
The encoder extracts feature maps from formula images, adopting an Inception‑ResNet‑V2 backbone. It incorporates multi‑receptive‑field Inception modules to capture varying font sizes and uses Position Embedding to retain spatial relationships between characters.
Figure 3: Encoder network architecture
After the feature map is obtained, it is flattened into a one‑dimensional sequence of semantic vectors for the decoder; spatial relationships between characters survive the flattening because positional information has already been embedded into the features.
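The position-embedding-then-flatten step can be sketched as follows; this uses a sinusoidal 2‑D encoding as one plausible choice (the article does not specify the embedding's exact form), with half the channels encoding the row index and half the column index:

```python
import numpy as np

def positional_encoding_2d(h: int, w: int, d: int) -> np.ndarray:
    """Sinusoidal 2-D position embedding of shape (h, w, d); d divisible by 4."""
    def encode(pos: np.ndarray, dim: int) -> np.ndarray:
        # Standard sinusoidal encoding of a 1-D position -> (len(pos), dim).
        i = np.arange(dim // 2)
        freq = 1.0 / (10000 ** (2 * i / dim))
        angles = pos[:, None] * freq[None, :]
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

    row = encode(np.arange(h), d // 2)  # (h, d/2): which row am I in?
    col = encode(np.arange(w), d // 2)  # (w, d/2): which column am I in?
    return np.concatenate(
        [np.repeat(row[:, None, :], w, axis=1),   # broadcast rows over columns
         np.repeat(col[None, :, :], h, axis=0)],  # broadcast columns over rows
        axis=-1)                                  # (h, w, d)

def flatten_with_positions(feat: np.ndarray) -> np.ndarray:
    """CNN feature map (H, W, C) -> decoder-ready sequence (H*W, C)."""
    h, w, c = feat.shape
    return (feat + positional_encoding_2d(h, w, c)).reshape(h * w, c)
```

Because each position's code is unique, the decoder can still distinguish, say, a numerator cell from a denominator cell after the grid is unrolled into a sequence.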
2.3.2 Decoder
The decoder transforms the semantic vector into a LaTeX sequence using an LSTM network, which mitigates long‑term dependency issues of standard RNNs. An attention mechanism further weights encoder outputs to focus on the most relevant parts during each decoding step.
Figure 4: Decoder network architecture
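One decoding step of such an attention mechanism can be sketched in NumPy as below; this is a Bahdanau-style additive attention, a common choice for LSTM decoders, and the parameter names `W_q`, `W_k`, `v` are hypothetical learned weights, not names from the article:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_context(dec_h, enc_out, W_q, W_k, v):
    """One additive-attention step.
    dec_h:   (D,)  current decoder hidden state
    enc_out: (T, E) flattened encoder feature sequence
    Returns the context vector (E,) and attention weights (T,)."""
    # Score every encoder position against the decoder state.
    scores = np.tanh(enc_out @ W_k + dec_h @ W_q) @ v   # (T,)
    weights = softmax(scores)                           # sums to 1 over positions
    return weights @ enc_out, weights                   # weighted sum of features
```

At each step the LSTM consumes the context vector alongside the previous token, so the decoder attends to, e.g., the radicand region while emitting the tokens inside `\sqrt{...}`.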
2.4 Training Phase
During training, teacher forcing is applied: the ground‑truth token sequence is fed as input at each time step, preventing instability caused by using the model’s own predictions early in training.
Figure 5: Training phase illustration
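The teacher-forcing setup amounts to a simple shift of the label sequence: the decoder's input at step t is the ground-truth token t−1, and its target is token t. A minimal sketch (the token names `<s>`/`</s>` are assumed sentinels):

```python
def teacher_forcing_pairs(latex_tokens, bos="<s>", eos="</s>"):
    """Build (decoder_input, target) sequences for one training sample.
    At every step the decoder sees the ground-truth prefix, never its own
    (possibly wrong) earlier predictions."""
    seq = [bos] + list(latex_tokens) + [eos]
    return seq[:-1], seq[1:]  # input is the sequence shifted right by one
```

For the label `\frac { 1 } { 2 }`, the decoder input begins with `<s>` and the target ends with `</s>`, aligned one step apart.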
2.5 Inference Phase
During inference, the model generates tokens autoregressively, feeding each predicted token back as the next input. Decoding can be performed with Greedy Search (beam size = 1) or Beam Search, which spends extra computation to explore several candidate sequences and thereby avoid some locally greedy mistakes; an example with beam size = 3 is shown.
Figure 6: Inference phase illustration (beam search)
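A compact sketch of beam search over an abstract next-token model; `step_logprobs` is a hypothetical stand-in for the decoder's softmax output, and the toy numbers in the usage note are invented to show why a wider beam can beat greedy decoding:

```python
def beam_search(step_logprobs, beam_size=3, max_len=10, eos="</s>"):
    """step_logprobs(prefix) -> {token: log-prob of that token coming next}.
    Keeps the beam_size highest-scoring prefixes at every step."""
    beams = [([], 0.0)]  # (token list, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks and toks[-1] == eos:
                candidates.append((toks, score))  # finished beams carry over
                continue
            for tok, lp in step_logprobs(toks).items():
                candidates.append((toks + [tok], score + lp))
        # Prune to the best beam_size hypotheses by total log-probability.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams[0][0]
```

With beam size 1 this degenerates to greedy search: a locally best first token can lead into a low-probability continuation, while a beam of 3 keeps the runner-up alive long enough to win on total sequence probability.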
2.6 Post‑Processing
Post‑processing corrects common confusions (e.g., “0” vs. “o”, “1” vs. “l”) using domain‑specific priors, improving accuracy by roughly 1% with negligible runtime overhead.
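A minimal sketch of such a rule, assuming one plausible prior (letters rarely sit directly between digits in a formula); the article's actual rules are not spelled out:

```python
def postprocess(tokens):
    """Fix common OCR confusions in a token sequence using context:
    'o' or 'l' adjacent to a digit is almost certainly 0 or 1."""
    fixed = []
    for i, t in enumerate(tokens):
        prev_digit = i > 0 and tokens[i - 1].isdigit()
        next_digit = i + 1 < len(tokens) and tokens[i + 1].isdigit()
        if t == "o" and (prev_digit or next_digit):
            t = "0"
        elif t == "l" and (prev_digit or next_digit):
            t = "1"
        fixed.append(t)
    return fixed
```

The rule is deliberately conservative: `l` and `o` surrounded by letters (as in `\log`) are left untouched, so only genuinely ambiguous digit contexts are rewritten.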
2.7 Recognition Results
The model converts formula images directly into LaTeX code; visual examples are shown where the generated LaTeX is rendered via XeLaTeX and ImageMagick for comparison.
Figure 7: Sample recognition results
3 Conclusion
The proposed formula recognition model effectively addresses two‑dimensional structure recognition, achieving an average accuracy above 95%. However, accuracy for highly complex formulas with long LaTeX sequences remains a challenge due to long‑range dependencies in decoding; future work will explore graph‑based models to better handle such cases.
New Oriental Technology