A Seq2Seq Deep Learning Approach for Recognizing Mathematical Formulas in Images
This article presents a deep‑learning Seq2Seq model that converts images of mathematical formulas—including matrices, equations, fractions, and radicals—into LaTeX sequences with over 95% accuracy, detailing data preparation, LaTeX normalization, model architecture, training, inference, and post‑processing techniques.
OCR (Optical Character Recognition) transforms information on images (Chinese characters, letters, numbers, etc.) into editable electronic text. With the rapid development of artificial intelligence, deep‑learning‑based OCR has been widely applied in education for tasks such as intelligent grading and textbook entry.
While current deep‑learning OCR achieves high accuracy for simple one‑dimensional text, its performance on two‑dimensional structures like mathematical and chemical formulas remains limited. To address this gap, the article proposes a technique capable of recognizing two‑dimensional formula structures (e.g., matrices, equation systems, fractions, radicals) with an accuracy exceeding 95%.
2 Technical Route
Formula recognition is treated as an image‑to‑LaTeX translation problem using a Seq2Seq network architecture. The model input is a formula image, and the output is the corresponding LaTeX sequence.
Figure 1: Overview of the mathematical formula recognition model
2.1 Data Preparation
To build a robust model, data is collected through a three‑step strategy: (1) analyzing real‑world formula characteristics and synthesizing data accordingly; (2) applying data augmentation to increase diversity; (3) iteratively gathering hard cases based on recognition confidence to improve generalization.
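As an illustration of step (2), the sketch below shows the kind of image-level augmentation such a pipeline might apply; the specific transforms (random padding, Gaussian noise, brightness jitter) and the function name are assumptions, not the article's exact recipe:

```python
import numpy as np

def augment_formula_image(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Augment a grayscale formula image (H, W), pixel values in [0, 255]."""
    # Random padding simulates varying whitespace around the formula.
    top, bottom, left, right = rng.integers(0, 10, size=4)
    img = np.pad(img, ((top, bottom), (left, right)), constant_values=255)
    # Additive Gaussian noise simulates scanning/camera artifacts.
    noise = rng.normal(0, 5, size=img.shape)
    img = np.clip(img.astype(np.float64) + noise, 0, 255)
    # Brightness jitter simulates lighting variation.
    alpha = rng.uniform(0.9, 1.1)
    return np.clip(img * alpha, 0, 255).astype(np.uint8)
```

Each call produces a slightly different variant of the same formula, so one rendered LaTeX sample yields many training images.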
2.2 LaTeX Formula Normalization
Because a single formula can have multiple LaTeX representations, a normalization strategy is employed to ensure a one‑to‑one mapping between symbols and LaTeX strings, reducing training loss instability.
Figure 2: Non‑unique LaTeX representations
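A minimal sketch of such a normalization pass, assuming a rule set of synonymous commands and bracing conventions (the table below is an illustrative subset, not the article's actual rules):

```python
import re

# Synonymous commands mapped to one canonical spelling (illustrative subset).
CANONICAL = {
    r"\dfrac": r"\frac",
    r"\tfrac": r"\frac",
    r"\leq": r"\le",
    r"\geq": r"\ge",
}

def normalize_latex(s: str) -> str:
    # Replace command variants; the lookahead avoids clipping longer commands
    # (e.g. \leq inside \leqslant), and the lambda sidesteps backslash-escape
    # pitfalls in re.sub replacement templates.
    for variant, canon in CANONICAL.items():
        s = re.sub(re.escape(variant) + r"(?![a-zA-Z])", lambda m, c=canon: c, s)
    # Always brace single-token scripts: x^2 -> x^{2}, a_i -> a_{i}.
    s = re.sub(r"([_^])(\w)", r"\1{\2}", s)
    # Collapse redundant whitespace so equivalent spacings compare equal.
    return re.sub(r"\s+", " ", s).strip()
```

After normalization, every training label for the same rendered symbol string is identical, so the loss no longer penalizes the model for choosing a different but equivalent spelling.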
2.3 Seq2Seq Network Architecture
The Seq2Seq model, originally introduced for machine translation, consists of an encoder and a decoder, enabling effective learning of formula structural features such as superscripts, subscripts, and enclosing constructs.
2.3.1 Encoder
The encoder extracts feature maps from formula images, adopting an Inception‑ResNet‑V2 backbone. It incorporates multi‑receptive‑field Inception modules to capture varying font sizes and uses Position Embedding to retain spatial relationships between characters.
Figure 3: Encoder network architecture
After the feature map is obtained, it is flattened into a one‑dimensional sequence of semantic vectors for the decoder; spatial relationships between characters survive the flattening because positional information has already been embedded into the features.
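The position-embedding-then-flatten step can be sketched as follows; this uses a sinusoidal 2‑D encoding as one plausible choice (the article does not specify the embedding's exact form), with half the channels encoding the row index and half the column index:

```python
import numpy as np

def positional_encoding_2d(h: int, w: int, d: int) -> np.ndarray:
    """Sinusoidal 2-D position embedding of shape (h, w, d); d divisible by 4."""
    def encode(pos: np.ndarray, dim: int) -> np.ndarray:
        # Standard sinusoidal encoding of a 1-D position -> (len(pos), dim).
        i = np.arange(dim // 2)
        freq = 1.0 / (10000 ** (2 * i / dim))
        angles = pos[:, None] * freq[None, :]
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

    row = encode(np.arange(h), d // 2)  # (h, d/2): which row am I in?
    col = encode(np.arange(w), d // 2)  # (w, d/2): which column am I in?
    return np.concatenate(
        [np.repeat(row[:, None, :], w, axis=1),   # broadcast rows over columns
         np.repeat(col[None, :, :], h, axis=0)],  # broadcast columns over rows
        axis=-1)                                  # (h, w, d)

def flatten_with_positions(feat: np.ndarray) -> np.ndarray:
    """CNN feature map (H, W, C) -> decoder-ready sequence (H*W, C)."""
    h, w, c = feat.shape
    return (feat + positional_encoding_2d(h, w, c)).reshape(h * w, c)
```

Because each position's code is unique, the decoder can still distinguish, say, a numerator cell from a denominator cell after the grid is unrolled into a sequence.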
2.3.2 Decoder
The decoder transforms the semantic vector into a LaTeX sequence using an LSTM network, which mitigates long‑term dependency issues of standard RNNs. An attention mechanism further weights encoder outputs to focus on the most relevant parts during each decoding step.
Figure 4: Decoder network architecture
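One decoding step of such an attention mechanism can be sketched in NumPy as below; this is a Bahdanau-style additive attention, a common choice for LSTM decoders, and the parameter names `W_q`, `W_k`, `v` are hypothetical learned weights, not names from the article:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_context(dec_h, enc_out, W_q, W_k, v):
    """One additive-attention step.
    dec_h:   (D,)  current decoder hidden state
    enc_out: (T, E) flattened encoder feature sequence
    Returns the context vector (E,) and attention weights (T,)."""
    # Score every encoder position against the decoder state.
    scores = np.tanh(enc_out @ W_k + dec_h @ W_q) @ v   # (T,)
    weights = softmax(scores)                           # sums to 1 over positions
    return weights @ enc_out, weights                   # weighted sum of features
```

At each step the LSTM consumes the context vector alongside the previous token, so the decoder attends to, e.g., the radicand region while emitting the tokens inside `\sqrt{...}`.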
2.4 Training Phase
During training, teacher forcing is applied: the ground‑truth token sequence is fed as input at each time step, preventing instability caused by using the model’s own predictions early in training.
Figure 5: Training phase illustration
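The teacher-forcing setup amounts to a simple shift of the label sequence: the decoder's input at step t is the ground-truth token t−1, and its target is token t. A minimal sketch (the token names `<s>`/`</s>` are assumed sentinels):

```python
def teacher_forcing_pairs(latex_tokens, bos="<s>", eos="</s>"):
    """Build (decoder_input, target) sequences for one training sample.
    At every step the decoder sees the ground-truth prefix, never its own
    (possibly wrong) earlier predictions."""
    seq = [bos] + list(latex_tokens) + [eos]
    return seq[:-1], seq[1:]  # input is the sequence shifted right by one
```

For the label `\frac { 1 } { 2 }`, the decoder input begins with `<s>` and the target ends with `</s>`, aligned one step apart.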
2.5 Inference Phase
During inference, the model generates tokens autoregressively, feeding each predicted token back as the next input. Decoding can be performed with Greedy Search (beam size = 1) or Beam Search, which spends extra computation to explore several candidate sequences and thereby avoid some locally greedy mistakes; an example with beam size = 3 is shown.
Figure 6: Inference phase illustration (beam search)
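A compact sketch of beam search over an abstract next-token model; `step_logprobs` is a hypothetical stand-in for the decoder's softmax output, and the toy numbers in the usage note are invented to show why a wider beam can beat greedy decoding:

```python
def beam_search(step_logprobs, beam_size=3, max_len=10, eos="</s>"):
    """step_logprobs(prefix) -> {token: log-prob of that token coming next}.
    Keeps the beam_size highest-scoring prefixes at every step."""
    beams = [([], 0.0)]  # (token list, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            if toks and toks[-1] == eos:
                candidates.append((toks, score))  # finished beams carry over
                continue
            for tok, lp in step_logprobs(toks).items():
                candidates.append((toks + [tok], score + lp))
        # Prune to the best beam_size hypotheses by total log-probability.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams[0][0]
```

With beam size 1 this degenerates to greedy search: a locally best first token can lead into a low-probability continuation, while a beam of 3 keeps the runner-up alive long enough to win on total sequence probability.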
2.6 Post‑Processing
Post‑processing corrects common confusions (e.g., “0” vs. “o”, “1” vs. “l”) using domain‑specific priors, improving accuracy by roughly 1% with negligible runtime overhead.
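A minimal sketch of such a rule, assuming one plausible prior (letters rarely sit directly between digits in a formula); the article's actual rules are not spelled out:

```python
def postprocess(tokens):
    """Fix common OCR confusions in a token sequence using context:
    'o' or 'l' adjacent to a digit is almost certainly 0 or 1."""
    fixed = []
    for i, t in enumerate(tokens):
        prev_digit = i > 0 and tokens[i - 1].isdigit()
        next_digit = i + 1 < len(tokens) and tokens[i + 1].isdigit()
        if t == "o" and (prev_digit or next_digit):
            t = "0"
        elif t == "l" and (prev_digit or next_digit):
            t = "1"
        fixed.append(t)
    return fixed
```

The rule is deliberately conservative: `l` and `o` surrounded by letters (as in `\log`) are left untouched, so only genuinely ambiguous digit contexts are rewritten.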
2.7 Recognition Results
The model converts formula images directly into LaTeX code; visual examples are shown where the generated LaTeX is rendered via XeLaTeX and ImageMagick for comparison.
Figure 7: Sample recognition results
3 Conclusion
The proposed formula recognition model effectively addresses two‑dimensional structure recognition, achieving an average accuracy above 95%. However, accuracy for highly complex formulas with long LaTeX sequences remains a challenge due to long‑range dependencies in decoding; future work will explore graph‑based models to better handle such cases.
New Oriental Technology