
Exploring Transformer Technology and Its Applications in NLP, Computer Vision, and OCR at Haodf.com

This article introduces the Transformer architecture, explains its attention mechanism, details its adaptations for natural language processing, computer vision, and OCR tasks, and presents experimental results of various models such as BERT, ELECTRA, Swin Transformer, and CRNN-BCN on large-scale medical data from Haodf.com.

HaoDF Tech Team

Transformer technology was first introduced by Google Brain in the paper "Attention is All You Need" as a sequence‑to‑sequence model for machine translation. Thanks to its powerful feature extraction capability, it quickly displaced recurrent neural networks as the dominant framework in natural language processing (NLP) and later achieved great success in computer vision (CV).

A key advantage of Transformers is the ability to pre‑train on massive amounts of unlabeled data, significantly improving downstream performance. Haodf.com possesses billions of doctor‑patient interaction texts and hundreds of millions of medical images, which can be leveraged using Transformers to provide more reliable assistance to patients.

The remainder of this article first introduces Transformer technology, then describes its practical exploration at Haodf.com, and finally summarizes its characteristics and future plans.

1. Introduction to Transformer Technology

Before Transformers, machine translation primarily used RNN/CNN encoders and decoders with attention mechanisms linking them. Google researchers argued that attention alone could capture all necessary features, eliminating the need for recurrent or convolutional components.

Unlike CNN/RNN, Transformers make no prior assumptions about data distribution; all parameters are learned from pairwise relationships between input elements, allowing strong performance on both image and text data.

This section first explains the attention mechanism, then outlines the overall Transformer architecture, and finally reviews its academic and industrial applications.

1.1 Attention Mechanism

In practice, attention is often produced by linear layers or U‑Net‑like structures, assigning importance weights to input features so that more important features have greater influence on predictions.

In NLP, attention captures the correlation between words. For example, when the pronoun "it" refers back to "justice" in a sentence, the two words receive high attention weights for each other.

In CV, attention indicates the importance of different image regions. Visual Question Answering (VQA) is a typical application where the model focuses on image areas most relevant to the posed question.

Attention is also used in OCR. The SRN algorithm outputs attention maps that highlight regions containing text.

1.2 Transformer Network Structure

1.2.1 Overall Architecture

The original Transformer consists of an encoder (left) and a decoder (right). Inputs are embedded, added to positional encodings, and fed into the encoder; the decoder receives previously generated outputs processed similarly.

Three main encoder‑decoder configurations are used:

Full encoder‑decoder architecture (commonly for machine translation).

Encoder‑only architecture (used for classification, sequence labeling, etc.).

Decoder‑only architecture (used for text generation).

1.2.2 Multi‑Head Attention Module

Given an input feature sequence a, three linear layers produce a query q_i, key k_i, and value v_i for each element a_i. Each query is compared with all keys to compute attention weights, which are used to weight the corresponding values. The outputs of all heads are concatenated to form the final multi‑head attention output. The decoder uses a masked variant to prevent attending to future tokens.
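The query/key/value flow described above can be sketched in plain NumPy. This is a minimal single‑example illustration, not a production implementation; the head count and dimensions are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(a, Wq, Wk, Wv, num_heads):
    """a: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_model) projection weights."""
    seq_len, d_model = a.shape
    d_head = d_model // num_heads
    # Linear projections produce a query, key, and value for every element a_i.
    q, k, v = a @ Wq, a @ Wk, a @ Wv
    # Split into heads: (num_heads, seq_len, d_head).
    split = lambda x: x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention within each head.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = weights @ v
    # Concatenate heads back to (seq_len, d_model).
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 8)) for _ in range(3)]
y = multi_head_attention(a, *W, num_heads=2)
assert y.shape == (5, 8)
```

The decoder's masked version would simply add a large negative value to the scores at future positions before the softmax.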

1.2.3 Encoder Feed‑Forward Module

This module applies a fully connected layer for feature extraction, adds a residual connection to preserve information, and uses layer normalization to accelerate optimization. It is a common feature‑refinement block.
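The feed‑forward block can be sketched as follows; a minimal NumPy version with an expand‑and‑project layer pair (the hidden width and weights here are arbitrary):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward_block(x, W1, b1, W2, b2):
    # Position-wise feed-forward: expand, apply ReLU, project back.
    h = np.maximum(x @ W1 + b1, 0.0)
    out = h @ W2 + b2
    # Residual connection preserves the input; layer norm accelerates optimization.
    return layer_norm(x + out)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
y = feed_forward_block(x, rng.normal(size=(8, 32)), np.zeros(32),
                       rng.normal(size=(32, 8)), np.zeros(8))
assert y.shape == x.shape
```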

1.3 Applications of Transformer Technology

Initially applied to machine translation, Transformers later faced over‑fitting on small datasets. Google introduced BERT, which pre‑trains on massive unlabeled text and fine‑tunes on downstream tasks. BERT’s fixed maximum length (usually 512) limits handling of long texts, leading to Transformer‑XL, which caches hidden states across segments to process longer sequences, albeit with higher memory and latency.
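The segment‑caching idea behind Transformer‑XL can be illustrated with a toy sketch. This omits Transformer‑XL's relative positional attention and multi‑layer memories; it only shows how cached hidden states from earlier segments extend the attention context:

```python
import numpy as np

def attend_with_memory(segment, memory):
    """Self-attention whose keys/values cover cached memory plus the current segment."""
    context = segment if memory is None else np.concatenate([memory, segment], axis=0)
    d = segment.shape[-1]
    scores = segment @ context.T / np.sqrt(d)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    weights = e / e.sum(-1, keepdims=True)
    return weights @ context

def process_long_text(segments, mem_len=4):
    memory, outputs = None, []
    for seg in segments:
        out = attend_with_memory(seg, memory)
        outputs.append(out)
        # Cache the last mem_len hidden states for the next segment
        # (in Transformer-XL no gradient flows back through this cache).
        memory = out[-mem_len:]
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
segments = [rng.normal(size=(6, 8)) for _ in range(3)]
out = process_long_text(segments)
assert out.shape == (18, 8)
```

The memory grows the effective context without re‑processing old tokens, which is also why inference latency and memory use increase.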

ELECTRA improves pre‑training efficiency with a generator‑discriminator setup: a small generator replaces some input tokens, and the discriminator predicts, for every token, whether it was replaced. Because the loss covers 100% of positions rather than the ~15% of tokens that BERT masks, ELECTRA achieves BERT‑level performance with smaller models.
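The replaced‑token‑detection data setup can be illustrated with a toy example. Here a random sample stands in for the small generator network, and the point is that every position gets a label, so the whole sequence contributes to the loss:

```python
import random

def make_electra_example(tokens, vocab, mask_rate=0.15, seed=0):
    """Build a replaced-token-detection example: corrupt some tokens, label all of them."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            # A small generator would propose a plausible replacement;
            # here we just sample a different word from the vocabulary.
            corrupted.append(rng.choice([w for w in vocab if w != tok]))
            labels.append(1)   # replaced
        else:
            corrupted.append(tok)
            labels.append(0)   # original
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = make_electra_example(tokens, vocab, mask_rate=0.3)
assert len(corrupted) == len(labels) == len(tokens)
```

The discriminator is then trained with a binary loss over all of these labels at once.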

In computer vision, Vision Transformer (ViT) first applied a pure Transformer to images by splitting them into fixed‑size patches and embedding each patch. Swin Transformer refined this idea with hierarchical, shifted‑window attention, making it suitable for dense prediction tasks such as object detection and semantic segmentation.
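ViT's patch‑splitting step can be sketched in NumPy. For a 224×224 RGB image and 16×16 patches this yields the familiar 14×14 = 196 tokens:

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into (num_patches, patch_size*patch_size*C) flat patches."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0
    # Carve the image into a grid of p x p tiles, then flatten each tile.
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img, 16)
assert tokens.shape == (196, 16 * 16 * 3)
# A learned linear projection then maps each flattened patch to the model dimension,
# and positional embeddings are added before the Transformer encoder.
```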

In OCR, Transformer‑based models (e.g., SRN, MASTER, ABINet, TrOCR, SVTR) replace LSTM encoders, enhance semantic correction, or directly process image patches. However, challenges remain for handling arbitrary image lengths and maintaining speed.

2. Practical Exploration of Transformer Technology at Haodf.com

2.1 Natural Language Processing

Haodf.com’s billions of doctor‑patient dialogues are used to train BERT for text classification, achieving a ten‑fold increase in AI‑assisted triage efficiency.

OCR‑extracted text from millions of uploaded medical images is also processed with BERT for various downstream tasks.

For report‑structuring, BERT performs named‑entity recognition to extract "test item", "value", and "range" fields, achieving an F1 score of 98% on the test set.
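After BERT tags each token, the tag sequence must be grouped into fields. A minimal BIO‑decoding sketch is shown below; the label names (ITEM, VALUE, RANGE) are illustrative, not the actual production schema:

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (field, text) entities, e.g. test item / value / range."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], tok]          # start a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1] += tok                 # continue the current entity
        else:
            if current:
                entities.append(current)
            current = None                    # "O" or inconsistent tag ends the entity
    if current:
        entities.append(current)
    return [tuple(e) for e in entities]

tokens = ["白", "细", "胞", "5.6", "3.5-9.5"]
tags = ["B-ITEM", "I-ITEM", "I-ITEM", "B-VALUE", "B-RANGE"]
assert decode_bio(tokens, tags) == [("ITEM", "白细胞"), ("VALUE", "5.6"), ("RANGE", "3.5-9.5")]
```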

Privacy redaction also uses BERT to locate sensitive fields, after which the text positions are mapped back to image regions for de‑identification.

Classification of report types (e.g., blood routine, liver function) with BERT reaches 95.6% accuracy on a random sample of 2,800 reports. ELECTRA‑tiny, pre‑trained on Haodf.com data, runs twice as fast but incurs a ~1.3% accuracy drop, suggesting further pre‑training could close the gap.

2.2 OCR Text Recognition

Transformer‑based OCR research focuses on three areas: replacing LSTM with Transformer for better sequence modeling, using Transformers for semantic correction, and directly applying Transformers to image data.

MASTER demonstrates Transformer replacing LSTM on English datasets, while SVTR (implemented in PyTorch) originally handled fixed‑length (280) image sequences. By switching to a fixed cosine positional encoding, SVTR can process arbitrary lengths, but performs best only on 512‑length inputs, limiting practical deployment.
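The fixed sine/cosine positional encoding mentioned above comes from the original Transformer paper; because it is a closed‑form function of position, it extends to any sequence length without retraining:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sine/cosine positional encoding; works for arbitrary sequence lengths."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    # Wavelengths form a geometric progression from 2*pi to 10000*2*pi.
    angles = pos / np.power(10000.0, 2 * i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)   # even dimensions
    enc[:, 1::2] = np.cos(angles)   # odd dimensions
    return enc

assert sinusoidal_positions(512, 64).shape == (512, 64)
```

The length mismatch issue the paragraph describes remains: a model trained mostly on 512‑length inputs still sees a distribution shift at other lengths, even though the encoding itself generalizes.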

SRN, a Baidu‑proposed attention‑based OCR model, replaces LSTM with Transformer for feature extraction and semantic correction. On Haodf.com's dataset of 7 million mixed‑language text images, SRN's character accuracy is ~2% lower than the existing CRNN baseline, mainly due to attention drift on long Chinese texts.

ABINet’s BCN semantic correction network outperforms SRN, but its attention‑based visual alignment limits Chinese OCR performance. Therefore, we retain a CRNN+CTC visual backbone and design an adaptive matrix A to transform CTC outputs into a format compatible with BCN, enabling end‑to‑end training without gradient issues. The resulting CRNN‑BCN model (ResNet‑34 backbone) fuses visual and semantic results via gated units.
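The exact form of the adaptive matrix A is not spelled out in this article; the sketch below shows one way a differentiable frame‑to‑character mapping could look, using per‑frame non‑blank confidence from the CTC output to place soft character slots. Treat it purely as a hypothetical illustration of the idea:

```python
import numpy as np

def soft_align(ctc_probs, max_chars):
    """Hypothetical soft alignment: map T frame-level CTC distributions to
    max_chars character slots via a differentiable matrix A, instead of
    hard (non-differentiable) CTC decoding."""
    conf = 1.0 - ctc_probs[:, 0]            # non-blank confidence (blank at index 0)
    # Cumulative confidence assigns each frame a fractional character position.
    pos = np.cumsum(conf) - conf / 2
    slots = np.arange(max_chars)[:, None]
    # Gaussian-shaped weights form A: (max_chars, T), each row normalized.
    A = np.exp(-((slots - pos[None, :]) ** 2))
    A = A / (A.sum(axis=1, keepdims=True) + 1e-8)
    return A @ ctc_probs                    # character-level distributions for the BCN

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=12)  # 12 frames, 5 classes (class 0 = blank)
char_probs = soft_align(probs, max_chars=6)
assert char_probs.shape == (6, 5)
```

Because every operation here is differentiable, gradients can flow from the semantic branch back into the visual backbone, which is the property the end‑to‑end CRNN‑BCN training relies on.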

Experimental results on the hand‑labeled CRNN_real_test_all dataset (7,371 images) show CRNN‑BCN achieving 98.68% character accuracy and 95.08% string accuracy, surpassing the baseline CRNN while maintaining comparable inference speed.

Table 2.2.1 Algorithm comparison results

| Model | Char Acc / String Acc on CRNN_real_test_all, 7,371 images (%) | Speed (frames/s) |
| --- | --- | --- |
| CRNN | 98.12 / 92.61 | 66.41 |
| CRNN‑BCN | 98.68 / 95.08 | 57.6 |
| CRNN‑BCN (visual only) | 98.41 / 93.77 | 66.41 |

2.3 Image Classification

Unlabeled medical images (radiology, body parts, reports) are automatically labeled using deep learning models. Swin Transformer variants were evaluated for image classification.

Four Swin models (tiny, small, base, large) were trained on a medical image dataset. Swin‑large offered no clear accuracy advantage over Swin‑base, while Swin‑small did not improve inference speed. Consequently, Swin‑tiny and Swin‑base were selected for further experiments.

Table 2.3.1 Model prediction speed

| Model | Image Size | Time per Image (s) |
| --- | --- | --- |
| Swin‑tiny | 224 | 0.0138 |
| Swin‑base | 224 | 0.0240 |
| EfficientNet‑B4 | 224 | 0.0233 |

On coarse image classification, Swin‑tiny achieved slightly lower accuracy but half the inference time of the other models. Swin‑base performed best overall, comparable to EfficientNet‑B4, with similar GPU usage.

Table 2.3.2 Coarse classification results

| Model | Test Set Accuracy (%) | Online Sample Accuracy (%) |
| --- | --- | --- |
| Swin‑tiny | 99.26 | 97.65 |
| Swin‑base | 99.17 | 98.32 |
| EfficientNet‑B4 | 99.26 | 98.24 |

For medical image type and body part recognition, multi‑task training with Swin Transformer yielded strong results, demonstrating the model’s suitability for joint classification tasks.
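The multi‑task setup can be sketched as a shared backbone with two classification heads and a weighted joint loss. The head sizes and loss weight below are illustrative assumptions, not the production configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_task_forward(features, W_type, W_part):
    """Shared backbone features feed two independent classification heads."""
    return softmax(features @ W_type), softmax(features @ W_part)

def multi_task_loss(p_type, p_part, y_type, y_part, alpha=0.5):
    # Weighted sum of the two cross-entropy losses; alpha balances the tasks.
    ce = lambda p, y: -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    return alpha * ce(p_type, y_type) + (1 - alpha) * ce(p_part, y_part)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))                       # backbone output for 4 images
W_type, W_part = rng.normal(size=(16, 3)), rng.normal(size=(16, 5))
p_type, p_part = multi_task_forward(feats, W_type, W_part)
loss = multi_task_loss(p_type, p_part, np.array([0, 1, 2, 0]), np.array([1, 0, 4, 2]))
assert p_type.shape == (4, 3) and loss > 0
```

Sharing the backbone lets the two related tasks regularize each other, which is consistent with the joint type/part results reported below.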

Table 2.3.3 Medical image type and part results

| Model | Type Test Acc (%) | Part Test Acc (%) | Online Type Acc (%) | Online Part Acc (%) |
| --- | --- | --- | --- | --- |
| Swin‑tiny | 95.79 | 92.44 | 93.14 | 85.47 |
| Swin‑base | 96.91 | 94.01 | 94.72 | 88.83 |
| EfficientNet‑B4 | 96.14 | 93.10 | 93.80 | 86.73 |

Table 2.3.4 Body part dataset results

| Model | Test Set Accuracy (%) | Online Sample Accuracy (%) |
| --- | --- | --- |
| Swin‑tiny | 94.07 | 91.62 |
| Swin‑base | 95.72 | 93.80 |
| EfficientNet‑B4 | 93.61 | 92.68 |

Overall, Swin Transformer demonstrates strong performance for image tasks, and we plan to adopt it for various business needs.

3. Conclusion

3.1 Advantages and Limitations of Transformers

Unlike CNNs and RNNs, Transformers make no prior assumptions about data distribution, learning all feature relationships from pairwise element interactions. This flexibility enables strong performance across modalities but also makes Transformers prone to over‑fitting on small datasets.

Pre‑training on massive unlabeled data mitigates over‑fitting, yet the required hardware resources remain a barrier.

In NLP, Transformers capture long‑range dependencies without distance constraints, whereas RNNs struggle with distant relationships. In CV, Vision Transformers aggregate global information early, unlike CNNs that focus on local patterns.

3.2 Future Plans

We aim to replace the older BERT pre‑training with faster, more accurate ELECTRA models, leveraging Haodf.com’s extensive doctor‑patient dialogue corpus for further pre‑training.

In OCR, we will continue monitoring academic advances in pure‑image Transformers and prepare to integrate promising methods once they become robust enough for production.

For computer vision, we will explore newer Transformer‑based models (e.g., Swin V2, DETR, ColTran) for tasks such as detection, segmentation, and image generation, building on our existing EfficientNet‑B4 baseline.

References

[1] Ashish Vaswani et al., "Attention is All You Need", NeurIPS 2017.

[2] Fukui A. et al., "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding", 2016.

[3] Yu D. et al., "Towards Accurate Scene Text Recognition With Semantic Reasoning Networks", CVPR 2020.

[4] Devlin J. et al., "BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding", 2018.

[5] Dai Z. et al., "Transformer‑XL: Attentive Language Models beyond a Fixed‑Length Context", 2019.

[6] Clark K. et al., "ELECTRA: Pre‑training Text Encoders as Discriminators Rather Than Generators", 2020.

[7] Lin T. et al., "A Survey of Transformers", 2021.

[8] Dosovitskiy A. et al., "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale", 2020.

[9] Lu N. et al., "MASTER: Multi‑Aspect Non‑local Network for Scene Text Recognition", 2019.

[10] Shi B. et al., "An End‑to‑End Trainable Neural Network for Image‑Based Sequence Recognition and Its Application to Scene Text Recognition", 2016.

[11] Liu Z. et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", 2021.

[12] Liu Z. et al., "Swin Transformer V2: Scaling Up Capacity and Resolution", 2021.

[13] Raghu M. et al., "Do Vision Transformers See Like Convolutional Neural Networks?", 2021.

[14] Carion N. et al., "End‑to‑End Object Detection with Transformers", 2020.

[15] Doersch C. et al., "CrossTransformers: Spatially‑aware Few‑shot Transfer", 2020.

[16] Kumar M. et al., "Colorization Transformer", 2021.

[17] Plizzari C. et al., "Spatial Temporal Transformer Network for Skeleton‑based Action Recognition", 2020.

[18] Redmon J. et al., "YOLOv3: An Incremental Improvement", 2018.

[19] Tan M. & Le Q. V., "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", 2019.

[20] Fang S. et al., "Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition", 2021.

[21] Li M. et al., "TrOCR: Transformer‑based Optical Character Recognition with Pre‑trained Models", 2021.

[22] Du Y. et al., "SVTR: Scene Text Recognition with a Single Visual Model", 2022.
