
WeChat OCR: Implementation of Image Text Extraction Feature

WeChat's 8.0 update introduced an OCR pipeline that first quickly detects whether an image contains text, then classifies the image type, localizes text with a lightweight MobileNetV3-based DBNet detector, recognizes it with a multi-task CTC/Attention model, and finally merges the results through a rule-based layout analyzer to deliver accurate, well-formatted extracted text across diverse languages and document types.

Tencent Cloud Developer

In January 2021, WeChat released version 8.0, introducing an image text extraction feature that lets users extract text from images in chats and Moments by long-pressing. This article describes how WeChat's OCR capability was built for this text-extraction feature.

Background and Challenges:

The implementation faced several key challenges: (1) Determining whether an image contains text at all, amid diverse content such as products, people, landscapes, and vehicles; (2) Classifying text image types, including ID documents, handwritten text, and tables, to select the appropriate recognition model; (3) Optimizing general recognition algorithms to balance accuracy and efficiency; (4) Performing layout analysis to merge recognized text lines into readable paragraphs.

Overall Solution:

The solution consists of four main modules: (1) Fast text detection module to quickly determine if text exists in images and trigger the extraction entry; (2) Text image classification module to identify whether images are specialized documents or general text; (3) General text recognition including text detection (localizing text regions) and text recognition (identifying text content); (4) Layout analysis module to arrange recognized text in a readable format.
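The four-module flow above can be sketched as a simple orchestration function. This is a minimal illustrative skeleton, not WeChat's actual code: every function name and the dict-based image stand-in are assumptions made for the example.

```python
# Hypothetical sketch of the four-stage pipeline described above.
# The "image" here is a plain dict standing in for a real decoded image.

def has_text(image) -> bool:
    """Stage 1: fast text detection -- a cheap check that gates the feature."""
    return bool(image.get("text_lines"))

def classify_image(image) -> str:
    """Stage 2: route to a specialized model (ID card, table, ...) or general OCR."""
    return image.get("kind", "general")

def recognize(image, kind: str):
    """Stage 3: detect text regions, then recognize the characters in each.
    (Toy stand-in: uppercases the stored lines.)"""
    return [line.upper() for line in image.get("text_lines", [])]

def layout(lines):
    """Stage 4: merge recognized lines into readable paragraphs."""
    return "\n".join(lines)

def extract_text(image):
    if not has_text(image):          # skip images with no text at all
        return None
    kind = classify_image(image)
    return layout(recognize(image, kind))

result = extract_text({"kind": "general", "text_lines": ["hello", "world"]})
```

The key design point is that stage 1 is far cheaper than stages 3 and 4, so the expensive recognition path only runs on images that actually contain text.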

Key Technical Implementations:

For fast text detection, a lightweight multi-language text classification network was developed, supporting Latin (English), Chinese, Japanese, Korean, Thai, Russian, and Vietnamese. The module uses an ultra-lightweight CNN with an average processing time of roughly 80 ms on mobile devices.
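A classifier like this typically ends in a softmax head with one output per supported script. The sketch below shows only that final decision step, assuming the backbone has already produced per-language scores; the scores and the `predict_language` helper are made up for illustration.

```python
import math

# The seven scripts the article says the fast-detection network supports.
LANGS = ["latin", "chinese", "japanese", "korean", "thai", "russian", "vietnamese"]

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_language(scores):
    """Turn raw per-language scores from the CNN head into a decision."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LANGS[best], probs[best]

lang, prob = predict_language([0.1, 3.2, 0.5, 0.2, 0.0, 0.1, 0.3])
```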

For text detection, the segmentation-based DBNet algorithm was adopted. The backbone is MobileNetV3, trained with model distillation (ResNet50 as the teacher, MobileNetV3 as the student), which improved performance by 1 point. TensorRT deployment achieves an average processing time under 30 ms on T4 machines.
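One common way to distill a segmentation model is response-based: the student is trained to match the teacher's probability map in addition to the ground truth. The sketch below assumes that setup; the MSE objective, flat-list maps, and `alpha` weighting are illustrative choices, not details from the article.

```python
# Toy sketch of response-based distillation for the detection head:
# the MobileNetV3 student fits both the ground-truth map and the
# ResNet50 teacher's soft probability map.

def mse(a, b):
    """Mean squared error between two equal-length flat probability maps."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_map, teacher_map, gt_map, alpha=0.5):
    hard = mse(student_map, gt_map)       # supervised segmentation loss
    soft = mse(student_map, teacher_map)  # mimic the stronger teacher
    return (1 - alpha) * hard + alpha * soft
```

The teacher runs only at training time; at inference, only the small MobileNetV3 student is deployed, which is what makes the sub-30 ms latency achievable.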

For text recognition, a multi-task model combining CTC, Attention, and ACE decoding was developed. Data-synthesis tools, including TextRender and StyleText, were used to generate training data. Focal loss and center loss were added to address rare characters and similar-looking characters, improving accuracy by 2-3 points on similar-character test sets.
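Focal loss helps with rare characters because it down-weights examples the model already gets right, so the gradient budget shifts toward hard, infrequent classes. A minimal single-prediction sketch, with `gamma=2.0` as the common default rather than a value stated in the article:

```python
import math

def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """Focal loss given the probability assigned to the correct class.

    The (1 - p)^gamma factor shrinks the loss for confident, easy
    predictions, leaving rare/hard characters to dominate training.
    """
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)

easy = focal_loss(0.95)  # confident prediction on a common character
hard = focal_loss(0.30)  # uncertain prediction on a rare character
```

With `gamma=0` this reduces to ordinary cross-entropy; increasing `gamma` sharpens the focus on hard examples.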

For layout analysis, a self-developed method based on geometric rules with DFS text-box merging was adopted, offering greater flexibility and faster bad-case fixes than a learned model.
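The idea behind DFS merging is to treat text boxes as graph nodes, link boxes whose geometry says they belong to the same paragraph, and let each connected component become one merged block. A minimal sketch under assumed rules; the overlap/gap thresholds are illustrative, not WeChat's actual values.

```python
# Sketch of geometric-rule layout merging: boxes with enough vertical
# overlap and a small enough horizontal gap are linked, and DFS collects
# each connected component into one paragraph group.

def mergeable(a, b, max_gap=20, min_overlap=0.5):
    """a, b are (x0, y0, x1, y1) axis-aligned boxes."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])   # shared vertical extent
    height = min(a[3] - a[1], b[3] - b[1])        # shorter box's height
    gap = max(a[0], b[0]) - min(a[2], b[2])       # horizontal gap (<=0 if overlapping)
    return overlap / height >= min_overlap and gap <= max_gap

def merge_boxes(boxes):
    """Return groups of box indices, one group per connected component."""
    seen, groups = set(), []

    def dfs(i, group):
        seen.add(i)
        group.append(i)
        for j in range(len(boxes)):
            if j not in seen and mergeable(boxes[i], boxes[j]):
                dfs(j, group)

    for i in range(len(boxes)):
        if i not in seen:
            group = []
            dfs(i, group)
            groups.append(sorted(group))
    return groups
```

Because the rules are plain code rather than learned weights, a bad case can be fixed by adjusting a threshold or adding a rule, with no retraining, which is the flexibility advantage the article cites.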

Results:

The solution offers advantages including vertical text recognition, precise ID image extraction, better layout formatting, and filtering of meaningless text.

Tags: computer vision, deep learning, OCR, layout analysis, WeChat, text detection, text recognition, DBNet, Optical Character Recognition
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
