
WeChat OCR: Implementation of Image Text Extraction Feature

WeChat's 8.0 update introduced an OCR pipeline that first quickly detects whether an image contains text, then classifies the image type, localizes text with a lightweight MobileNetV3-based DBNet detector, recognizes it with a multi-task CTC/Attention model, and finally merges the results through a rule-based layout analyzer to deliver accurate, well-formatted extracted text across diverse languages and document types.

Tencent Cloud Developer

In January 2021, WeChat released version 8.0, introducing an image text extraction feature that lets users extract text from images in chats and Moments by long-pressing. This article describes how WeChat's OCR capability was built for this text-extraction feature.

Background and Challenges:

The implementation faced several key challenges: (1) Determining whether an image contains text at all, amid diverse content such as products, people, landscapes, and vehicles; (2) Classifying text image types, including ID documents, handwritten text, and tables, to select the appropriate recognition model; (3) Optimizing general recognition algorithms to balance accuracy and efficiency; (4) Performing layout analysis to merge recognized text lines into readable paragraphs.

Overall Solution:

The solution consists of four main modules: (1) Fast text detection module to quickly determine if text exists in images and trigger the extraction entry; (2) Text image classification module to identify whether images are specialized documents or general text; (3) General text recognition including text detection (localizing text regions) and text recognition (identifying text content); (4) Layout analysis module to arrange recognized text in a readable format.
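The four-module flow above can be sketched as a simple orchestration function. This is a minimal illustrative skeleton, not WeChat's actual code: every function name and the dict-based image stand-in are assumptions made for the example.

```python
# Hypothetical sketch of the four-stage pipeline described above.
# The "image" here is a plain dict standing in for a real decoded image.

def has_text(image) -> bool:
    """Stage 1: fast text detection -- a cheap check that gates the feature."""
    return bool(image.get("text_lines"))

def classify_image(image) -> str:
    """Stage 2: route to a specialized model (ID card, table, ...) or general OCR."""
    return image.get("kind", "general")

def recognize(image, kind: str):
    """Stage 3: detect text regions, then recognize the characters in each.
    (Toy stand-in: uppercases the stored lines.)"""
    return [line.upper() for line in image.get("text_lines", [])]

def layout(lines):
    """Stage 4: merge recognized lines into readable paragraphs."""
    return "\n".join(lines)

def extract_text(image):
    if not has_text(image):          # skip images with no text at all
        return None
    kind = classify_image(image)
    return layout(recognize(image, kind))

result = extract_text({"kind": "general", "text_lines": ["hello", "world"]})
```

The key design point is that stage 1 is far cheaper than stages 3 and 4, so the expensive recognition path only runs on images that actually contain text.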

Key Technical Implementations:

For fast text detection, a lightweight multi-language text classification network was developed, supporting Latin (English), Chinese, Japanese, Korean, Thai, Russian, and Vietnamese. The module uses an ultra-lightweight CNN with an average processing time of roughly 80 ms on mobile devices.
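A classifier like this typically ends in a softmax head with one output per supported script. The sketch below shows only that final decision step, assuming the backbone has already produced per-language scores; the scores and the `predict_language` helper are made up for illustration.

```python
import math

# The seven scripts the article says the fast-detection network supports.
LANGS = ["latin", "chinese", "japanese", "korean", "thai", "russian", "vietnamese"]

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_language(scores):
    """Turn raw per-language scores from the CNN head into a decision."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LANGS[best], probs[best]

lang, prob = predict_language([0.1, 3.2, 0.5, 0.2, 0.0, 0.1, 0.3])
```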

For text detection, the segmentation-based DBNet algorithm was adopted. The backbone is MobileNetV3, trained with model distillation (ResNet50 as the teacher, MobileNetV3 as the student), which improved performance by 1 point. TensorRT deployment achieves an average processing time under 30 ms on T4 machines.
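One common way to distill a segmentation model is response-based: the student is trained to match the teacher's probability map in addition to the ground truth. The sketch below assumes that setup; the MSE objective, flat-list maps, and `alpha` weighting are illustrative choices, not details from the article.

```python
# Toy sketch of response-based distillation for the detection head:
# the MobileNetV3 student fits both the ground-truth map and the
# ResNet50 teacher's soft probability map.

def mse(a, b):
    """Mean squared error between two equal-length flat probability maps."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_map, teacher_map, gt_map, alpha=0.5):
    hard = mse(student_map, gt_map)       # supervised segmentation loss
    soft = mse(student_map, teacher_map)  # mimic the stronger teacher
    return (1 - alpha) * hard + alpha * soft
```

The teacher runs only at training time; at inference, only the small MobileNetV3 student is deployed, which is what makes the sub-30 ms latency achievable.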

For text recognition, a multi-task model combining CTC, Attention, and ACE decoding was developed. Data-synthesis tools, including TextRender and StyleText, were used to generate training data. Focal loss and center loss were added to address rare characters and similar-looking characters, improving accuracy by 2-3 points on similar-character test sets.
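Focal loss helps with rare characters because it down-weights examples the model already gets right, so the gradient budget shifts toward hard, infrequent classes. A minimal single-prediction sketch, with `gamma=2.0` as the common default rather than a value stated in the article:

```python
import math

def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """Focal loss given the probability assigned to the correct class.

    The (1 - p)^gamma factor shrinks the loss for confident, easy
    predictions, leaving rare/hard characters to dominate training.
    """
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)

easy = focal_loss(0.95)  # confident prediction on a common character
hard = focal_loss(0.30)  # uncertain prediction on a rare character
```

With `gamma=0` this reduces to ordinary cross-entropy; increasing `gamma` sharpens the focus on hard examples.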

For layout analysis, a self-developed method based on geometric rules with DFS text-box merging was adopted, offering greater flexibility and faster bad-case fixes than a learned model.
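The idea behind DFS merging is to treat text boxes as graph nodes, link boxes whose geometry says they belong to the same paragraph, and let each connected component become one merged block. A minimal sketch under assumed rules; the overlap/gap thresholds are illustrative, not WeChat's actual values.

```python
# Sketch of geometric-rule layout merging: boxes with enough vertical
# overlap and a small enough horizontal gap are linked, and DFS collects
# each connected component into one paragraph group.

def mergeable(a, b, max_gap=20, min_overlap=0.5):
    """a, b are (x0, y0, x1, y1) axis-aligned boxes."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])   # shared vertical extent
    height = min(a[3] - a[1], b[3] - b[1])        # shorter box's height
    gap = max(a[0], b[0]) - min(a[2], b[2])       # horizontal gap (<=0 if overlapping)
    return overlap / height >= min_overlap and gap <= max_gap

def merge_boxes(boxes):
    """Return groups of box indices, one group per connected component."""
    seen, groups = set(), []

    def dfs(i, group):
        seen.add(i)
        group.append(i)
        for j in range(len(boxes)):
            if j not in seen and mergeable(boxes[i], boxes[j]):
                dfs(j, group)

    for i in range(len(boxes)):
        if i not in seen:
            group = []
            dfs(i, group)
            groups.append(sorted(group))
    return groups
```

Because the rules are plain code rather than learned weights, a bad case can be fixed by adjusting a threshold or adding a rule, with no retraining, which is the flexibility advantage the article cites.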

Results:

The solution offers advantages including vertical text recognition, precise ID image extraction, better layout formatting, and filtering of meaningless text.

Tags: computer vision, deep learning, OCR, layout analysis, WeChat, text detection, text recognition, DBNet, Optical Character Recognition
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
