
How Deep Learning Powers Text Detection in E‑commerce Posters

This article surveys state‑of‑the‑art deep‑learning techniques for scene text detection and recognition in e‑commerce poster images, detailing models such as CTPN, TextBoxes, SegLink, EAST, and end‑to‑end frameworks, and discusses their architectures, strengths, limitations, and future challenges.

JD Retail Technology

Apr 2, 2020

Text Detection

Text detection locates text regions in images, a specialized object‑detection problem with extreme aspect‑ratio and orientation variations.

CTPN

The Connectionist Text Proposal Network (CTPN) uses a VGG‑16 backbone up to conv5 to produce a feature map of size N×W×H×C. A dense 3×3 sliding window extracts a 3×3×C vector at each spatial location, and the vectors of each row are fed as a sequence into a bidirectional LSTM to capture horizontal context. A fully‑connected layer then predicts, for each of k fixed‑width vertical anchors (width 16 px, heights ranging from 11 to 273 px):

vertical centre‑y and height offsets,

text/non‑text confidence scores,

side‑refinement offsets.

CTPN works well for horizontally aligned text but cannot handle rotated text.
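
As a rough illustration of the anchor scheme described above, the sketch below generates the k fixed‑width anchors at one location and decodes the vertical offsets. The values k = 10 and the 0.7 height ratio follow the CTPN paper's description rather than this article, the offset parameterisation is likewise an assumption, and the names `ctpn_anchors` and `decode_vertical` are hypothetical. It is a minimal sketch, not a reference implementation.

```python
import math


def ctpn_anchors(center_x, center_y, k=10, width=16.0, min_height=11.0, ratio=0.7):
    """Return k (x1, y1, x2, y2) anchors centred on one location.

    Assumption: heights grow geometrically, h_i = min_height / ratio**i,
    which spans roughly 11 px to 273 px for k = 10.
    """
    anchors = []
    for i in range(k):
        height = min_height / (ratio ** i)
        anchors.append((
            center_x - width / 2.0,
            center_y - height / 2.0,
            center_x + width / 2.0,
            center_y + height / 2.0,
        ))
    return anchors


def decode_vertical(anchor, v_c, v_h):
    """Turn predicted offsets (v_c, v_h) into an absolute centre-y and height.

    Assumed parameterisation: v_c = (cy - cy_a) / h_a and v_h = log(h / h_a),
    where cy_a and h_a are the anchor's centre-y and height.
    """
    _, y1, _, y2 = anchor
    anchor_h = y2 - y1
    anchor_cy = (y1 + y2) / 2.0
    center_y = v_c * anchor_h + anchor_cy
    height = math.exp(v_h) * anchor_h
    return center_y, height


anchors = ctpn_anchors(center_x=8.0, center_y=100.0)
heights = [round(y2 - y1) for _, y1, _, y2 in anchors]
print(heights)
```

Because every anchor has the same width, the network only has to regress the vertical position and height of each slice; the horizontal extent of a text line comes from chaining neighbouring slices together.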

TextBoxes and TextBoxes++

Both models extend the single‑stage SSD detector. TextBoxes replaces the standard 3×3 kernels with 1×5 kernels to better fit elongated boxes and defines six default aspect ratios (1, 2, 3, 5, 7, 10) with a vertical offset to increase density. TextBoxes++ adds support for arbitrary orientations by:

Extending the default aspect ratios to 1, 2, 3, 5, 1/2, 1/3, 1/5,

Using 3×5 kernels in the text‑box layers,

Predicting rotated boxes (RBox) and quadrangles (QUAD) in addition to horizontal box offsets.

SegLink

SegLink decomposes detection into segments (oriented boxes, each covering part of a word or text line) and links (connections between adjacent segments belonging to the same word or line). The backbone mirrors SSD, with six convolutional feature layers that predict multi‑scale segment boxes in RBox format. Two link modules are employed:

Within‑layer link detection connects segments on the same feature map,

Cross‑layer link detection connects segments on adjacent feature maps.

EAST

EAST (Efficient and Accurate Scene Text) is a fully convolutional, U‑Net‑like detector. It outputs a single‑channel score map and multi‑channel geometry maps encoding either rotated boxes (RBox) or quadrangles (QUAD). Feature merging combines multi‑scale features so that text of different sizes can be detected. Overlapping candidates are merged with locality‑aware NMS, which runs in O(n) in the best case. The loss consists of a class‑balanced cross‑entropy term for the score map and an IoU loss for the geometry maps.

Text Recognition

Four common strategies are surveyed.

CNN+Softmax

A convolutional backbone extracts a feature tensor H. The tensor is split into N+1 positions (one per character plus a blank), and each position is classified by an independent softmax layer. This simple pipeline handles variable‑length sequences but lacks contextual modeling, making it suitable only for short, fixed‑alphabet tasks.
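
The per‑position classification can be sketched in a few lines. The example assumes the network has already produced one score vector per position and that the last class index stands for the blank; the function name `decode_fixed_positions` and the toy alphabet are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_fixed_positions(position_scores, alphabet):
    """Classify each position independently and drop blank predictions.

    position_scores: one list of len(alphabet) + 1 raw scores per position.
    Assumption: the last class index (len(alphabet)) is the blank.
    """
    blank = len(alphabet)
    chars = []
    for scores in position_scores:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != blank:
            chars.append(alphabet[best])
    return "".join(chars)


# Three positions over the alphabet "abc"; the third position predicts blank.
scores = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 9.0],
]
print(decode_fixed_positions(scores, "abc"))  # ac
```

Each position is decided without looking at its neighbours, which is exactly why this approach has no contextual modeling.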

CNN+RNN+CTC (CRNN)

The image is processed by a CNN to obtain a feature map, which is reshaped into a sequential representation (time steps correspond to positions along the image width). A bidirectional RNN (typically an LSTM) encodes the sequence, and a CTC loss aligns the RNN outputs with the target label sequence, allowing end‑to‑end training without explicit character segmentation.

CNN+RNN+Attention

An encoder‑decoder architecture adds an attention mechanism. The CNN extracts visual features, the encoder (a bidirectional LSTM) processes the feature sequence, and the decoder (an LSTM) attends to the relevant encoder states at each decoding step, producing characters one by one. This improves recognition of long or complex strings.

CNN+Stacked CNN+CTC

To avoid the sequential bottleneck of RNNs, this design stacks convolutional layers (dense connections with residual attention modules) to model contextual dependencies. The pipeline consists of:

Attention feature encoder: a densely connected CNN with residual attention extracts a sequence of feature vectors, each linked to a local image region.

Convolutional sequence modeling: several stacked convolutional layers enlarge the receptive field, producing a hierarchical representation of the sequence.

CTC decoder: the final sequence is decoded by a CTC layer.

End‑to‑End Text Spotting

Towards End‑to‑End Text Spotting with Convolutional Recurrent Neural Networks

A single image is processed to output both bounding boxes and transcriptions. The architecture includes:

VGG‑16 backbone: extracts a convolutional feature map.

Text Proposal Network (TPN): generates candidate text regions.

Region Feature Encoder (RFE): resamples each proposal according to its aspect ratio, then encodes the resampled feature map with a bidirectional RNN into a fixed‑length vector.

Text Detection Network (TDN): predicts a textness score and box offsets from the vector.

Text Recognition Network (TRN): decodes the fixed‑length vector (e.g., with CTC) to produce the character sequence.
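
CTC decoding, mentioned several times above, is often approximated at inference time by greedy (best‑path) decoding: take the most likely class at each time step, collapse consecutive repeats, then drop blanks. The sketch below assumes class index 0 is the blank and uses a hypothetical helper named `ctc_greedy_decode`; beam search or lexicon‑constrained decoding are common alternatives that it does not cover.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding over per-time-step class scores.

    frame_scores: one list of len(alphabet) + 1 scores per time step.
    Assumption: index `blank` (0 here) is the CTC blank, and character
    alphabet[i] is stored at class index i + 1.
    """
    # Most likely class index at every time step (the "best path").
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]

    decoded = []
    previous = None
    for index in best_path:
        # Keep a symbol only when it differs from the previous step
        # and is not the blank; a blank between repeats separates them.
        if index != previous and index != blank:
            decoded.append(alphabet[index - 1])
        previous = index
    return "".join(decoded)


# Toy best path "h h e l _ l o o" encoded as one-hot score vectors.
path = [1, 1, 2, 3, 0, 3, 4, 4]
frames = [[1.0 if c == p else 0.0 for c in range(5)] for p in path]
print(ctc_greedy_decode(frames, "helo"))  # hello
```

The blank between the two "l" frames is what lets CTC emit a doubled letter; without it the repeats would collapse into a single character.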

STN‑OCR

STN‑OCR (Spatial Transformer Network OCR) is trained with image‑level transcription labels only, so the detection stage learns without bounding‑box annotations. Detection consists of three components:

Localization Network: predicts N affine transformation matrices.

Grid Generator: creates N sampling grids from the matrices.

Image Sampling: extracts the N regions using bilinear interpolation.

The recognition stage applies a standard CNN+Softmax classifier to each sampled region.

FOTS (Fast Oriented Text Spotting)

In FOTS, the detection and recognition branches share a ResNet‑50 backbone with a Feature Pyramid Network (FPN), which produces multi‑scale feature maps at 1/4, 1/8, 1/16 and 1/32 of the input size. The detection branch (inspired by EAST) predicts a textness score and geometry (rotated box) for each spatial location. RoIRotate aligns each rotated proposal to an axis‑aligned rectangle via an affine transform. The recognition branch, which follows the CRNN design, encodes the aligned features with a bidirectional RNN and decodes them with CTC.

Challenges in E‑commerce Poster Scenarios

Text in real‑world e‑commerce posters exhibits extreme variability: multiple languages, diverse fonts, arbitrary rotations, and complex layouts, along with interference from logos, patterns, noise, blur, low resolution, distortion, and occlusion. Robust detection‑recognition pipelines must handle these factors while satisfying real‑time latency requirements.
