Efficient Scene Text Detection Framework with Feature Pyramid and Expanded High-Level Feature Maps

The paper presents an efficient scene-text detector that expands the high-level feature maps of SSD and integrates a feature pyramid network. Direction-aware segment and link predictions let it reconstruct arbitrarily long, rotated text. The authors report higher recall and precision at real-time speed, with better results than recent methods on the ICDAR benchmarks and on a menu-recognition test.

Meituan Technology Team

Scene text detection is crucial for many applications but remains challenging due to large variations in aspect ratio, scale, and orientation.

This work proposes an efficient detection framework that combines expanded high-level feature maps with a feature pyramid network (FPN) built on top of an SSD backbone. Text lines are decomposed into small, direction-aware segments. For each segment, the network also predicts links to its 8 neighbors, indicating which adjacent segments belong to the same text line, so that arbitrarily long and rotated text can be reconstructed from the linked segments.
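The grouping step implied by the segment and link description can be sketched as a connected-components problem. The following is a minimal illustration, not the authors' implementation: the grid layout, the `link_on` dictionary, and the function names are all assumptions made for this example, and only same-level 8-neighbor links are modeled.

```python
from collections import defaultdict

# The eight neighbor offsets (dy, dx) around a grid cell, excluding the cell itself.
NEIGHBOR_OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0)]


def links_from_grid(cells, link_on):
    """Turn per-cell link flags into index pairs (hypothetical data layout).

    cells:   list of (row, col) positions whose segment was classified as text.
    link_on: dict mapping (row, col) to 8 booleans, ordered as NEIGHBOR_OFFSETS.
    """
    index = {cell: i for i, cell in enumerate(cells)}
    links = []
    for (r, c), i in index.items():
        flags = link_on.get((r, c), [False] * 8)
        for (dy, dx), on in zip(NEIGHBOR_OFFSETS, flags):
            j = index.get((r + dy, c + dx))
            # Keep a link only if it is active and the neighbor is also a text segment.
            if on and j is not None:
                links.append((i, j))
    return links


def group_segments(num_segments, links):
    """Group segment indices into connected components with union-find."""
    parent = list(range(num_segments))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = defaultdict(list)
    for idx in range(num_segments):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Three horizontally adjacent segments linked left to right, plus one isolated segment.
cells = [(0, 0), (0, 1), (0, 2), (2, 2)]
right = [off == (0, 1) for off in NEIGHBOR_OFFSETS]
link_on = {(0, 0): right, (0, 1): right}
groups = group_segments(len(cells), links_from_grid(cells, link_on))
print(groups)
```

Each resulting group is one candidate text line. Union-find keeps the grouping close to linear in the number of links, which matters when many small segments are predicted per image.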

Key components:

Interval sampling to enlarge the high-level feature maps, preserving resolution for small text.

Fusion of deep and shallow features to construct a multi‑level pyramid (conv4_3_f, fc7_f, conv6_2_f, …) with 256‑dimensional channels.

Segment‑and‑link prediction on each pyramid level, modeling eight possible neighbor relations.

Geometric post‑processing that fits a line to linked segments and derives final bounding boxes.
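The geometric post-processing step can be illustrated with a small sketch. It assumes each segment is given as `(cx, cy, width, height)` and fits the line by total least squares through the segment centers. The summary does not specify the fitting method or the segment format, so both are assumptions, and `merge_segments` is a hypothetical name.

```python
import math


def merge_segments(segments):
    """Merge linked segments into one rotated box (cx, cy, length, height, theta).

    segments: list of (cx, cy, width, height) tuples; this format is an assumption.
    theta is the angle of the fitted line in radians.
    """
    n = len(segments)
    mx = sum(s[0] for s in segments) / n
    my = sum(s[1] for s in segments) / n

    # Second-order moments of the segment centers around their mean.
    sxx = sum((s[0] - mx) ** 2 for s in segments)
    syy = sum((s[1] - my) ** 2 for s in segments)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in segments)

    # Principal-axis direction; unlike y = ax + b, this also handles vertical text.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)

    # Project each center onto the line and extend by half a segment width.
    lo = min((s[0] - mx) * ux + (s[1] - my) * uy - s[2] / 2 for s in segments)
    hi = max((s[0] - mx) * ux + (s[1] - my) * uy + s[2] / 2 for s in segments)

    mid = (lo + hi) / 2
    height = sum(s[3] for s in segments) / n
    return (mx + mid * ux, my + mid * uy, hi - lo, height, theta)


# Three horizontally aligned segments, each 10 wide and 8 tall.
box = merge_segments([(10, 20, 10, 8), (20, 20, 10, 8), (30, 20, 10, 8)])
print(box)
```

The box length comes from the two extreme projections, so a group of many short segments yields a single long box regardless of how many segments lie between the endpoints.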

Experiments on ICDAR2013 and ICDAR2015 show that expanding high‑level maps improves recall, while adding the pyramid further boosts precision. Compared with TextBoxes++, PixelLink and other state‑of‑the‑art methods, the proposed approach achieves a favorable trade‑off between speed (FPS) and accuracy.
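For reference, recall, precision, and their harmonic mean (F-measure) follow the standard definitions sketched below. The counts in the example are made up for illustration and are not results from the paper; the function name is likewise hypothetical.

```python
def detection_scores(true_positives, num_detections, num_ground_truth):
    """Return (precision, recall, f_measure) from match counts."""
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts only: 80 correct boxes out of 100 detections, 120 ground-truth boxes.
precision, recall, f_measure = detection_scores(80, 100, 120)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

Under these definitions, recovering more small text raises recall, while suppressing false detections raises precision, which is how the two components are described as contributing.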

The system is also deployed in a real‑world menu‑recognition scenario, where it outperforms SegLink by about 5 % on a 500‑image test set.

Future work will explore pixel‑level segmentation (inspired by PixelLink) and joint detection‑segmentation architectures.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
