How Open‑Vocabulary Detection and Segment‑Anything Are Revolutionizing Visual AI at Huolala

This article reviews traditional computer‑vision tasks—classification, detection, and segmentation—highlights their limitations, introduces open‑vocabulary detection and segment‑anything models such as GLIP, Grounding DINO, and SAM, and details how Huolala applies these advances to driver‑license, packing, and vehicle‑sticker inspections for safer, more efficient AI‑driven operations.

Huolala Tech

Jan 25, 2024

Introduction

Recognition is among the most studied and widely deployed tasks in computer vision (CV), and its mature techniques and tooling play a crucial role in real‑world projects.

Recognition aims to answer “what is it” by identifying objects in images or videos. According to the level of understanding, visual recognition can be divided into three sub‑tasks:

Image‑level understanding – classification

Object‑level understanding – detection

Pixel‑level understanding – segmentation

In Huolala’s business scenarios, detection and segmentation are especially important for cargo safety monitoring, vehicle‑sticker recognition, OCR, and mobile‑side algorithms.

Traditional detection and segmentation algorithms are lightweight and require few computational resources, but they have limitations:

Model performance depends on data quality and quantity.

Models often need fine‑tuning to transfer to new scenes.

New tasks require retraining from scratch.

With the rapid development of large models, new multimodal tasks such as open‑vocabulary detection and “segment‑anything” have emerged, complementing traditional methods.

The following sections review detection and segmentation tasks, their evolution in the era of large models, and Huolala’s practical applications.

Detection

Closed‑Set Detection & Open‑Set Detection

Traditional closed‑set detection can only detect categories present in the training set. It is mature, accurate, low‑cost, and widely deployed in Huolala’s automation pipelines.

Traditional open‑set detection extends closed‑set detection by also detecting unknown foreground objects, labeling them as “unknown”. In practice, open‑set detection faces challenges such as ambiguous foreground definitions and lower metrics, and is less frequently deployed.
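
The difference between the two settings can be sketched as a small decision rule. This is only an illustration: the class list, the objectness score, and both thresholds are assumptions made up for the example, and real open‑set detectors rely on more elaborate unknown‑aware training.

```python
from typing import List

# Hypothetical closed-set vocabulary, chosen only for illustration.
KNOWN_CLASSES = ["box", "pallet", "truck"]


def label_detection(class_scores: List[float], objectness: float,
                    known_threshold: float = 0.5,
                    objectness_threshold: float = 0.5) -> str:
    """Return a known class name, "unknown", or "background".

    A closed-set detector stops at the first branch; the open-set
    variant additionally keeps confident foreground regions that match
    no known class and labels them "unknown".
    """
    best_idx = max(range(len(class_scores)), key=lambda i: class_scores[i])
    if class_scores[best_idx] >= known_threshold:
        return KNOWN_CLASSES[best_idx]
    if objectness >= objectness_threshold:
        return "unknown"
    return "background"


known = label_detection([0.1, 0.8, 0.1], objectness=0.9)
unknown = label_detection([0.2, 0.3, 0.1], objectness=0.9)
background = label_detection([0.2, 0.3, 0.1], objectness=0.1)
```

The "ambiguous foreground" problem mentioned above shows up here as the difficulty of choosing a meaningful objectness threshold.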

Open‑Vocabulary Detection

Phrase grounding is a classic vision‑language task that localizes the noun phrases of a text description in an image. Open‑vocabulary detection is inspired by phrase grounding and aims to detect arbitrary categories specified by text prompts, including categories never seen during training (zero‑shot).

Representative models include GLIP and Grounding DINO.

GLIP

In December 2021, Microsoft introduced GLIP (Grounded Language‑Image Pre‑training), which unifies object detection and phrase grounding. It uses separate text and visual encoders, adds fusion modules, and aligns detection‑box features with text features.
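
The alignment step can be illustrated with a toy computation: each region feature is scored against each text‑token feature by a dot product, and those scores take the place of fixed classifier logits. The two‑dimensional features and the token names below are invented for the example; this is a minimal sketch of the idea, not GLIP's actual implementation.

```python
from typing import List

Vector = List[float]


def alignment_scores(region_feats: List[Vector],
                     token_feats: List[Vector]) -> List[List[float]]:
    """Return S where S[i][j] is the dot product of region i and token j."""
    return [
        [sum(r * t for r, t in zip(region, token)) for token in token_feats]
        for region in region_feats
    ]


def best_token_per_region(scores: List[List[float]]) -> List[int]:
    """Pick, for every region, the index of the best-matching token."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in scores]


# Toy features (assumed values): two candidate regions, two prompt tokens.
tokens = ["carton", "sofa"]
region_feats = [[1.0, 0.0], [0.0, 1.0]]
token_feats = [[0.9, 0.1], [0.2, 0.8]]

scores = alignment_scores(region_feats, token_feats)
matches = [tokens[j] for j in best_token_per_region(scores)]
```

Because the "classes" are just rows of text features, changing the prompt changes what the detector can name without touching the visual side.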

GLIP learns robust, semantically rich, text‑aware representations and exhibits strong few‑shot downstream transfer, as shown by prompt‑tuning results.

Grounding DINO

In March 2023, IDEA and several universities released Grounding DINO, which combines the Transformer‑based DINO detector with GLIP‑style grounded pre‑training and tighter multimodal fusion, reporting state‑of‑the‑art open‑set detection results.

Grounding DINO demonstrates excellent zero‑shot performance and surpasses closed‑set detectors after fine‑tuning on COCO.

Compared with traditional open‑set detection, GLIP emphasizes few‑shot transfer, while Grounding DINO focuses on zero‑shot capability.

Segmentation

Semantic, Instance, and Panoptic Segmentation

Three fundamental segmentation tasks are widely used:

Semantic segmentation assigns a class label to each pixel without distinguishing instances.

Instance segmentation separates individual objects of the same class.

Panoptic segmentation combines both, handling foreground and background while distinguishing instances.
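
A toy label map makes the three outputs concrete. The class ids, the tiny 2x4 image, and the `class_id * 1000 + instance_id` encoding are assumptions for illustration; the encoding is one common convention, not the only one.

```python
from typing import List

Grid = List[List[int]]

# Semantic map: one class id per pixel (0 = road, 1 = box). The two
# boxes on the right are indistinguishable at this level.
semantic: Grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Instance map: 0 = no instance, 1 and 2 are two separate boxes.
instance: Grid = [
    [0, 0, 1, 2],
    [0, 0, 1, 2],
]


def to_panoptic(semantic: Grid, instance: Grid,
                label_divisor: int = 1000) -> Grid:
    """Fuse class and instance ids into a single id per pixel."""
    return [
        [cls * label_divisor + inst for cls, inst in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance)
    ]


panoptic = to_panoptic(semantic, instance)
```

Background pixels keep a single id, while each box receives its own id that still encodes the class.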

Referring Segmentation, Interactive Segmentation, Binary Segmentation

New tasks have emerged:

Referring segmentation segments the target described by a textual prompt.

Interactive segmentation predicts a mask from user clicks or boxes.

Binary (saliency) segmentation separates foreground from background, emphasizing fine details.

Semantic‑Agnostic “Segment‑Anything”

Semantic‑agnostic segment‑anything aims to segment any region in any scene without assigning a semantic class to it. Representative models include SAM and its lightweight variants.

SAM

In April 2023, Meta released SAM (Segment Anything Model), which uses an image encoder, a prompt encoder, and a mask decoder to output segmentation masks.

SAM follows the same prompt‑driven workflow as interactive segmentation (clicks or boxes) but can segment arbitrary content at fine granularity.

Lightweight SAM variants such as FastSAM and MobileSAM reduce parameters and inference latency.

MobileSAM distills SAM’s heavy image encoder into a much smaller one, yielding a compact model with comparable performance.
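
Encoder distillation of this kind is commonly trained by making the small encoder reproduce the large encoder's embeddings. The sketch below assumes a plain mean‑squared‑error objective on flattened embeddings; the numbers are toy values, and the exact loss and training schedule used by MobileSAM are not stated in this article.

```python
from typing import List


def embedding_mse(teacher: List[float], student: List[float]) -> float:
    """Mean squared error between teacher and student embeddings."""
    if len(teacher) != len(student):
        raise ValueError("embeddings must have the same length")
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


# Toy embeddings (assumed values) for a single image.
teacher_embedding = [1.0, 2.0]
student_embedding = [1.0, 4.0]

loss = embedding_mse(teacher_embedding, student_embedding)
```

Because only the image embedding is matched, the original prompt encoder and mask decoder can in principle be reused unchanged with the smaller encoder.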

Semantic‑Related “Segment‑Anything”

To add semantic labels to SAM’s masks, works such as SSA, Semantic‑SAM, and Grounded‑SAM combine semantic segmentation with SAM.

SSA

SSA merges a semantic segmentation model (semantic branch) with SAM (mask branch) and uses a semantic voting module to assign the most frequent class within each mask as its label.
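
The voting step is simple enough to sketch directly. The semantic map and masks below are toy data, and `vote_mask_labels` is a hypothetical helper name rather than SSA's API.

```python
from collections import Counter
from typing import List, Optional

SemanticMap = List[List[str]]
Mask = List[List[int]]


def vote_mask_labels(semantic_map: SemanticMap,
                     masks: List[Mask]) -> List[Optional[str]]:
    """Label each mask with the most frequent class among its pixels."""
    labels: List[Optional[str]] = []
    for mask in masks:
        votes = Counter(
            semantic_map[r][c]
            for r, row in enumerate(mask)
            for c, inside in enumerate(row)
            if inside
        )
        # An empty mask has no pixels to vote with, so it gets no label.
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels


# Semantic branch output (toy data): one class name per pixel.
semantic_map = [
    ["road", "road", "box"],
    ["road", "box", "box"],
]
# Mask branch output (toy data): binary masks of the same size.
masks = [
    [[1, 1, 0], [1, 0, 0]],  # covers only "road" pixels
    [[0, 1, 1], [0, 0, 1]],  # one "road" pixel, two "box" pixels
    [[0, 0, 0], [0, 0, 0]],  # empty mask
]

labels = vote_mask_labels(semantic_map, masks)
```

The second mask shows why voting helps: SAM's mask boundary is trusted, and a few disagreeing pixels from the semantic branch are outvoted.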

Semantic‑SAM

Proposed in July 2023 by Microsoft, IDEA, and HKUST, Semantic‑SAM outputs masks at six granularity levels for each prompt point, trained on SA‑1B and other segmentation datasets.

Grounded‑SAM

Grounded‑SAM chains the two models: Grounding DINO detects coarse boxes for a text prompt, and those boxes are passed to SAM as prompts to obtain precise masks, yielding semantically labeled segment‑anything.
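
The composition can be sketched as a small function that takes the detector and the segmenter as plug‑in callables. Everything here is a stand‑in: `toy_detect` and `toy_segment` are hypothetical placeholders for the real models, and the image is a tiny grid of integers.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive
Image = List[List[int]]
Mask = List[List[int]]


def grounded_segmentation(
    image: Image,
    prompt: str,
    detect: Callable[[Image, str], List[Tuple[str, Box]]],
    segment: Callable[[Image, Box], Mask],
) -> List[Dict]:
    """Detect boxes for a text prompt, then turn each box into a mask."""
    results = []
    for phrase, box in detect(image, prompt):
        results.append({"label": phrase, "box": box,
                        "mask": segment(image, box)})
    return results


def toy_detect(image: Image, prompt: str) -> List[Tuple[str, Box]]:
    # Hypothetical stand-in for the detector: always returns one fixed box.
    return [(prompt, (1, 0, 3, 2))]


def toy_segment(image: Image, box: Box) -> Mask:
    # Hypothetical stand-in for the segmenter: nonzero pixels inside the box.
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1 and value > 0) else 0
         for x, value in enumerate(row)]
        for y, row in enumerate(image)
    ]


image = [[0, 5, 5, 0], [0, 5, 0, 0]]
results = grounded_segmentation(image, "sticker", toy_detect, toy_segment)
```

The label comes from the detector and the pixel‑accurate shape from the segmenter, which is the division of labor described above.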

Practice at Huolala

Driver‑License Recognition

To protect driver and user data, Huolala avoids manual annotation of driver‑license images. Instead, an open‑vocabulary detector acts as a teacher model that generates bounding boxes; low‑quality results are filtered out so that only high‑confidence boxes remain, and a lightweight mobile model is then distilled from them. The resulting model achieves both high precision and high recall without human labeling.
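
A minimal sketch of the filtering step, assuming teacher outputs arrive as labeled boxes with confidence scores. The `Detection` type, the score threshold, and the minimum‑area rule are illustrative assumptions; the article does not specify Huolala's actual filtering criteria.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels
    score: float                            # teacher confidence in [0, 1]


def filter_pseudo_labels(detections: List[Detection],
                         min_score: float = 0.6,
                         min_area: float = 100.0) -> List[Detection]:
    """Keep only confident, reasonably sized teacher boxes."""
    kept = []
    for det in detections:
        x0, y0, x1, y1 = det.box
        area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        if det.score >= min_score and area >= min_area:
            kept.append(det)
    return kept


teacher_outputs = [
    Detection("license", (10, 10, 110, 70), 0.92),  # kept
    Detection("license", (0, 0, 5, 5), 0.95),       # too small
    Detection("license", (10, 10, 110, 70), 0.30),  # low confidence
]
pseudo_labels = filter_pseudo_labels(teacher_outputs)
```

The surviving boxes would then serve as training targets for the lightweight student model in place of human annotations.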

Packing‑Audit

In the moving‑service workflow, open‑vocabulary detection is prompt‑tuned with textual descriptions of packing styles, then distilled into a traditional closed‑set detector, enabling high‑precision detection of diverse packaging items while keeping computational cost low.
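
One practical detail when distilling into a closed‑set detector is converting the teacher's free‑text phrases into fixed class ids. The phrase table below is entirely hypothetical and only illustrates the idea; the actual prompts and classes are not given in this article.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]

# Hypothetical mapping from prompt phrases to closed-set class ids.
PHRASE_TO_CLASS: Dict[str, int] = {
    "cardboard box": 0,
    "carton": 0,
    "woven bag": 1,
    "plastic wrap": 2,
}


def to_closed_set_targets(
    teacher_outputs: List[Tuple[str, Box]],
) -> List[Tuple[int, Box]]:
    """Map (phrase, box) pairs to (class_id, box), dropping unknown phrases."""
    targets = []
    for phrase, box in teacher_outputs:
        class_id = PHRASE_TO_CLASS.get(phrase.strip().lower())
        if class_id is not None:
            targets.append((class_id, box))
    return targets


teacher_outputs = [
    ("Carton", (0, 0, 50, 40)),
    ("woven bag", (60, 10, 120, 90)),
    ("bicycle", (5, 5, 20, 20)),  # not part of the packing vocabulary
]
targets = to_closed_set_targets(teacher_outputs)
```

Several phrasings can map to the same class id, which lets the prompts stay descriptive while the student keeps a small, fixed label set.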

Vehicle‑Sticker Audit

Huolala designs a feature‑matching pipeline that aligns vehicle images, extracts sticker regions, and compares them. To handle varying camera models and distortions, a “segment‑anything” model is used to isolate the sticker area before feature matching.
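
The isolation step can be sketched as cropping the image to the tight bounding box of the predicted mask and zeroing out everything outside it. The helper names and the toy image are assumptions for illustration; the feature‑matching stage itself is not shown.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def mask_bbox(mask: Grid) -> Optional[Box]:
    """Return the tight bounding box of nonzero mask pixels, if any."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def crop_masked(image: Grid, mask: Grid) -> Grid:
    """Crop to the mask's bounding box and blank pixels outside the mask."""
    box = mask_bbox(mask)
    if box is None:
        return []
    x0, y0, x1, y1 = box
    return [
        [image[r][c] if mask[r][c] else 0 for c in range(x0, x1)]
        for r in range(y0, y1)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
sticker_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
sticker_crop = crop_masked(image, sticker_mask)
```

Restricting the comparison to the masked region keeps background clutter from contributing features to the match.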

Conclusion & Outlook

In the era of large models, detection is dominated by open‑vocabulary approaches (e.g., GLIP, Grounding DINO) that enable zero‑shot detection and efficient few‑shot transfer. Segmentation focuses on semantic‑agnostic and semantic‑related “segment‑anything” models (e.g., SAM, Semantic‑SAM). Huolala has integrated these technologies to enhance data security and model production efficiency, and will continue to expand algorithm capabilities and impact across its operations.
