How Glint-MVT Powers City‑Scale Multimodal AI: Insights from a Tech VP

In an interview before the DACon conference, Dr. Feng Ziyong reveals how Glint‑MVT and novel data‑synthesis techniques overcome distribution gaps, improve compositional understanding, and enable billion‑scale, second‑level retrieval for city‑level surveillance, while balancing model efficiency and effectiveness.


Before the DACon conference, Dr. Feng Ziyong, Vice President of Technology at Glint Deep Vision, shared core insights and breakthroughs from his team’s work on deploying multimodal AI in city‑scale security applications.

He highlighted the challenges of applying CLIP‑like universal models to surveillance, where data distribution, object scale, clarity, and Chinese compositional semantics differ from internet‑scale training data, and proposed a pragmatic path: “strengthen single modality first, then align multimodality.”

The self‑developed visual foundation model Glint‑MVT uses a margin‑based (interval) Softmax loss over millions of virtual classes to build a highly discriminative visual representation base, supplying high‑quality image embeddings for downstream multimodal models such as RWKV‑CLIP and UniME.
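The interview does not spell out the loss itself; "interval Softmax" belongs to the margin‑based softmax family used in large‑scale recognition. Below is a minimal sketch assuming an ArcFace‑style additive angular margin over learnable virtual‑class centers; all names are illustrative, not taken from Glint‑MVT's code.

```python
import torch
import torch.nn.functional as F

def margin_softmax_loss(embeddings, labels, class_centers, margin=0.5, scale=64.0):
    """Additive-angular-margin softmax over (virtual) class centers.

    embeddings:    (B, D) image features from the visual backbone
    labels:        (B,)   virtual-class index for each image
    class_centers: (C, D) learnable centers; C can reach the millions
    """
    # Cosine similarity between normalized features and normalized centers
    emb = F.normalize(embeddings, dim=1)
    centers = F.normalize(class_centers, dim=1)
    cos = emb @ centers.t()                                   # (B, C)

    # Add an angular margin only to the target-class logit
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=centers.size(0)).bool()
    logits = torch.where(target, torch.cos(theta + margin), cos) * scale

    return F.cross_entropy(logits, labels)
```

With millions of virtual classes the center matrix is usually too large for a single GPU, so in practice it would be sharded across devices (partial‑FC style); that engineering detail is omitted above.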

To mitigate the scarcity of high‑quality image‑text pairs, the team introduced RealSyn data construction and ALIP/CLIP‑CID data‑purification techniques. These methods raised compositional understanding by about 30% and retrieval accuracy by roughly 35% in vertical scenarios like security.

RealSyn leverages abundant non‑paired multimodal documents, creates a curated image and sentence pool, and matches each image with multiple semantically related captions from different contexts, thereby improving robustness and generalization.
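A minimal sketch of that pairing step, assuming image and sentence embeddings have already been computed by a pretrained dual encoder; `match_captions` is an illustrative name, not part of the published RealSyn pipeline.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def match_captions(image_feats, sentence_feats, k=3):
    """For each image, pick the k most semantically related sentences from a
    pre-built sentence pool (RealSyn-style multi-caption pairing).

    image_feats:    (N, D) embeddings of curated images
    sentence_feats: (M, D) embeddings of curated sentences
    Returns (indices, similarities), each of shape (N, k).
    """
    img = F.normalize(image_feats, dim=1)
    txt = F.normalize(sentence_feats, dim=1)
    sim = img @ txt.t()                     # (N, M) cosine similarities
    topk = sim.topk(k, dim=1)
    return topk.indices, topk.values
```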

ALIP generates synthetic captions that better match each image, while CLIP‑CID removes semantically redundant data, cutting training costs by 50‑60% without sacrificing performance.
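The interview does not describe CLIP‑CID's deduplication algorithm; the sketch below shows one plausible form of semantic redundancy removal, a greedy embedding‑similarity scan, purely for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def semantic_dedup(image_feats, threshold=0.95):
    """Greedy near-duplicate removal on image embeddings.

    image_feats: (N, D) embeddings from a frozen image encoder.
    Returns a boolean keep-mask of shape (N,). At billion scale this quadratic
    scan would be replaced by clustering on an approximate-nearest-neighbour index.
    """
    feats = F.normalize(image_feats, dim=1)
    keep = torch.zeros(feats.size(0), dtype=torch.bool)
    kept = []
    for i in range(feats.size(0)):
        if kept and (torch.stack(kept) @ feats[i]).max() >= threshold:
            continue                      # too similar to an image already kept
        keep[i] = True
        kept.append(feats[i])
    return keep
```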

Noise‑filtering mechanisms in RealSyn include single‑modality quality filtering, perceptual and semantic redundancy removal, and cross‑modality similarity filtering with semantic balanced sampling.
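A rough sketch of the last two steps, cross‑modal similarity filtering followed by cluster‑capped (semantically balanced) sampling. The threshold, cap, and `cluster_ids` input are assumptions for illustration, not values from the RealSyn work.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_and_balance(image_feats, text_feats, cluster_ids, min_sim=0.25, per_cluster=1000):
    """Keep pairs whose image-text similarity clears a threshold, then cap each
    semantic cluster so head concepts do not dominate the training mix.

    image_feats, text_feats: (N, D) embeddings of each image-caption pair
    cluster_ids:             (N,)   semantic cluster assignment of each pair
    Returns indices of the pairs retained.
    """
    sim = F.cosine_similarity(image_feats, text_feats, dim=1)      # (N,)
    candidates = (sim >= min_sim).nonzero(as_tuple=True)[0]

    kept = []
    for c in cluster_ids[candidates].unique():
        members = candidates[cluster_ids[candidates] == c]
        # Within each cluster, keep the highest-similarity pairs up to the cap
        order = sim[members].argsort(descending=True)[:per_cluster]
        kept.append(members[order])
    return torch.cat(kept)
```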

The DeGLA approach enhances compositional understanding via self‑distillation, achieving a 30% boost on a million‑scale benchmark while preserving the model’s general capability.
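DeGLA's exact formulation is not given in the interview. Below is a minimal sketch of the self‑distillation idea under one common setup: a frozen copy of the pretrained model acts as teacher, and the student is regularized to keep the teacher's similarity structure while it is tuned on composition‑focused data.

```python
import copy
import torch
import torch.nn.functional as F

def self_distill_loss(student, teacher, images, temperature=0.07):
    """Self-distillation term: the frozen pretrained copy serves as teacher so the
    student preserves its general representation during compositional fine-tuning.
    `student(images)` and `teacher(images)` are assumed to return (B, D) embeddings.
    """
    with torch.no_grad():
        t = F.normalize(teacher(images), dim=1)
    s = F.normalize(student(images), dim=1)

    # Match the student's in-batch similarity structure to the teacher's
    t_prob = F.softmax(t @ t.t() / temperature, dim=1)
    s_logp = F.log_softmax(s @ s.t() / temperature, dim=1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean")

# Typical setup: snapshot the pretrained model before fine-tuning begins.
# teacher = copy.deepcopy(student).eval()
# for p in teacher.parameters():
#     p.requires_grad_(False)
```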

In the real‑time system handling millions of camera streams, a lightweight image encoder extracts features stored in a custom vector database, enabling billion‑scale feature retrieval within seconds; text queries are encoded by a larger text encoder and matched against this database.
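A minimal sketch of that asymmetric retrieval path, using FAISS as a stand‑in for the custom vector database and assuming both encoders return L2‑normalized features so inner product equals cosine similarity.

```python
import numpy as np
import faiss  # stand-in for the custom vector database described in the article

dim = 512
# Flat inner-product index for clarity; at billion scale a compressed IVF-PQ or
# HNSW index, sharded across machines, would be used instead.
index = faiss.IndexFlatIP(dim)

def index_frames(image_encoder, frames):
    """Ingest side: lightweight image encoder, features stored in the index."""
    feats = image_encoder(frames).astype(np.float32)     # (B, dim), L2-normalized
    index.add(feats)

def text_search(text_encoder, query, k=50):
    """Query side: larger text encoder, matched against stored image features."""
    q = text_encoder([query]).astype(np.float32)         # (1, dim), L2-normalized
    scores, ids = index.search(q, k)
    return ids[0], scores[0]
```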

Looking forward, Dr. Feng stresses that continuous iteration of data and models is essential, but the key to balancing effectiveness and efficiency lies in efficiently integrating existing foundation models through distillation and transfer learning.
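As one concrete, hypothetical instance of that integration, a small deployable encoder can be fit to a frozen foundation model by matching embeddings through a learned projection; the function below is a sketch, not a description of Glint's pipeline.

```python
import torch.nn as nn
import torch.nn.functional as F

def feature_distill_loss(student_feats, teacher_feats, proj: nn.Linear):
    """Distill a frozen foundation model into a smaller encoder by minimizing the
    cosine distance between projected student and teacher embeddings.
    student_feats: (B, Ds), teacher_feats: (B, Dt), proj: nn.Linear(Ds, Dt)
    """
    s = F.normalize(proj(student_feats), dim=1)
    t = F.normalize(teacher_feats.detach(), dim=1)
    return (1 - (s * t).sum(dim=1)).mean()
```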

Tags: multimodal AI, model distillation, data synthesis, embedding retrieval, city surveillance, visual foundation model
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
