dots.vlm1: Open-Source Multimodal Vision-Language Model with Near-SOTA Performance

dots.vlm1, the first open-source multimodal large model from Xiaohongshu hi-lab, combines a 1.2-billion-parameter NaViT visual encoder with the DeepSeek V3 LLM, achieving near-state-of-the-art visual understanding and reasoning while remaining competitive on text tasks. The model is available on GitHub and HuggingFace.


Overview

dots.vlm1 is the first multimodal large model released by Xiaohongshu hi‑lab as open source. It integrates a 1.2 billion‑parameter NaViT visual encoder trained from scratch with the DeepSeek V3 large language model, providing strong visual‑language understanding and reasoning capabilities.

Model Highlights

NaViT visual encoder: Trained from scratch rather than fine-tuned from an existing encoder, it supports dynamic resolution and incorporates pure visual supervision to raise the ceiling of visual perception. Training data includes traditional image-caption pairs plus a large amount of structured-image data to improve OCR ability.

Multimodal training data: In addition to standard data, synthetic data covering diverse image types (tables, charts, documents, graphics) and description styles (Alt-Text, dense captions, grounding) is added. A multimodal large model rewrites web-page image-text data, significantly improving data quality.

Performance: After large-scale pre-training and fine-tuning, dots.vlm1 reaches near-SOTA results on most multimodal benchmarks, matching closed-source models such as Gemini 2.5 Pro and Seed-VL1.5 on tasks like MMMU, MathVision, and OCR reasoning, while remaining competitive on pure text benchmarks.

Resources

GitHub repository: https://github.com/rednote-hilab/dots.vlm1

HuggingFace model: https://huggingface.co/rednote-hilab/dots.vlm1.inst

Demo: https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo
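For orientation, a minimal loading sketch is shown below. It assumes the HuggingFace checkpoint exposes a standard transformers-style processor and generate interface via trust_remote_code; the actual entry points may differ, so the GitHub README is the authoritative reference.

```python
# Minimal sketch of loading the released checkpoint from HuggingFace.
# Assumes a standard transformers-style interface exposed via trust_remote_code;
# the real entry points may differ -- see the GitHub README.
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

model_id = "rednote-hilab/dots.vlm1.inst"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto", torch_dtype="auto"
)

image = Image.open("chart.png")                      # any local test image
prompt = "Describe the trend shown in this chart."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```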

Evaluation

On major visual benchmarks, dots.vlm1’s overall performance is close to that of leading proprietary models. It shows strong results on MMMU, MathVision, and OCR reasoning, indicating robust image-text understanding and reasoning.

For typical text reasoning tasks (AIME, GPQA, LiveCodeBench), dots.vlm1 performs comparably to DeepSeek‑R1‑0528, demonstrating decent generality in mathematics and code, though a gap remains on more diverse reasoning tasks such as GPQA.

Sample Outputs

Complex chart reasoning

[Image: complex chart reasoning example]

STEM problem solving

[Image: STEM problem-solving example]

Long‑tail recognition

[Image: long-tail recognition example]

Architecture Overview

[Image: architecture diagram]

dots.vlm1 consists of three core components: a 1.2 billion‑parameter NaViT visual encoder, a lightweight MLP adapter, and the DeepSeek V3 MoE large language model. Training proceeds in three stages:

Stage 1 – Visual encoder pre-training: NaViT is trained from random initialization on 224×224 images using dual supervision (next-token prediction on image-text pairs and next-patch generation via diffusion) to boost spatial and semantic perception.

Stage 2 – VLM pre-training: The visual encoder and DeepSeek V3 LLM are jointly trained on a massive, diverse multimodal dataset.

Stage 3 – VLM fine-tuning: Supervised fine-tuning (SFT) on task-diverse data enhances generalization; reinforcement learning is planned for future work.
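As a rough illustration of how the three components connect at inference time, here is a schematic sketch; the module names, dimensions, and adapter design are assumptions made for illustration, not the released implementation.

```python
# Illustrative sketch of the dots.vlm1 forward path: NaViT encodes image patches,
# a lightweight MLP adapter projects them into the LLM embedding space, and the
# DeepSeek V3 MoE decoder consumes them alongside text tokens.
# All module names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Projects visual-encoder features into the LLM hidden size."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)

def forward_vlm(vision_encoder, adapter, llm, pixel_values, text_embeds):
    """One forward pass: image patches -> visual tokens -> joint sequence -> LLM."""
    patch_features = vision_encoder(pixel_values)   # [B, N_patches, vision_dim]
    visual_tokens = adapter(patch_features)         # [B, N_patches, llm_dim]
    # Visual tokens are placed ahead of (or interleaved with) the text embeddings
    # and processed by the MoE decoder as an ordinary token sequence.
    joint_sequence = torch.cat([visual_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=joint_sequence)
```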

Vision Encoder Details

The NaViT encoder uses a 42‑layer Transformer with RMSNorm, SwiGLU, and 2‑D RoPE. Training follows a two‑phase strategy: initial pre‑training at native resolution with next‑token and next‑patch objectives, followed by progressive resolution scaling up to tens of millions of pixels, incorporating OCR scenes, grounding data, and video frames.
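To make the block structure concrete, below is a schematic sketch of one such encoder layer under the stated design (pre-norm attention, RMSNorm, SwiGLU); the hidden sizes, head counts, and exact placement of the 2-D RoPE are assumptions, and the rotary embedding itself is stubbed out.

```python
# Schematic sketch of one NaViT-style encoder block with pre-norm attention,
# RMSNorm, and a SwiGLU MLP. Dimensions and the 2-D RoPE application point are
# illustrative assumptions; the rotary embedding is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class NaViTBlock(nn.Module):
    def __init__(self, dim: int = 1536, n_heads: int = 12, mlp_hidden: int = 4096):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, mlp_hidden)

    def forward(self, x):
        # In the real model, 2-D RoPE would rotate queries and keys according to
        # each patch's (row, column) position before attention; omitted here.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

# The full encoder stacks 42 such blocks over variable-length patch sequences
# produced from native-resolution images.
```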

Multimodal Pre‑training Data

Data is divided into two main categories:

Cross-modal translation data: Images paired with Alt-Text/Dense Caption, complex charts/tables/formulas with structured annotations, OCR scenes, video frames with temporal descriptions, and grounding supervision (bounding boxes, keypoints).

Cross-modal fusion data: Mixed image-text contexts to train next-token prediction across modalities. Notable sources include web data (cleaned with an internal VLM rewrite pipeline) and PDF data (processed with the dots.ocr model that renders PDFs as images and masks text to teach layout understanding).
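To make the two categories concrete, the sketch below shows what individual training records might look like; the field names and layout are hypothetical and only illustrate the distinction between translation-style pairs and fusion-style interleaved documents.

```python
# Hypothetical record layouts illustrating the two data categories described
# above. Field names are illustrative, not the team's actual schema.

# Cross-modal translation: an image paired with caption-like targets
# (Alt-Text, dense caption, chart/table annotation, OCR text, or grounding boxes).
translation_record = {
    "image": "chart_0042.png",
    "targets": {
        "alt_text": "Bar chart of quarterly revenue, 2021-2023.",
        "dense_caption": "A grouped bar chart comparing revenue across ...",
        "grounding": [{"label": "legend", "bbox": [812, 40, 980, 120]}],
    },
}

# Cross-modal fusion: an interleaved image-text document (e.g. a cleaned web
# page, or a PDF page rendered as an image with its text masked), used for
# next-token prediction across modalities.
fusion_record = {
    "source": "web",  # or "pdf"
    "sequence": [
        {"type": "text", "content": "The figure below shows the training setup."},
        {"type": "image", "content": "page_007.png"},
        {"type": "text", "content": "As shown, the encoder is trained first ..."},
    ],
}
```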

Future Directions

To close remaining gaps, the team plans to expand cross‑modal translation data, improve the visual encoder architecture, explore more effective loss functions, and incorporate reinforcement learning for better visual reasoning. Enhancing pre‑training to embed more reasoning ability is also a priority.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI, vision-language, open-source, deep-learning, large-model
Written by

Xiaohongshu Tech REDtech

The official account of the Xiaohongshu tech team, sharing technical innovations and problem-solving insights, and advancing together with the community.
