dots.vlm1: Open‑Source Multimodal Vision‑Language Model with Near‑SOTA Performance
dots.vlm1, the first open‑source multimodal large model from Xiaohongshu hi‑lab, pairs a 1.2‑billion‑parameter NaViT visual encoder with the DeepSeek V3 LLM. It achieves near‑state‑of‑the‑art visual understanding and reasoning while remaining competitive on text tasks, and is available on GitHub and HuggingFace.
Overview
dots.vlm1 is the first multimodal large model released by Xiaohongshu hi‑lab as open source. It integrates a 1.2 billion‑parameter NaViT visual encoder trained from scratch with the DeepSeek V3 large language model, providing strong visual‑language understanding and reasoning capabilities.
Model Highlights
NaViT visual encoder : Trained from scratch rather than fine‑tuned from an existing encoder, it supports dynamic resolution and incorporates pure visual supervision to raise the ceiling of perception. Training data includes traditional image‑caption pairs plus a large amount of structured images to improve OCR ability.
Multimodal training data : In addition to standard data, synthetic data covering diverse image types (tables, charts, documents, graphics) and description styles (Alt‑Text, dense captions, grounding) is added. A multimodal large model rewrites web‑page image‑text data, significantly improving data quality.
Performance : After large‑scale pre‑training and fine‑tuning, dots.vlm1 reaches near‑SOTA results on most multimodal benchmarks, matching closed‑source models such as Gemini 2.5 Pro and Seed‑VL1.5 on tasks like MMMU, MathVision, and OCR reasoning, while remaining competitive on pure text benchmarks.
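The dynamic‑resolution idea behind NaViT can be sketched as patch packing: images of different sizes are cut into patches and concatenated into one token sequence, instead of being resized to a fixed resolution. A minimal illustration, assuming a hypothetical 14‑pixel patch size (the model's actual patch size is not stated here):

```python
from dataclasses import dataclass

PATCH = 14  # assumed patch size for illustration only


@dataclass
class PackedPatch:
    image_id: int  # which source image this patch came from
    row: int       # patch-grid row within that image
    col: int       # patch-grid column within that image


def pack_images(shapes):
    """Pack variable-resolution images into one token sequence, NaViT-style.

    shapes: list of (height, width) in pixels. Each image is split into a
    grid of PATCH x PATCH patches, and all patches are concatenated into a
    single sequence, so no resizing to a fixed resolution is needed.
    """
    sequence = []
    for img_id, (h, w) in enumerate(shapes):
        rows, cols = h // PATCH, w // PATCH
        for r in range(rows):
            for c in range(cols):
                sequence.append(PackedPatch(img_id, r, c))
    return sequence


seq = pack_images([(224, 224), (448, 224)])  # 16x16 + 32x16 patch grids
print(len(seq))  # 256 + 512 = 768
```

The (row, col) metadata carried by each packed patch is what position embeddings such as 2‑D RoPE consume downstream.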
Resources
GitHub repository: https://github.com/rednote-hilab/dots.vlm1
HuggingFace model: https://huggingface.co/rednote-hilab/dots.vlm1.inst
Demo: https://huggingface.co/spaces/rednote-hilab/dots-vlm1-demo
Evaluation
On major visual benchmarks, dots.vlm1’s overall performance is close to leading proprietary models. It shows strong results on MMMU, MathVision, and OCR reasoning, indicating robust image‑text understanding and reasoning.
For typical text reasoning tasks (AIME, GPQA, LiveCodeBench), dots.vlm1 performs comparably to DeepSeek‑R1‑0528, demonstrating decent generality in mathematics and code, though a gap remains on more diverse reasoning tasks such as GPQA.
Sample Outputs
Complex chart reasoning
STEM problem solving
Long‑tail recognition
Architecture Overview
dots.vlm1 consists of three core components: a 1.2 billion‑parameter NaViT visual encoder, a lightweight MLP adapter, and the DeepSeek V3 MoE large language model. Training proceeds in three stages:
Stage 1 – Visual encoder pre‑training : NaViT is trained from random initialization on 224×224 images using dual supervision (next‑token prediction on image‑text pairs and next‑patch generation via diffusion) to boost spatial and semantic perception.
Stage 2 – VLM pre‑training : The visual encoder and DeepSeek V3 LLM are jointly trained on a massive, diverse multimodal dataset.
Stage 3 – VLM fine‑tuning : Supervised fine‑tuning (SFT) on task‑diverse data enhances generalization; reinforcement learning is planned for future work.
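The three components above can be sketched as one forward path: encoder features are projected by the MLP adapter into the LLM's embedding space and spliced into the text token sequence. A toy illustration with made‑up dimensions and stub functions, not the actual implementation:

```python
import random

VIT_DIM, LLM_DIM = 4, 6  # toy widths; the real model's dimensions are far larger


def navit_encode(num_patches):
    # Stand-in for the 1.2B NaViT encoder: one feature vector per patch.
    return [[random.random() for _ in range(VIT_DIM)] for _ in range(num_patches)]


def mlp_adapter(vision_tokens, w):
    # Lightweight projection from the encoder width to the LLM embedding width
    # (a single linear layer here; the real adapter is a small MLP).
    return [[sum(v[i] * w[i][j] for i in range(VIT_DIM)) for j in range(LLM_DIM)]
            for v in vision_tokens]


def build_llm_input(text_embeds, vision_embeds):
    # The DeepSeek V3 LLM sees vision tokens spliced in with the text tokens
    # (prepended here for simplicity).
    return vision_embeds + text_embeds


w = [[0.1] * LLM_DIM for _ in range(VIT_DIM)]
vision = mlp_adapter(navit_encode(3), w)
llm_input = build_llm_input([[0.0] * LLM_DIM] * 2, vision)
print(len(llm_input))  # 3 vision tokens + 2 text tokens = 5
```

In Stage 1 only the encoder trains; Stages 2 and 3 optimize this whole path end to end.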
Vision Encoder Details
The NaViT encoder uses a 42‑layer Transformer with RMSNorm, SwiGLU, and 2‑D RoPE. Training follows a two‑phase strategy: initial pre‑training at native resolution with next‑token and next‑patch objectives, followed by progressive resolution scaling up to tens of millions of pixels, incorporating OCR scenes, grounding data, and video frames.
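The 2‑D RoPE mentioned above rotates one half of each feature vector by angles derived from the patch's row index and the other half by its column index, so attention scores become a function of relative 2‑D offsets. A minimal sketch with illustrative dimensions (the real head size and base frequency are not given here):

```python
import math


def rope_2d(vec, row, col, base=10000.0):
    """Apply 2-D rotary position embedding to one patch feature vector.

    The first half of the vector encodes the row position, the second half
    the column position. Pairs of dimensions are rotated by frequency-scaled
    angles, as in standard RoPE, but with two spatial axes instead of one.
    """
    half = len(vec) // 2
    out = list(vec)
    for axis_pos, start in ((row, 0), (col, half)):
        for i in range(0, half, 2):
            theta = axis_pos * base ** (-i / half)
            c, s = math.cos(theta), math.sin(theta)
            x, y = vec[start + i], vec[start + i + 1]
            out[start + i] = x * c - y * s
            out[start + i + 1] = x * s + y * c
    return out


print(rope_2d([1.0, 0.0, 1.0, 0.0], row=0, col=0))  # position (0,0): identity
```

Because the rotation at position (0, 0) is the identity, relative offsets alone determine how two patches' embeddings interact under dot‑product attention.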
Multimodal Pre‑training Data
Data is divided into two main categories:
Cross‑modal translation data : Images paired with Alt‑Text/Dense Caption, complex charts/tables/formulas with structured annotations, OCR scenes, video frames with temporal descriptions, and grounding supervision (bounding boxes, keypoints).
Cross‑modal fusion data : Mixed image‑text contexts to train next‑token prediction across modalities. Notable sources include web data (cleaned with an internal VLM rewrite pipeline) and PDF data (processed with the dots.ocr model that renders PDFs as images and masks text to teach layout understanding).
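The masking idea in the PDF pipeline can be illustrated with a toy function: text regions on a rendered page are blanked out, so the model must recover their content from layout context. Purely illustrative; the actual dots.ocr pipeline operates on pixel renderings, not character grids:

```python
MASK = "█"


def mask_text_regions(page, boxes):
    """Blank out text bounding boxes on a rendered page.

    page:  list of strings, a toy stand-in for a rendered PDF page image.
    boxes: (top, left, bottom, right) half-open regions whose text is
           hidden; the training target is to predict the masked text.
    """
    grid = [list(row) for row in page]
    for top, left, bottom, right in boxes:
        for r in range(top, bottom):
            for c in range(left, right):
                grid[r][c] = MASK
    return ["".join(row) for row in grid]


page = ["Title here ", "body text  "]
print(mask_text_regions(page, [(1, 0, 2, 4)]))  # masks the word "body"
```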
Future Directions
To close remaining gaps, the team plans to expand cross‑modal translation data, improve the visual encoder architecture, explore more effective loss functions, and incorporate reinforcement learning for better visual reasoning. Enhancing pre‑training to embed more reasoning ability is also a priority.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.