dots.vlm1: Open‑Source Multimodal Vision‑Language Model with Near‑SOTA Performance
dots.vlm1, the first open‑source multimodal large model from Xiaohongshu's hi lab, pairs a 1.2‑billion‑parameter NaViT vision encoder with the DeepSeek V3 LLM. It achieves near‑state‑of‑the‑art visual understanding and reasoning while remaining competitive on pure‑text tasks, and the code and weights are available on GitHub and Hugging Face.

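Since the weights are distributed through Hugging Face, inference should follow the usual `transformers` pattern for vision‑language models that ship custom modeling code. The sketch below is an assumption rather than the project's documented quickstart: the repo id, the processor interface, and the prompt format are all hypothetical placeholders, so check the official GitHub or Hugging Face pages for the real usage.

```python
# A minimal sketch, not the official quickstart. The repo id below is an
# assumed placeholder; dots.vlm1 may expose a different loading interface.
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

repo_id = "rednote-hilab/dots.vlm1.inst"  # hypothetical repo id

# trust_remote_code is assumed because the architecture (NaViT + DeepSeek V3)
# is unlikely to be covered by a stock transformers model class.
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto",   # the DeepSeek V3 backbone is large; shard across GPUs
    torch_dtype="auto",
)

# Simple single-image question answering.
image = Image.open("chart.png")
inputs = processor(images=image, text="Describe this chart.", return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```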