360VL: An Open‑Source Multimodal Large Language Model Based on Llama‑3‑70B
The article introduces 360VL, an open‑source multimodal large language model built on Llama‑3‑70B, describes its C‑abs bridge architecture for high‑resolution visual understanding, outlines the two‑stage training with bilingual data, and presents benchmark results showing strong performance relative to prior LMMs.
Introduction
With the release of Llama‑3 on April 19, the multimodal team at the 360 AI Research Institute quickly began investigating its potential for general‑purpose multimodal large models (LMMs), and recently open‑sourced 360VL, a multimodal model built on Llama‑3‑70B and the next generation of SEEChat. 360VL is the first open‑source multimodal model based on Llama‑3‑70B. It incorporates a global‑perception multi‑branch projector architecture that enhances image understanding, and achieves strong results on major benchmarks compared with LLaVA‑1.6, MM1, Yi‑VL, and other LMMs.
GitHub: https://github.com/360CVGroup/360VL
HuggingFace: https://huggingface.co/qihoo360/360VL-70B
Model Architecture
360VL is a multimodal large language model that supports multiple visual tasks. It follows the visual encoder–bridge–LLM design, in which a bridge layer aligns visual tokens with the language model's embedding space. The bridge layer is crucial: unlike the simple linear projections or resampler layers used by many LMMs, 360VL adopts a C‑abs structure that adds convolutional modules to extract visual features more effectively while keeping the token count manageable for high‑resolution images.
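To make the bridge design concrete, the following is a minimal sketch of a C‑abs‑style module: convolutions over the 2‑D grid of visual tokens, adaptive pooling to a fixed token budget, and a linear projection into the LLM embedding space. All module names, layer counts, and sizes here are illustrative assumptions, not the released 360VL weights or code.

```python
import torch
import torch.nn as nn

class CAbsBridge(nn.Module):
    """Hypothetical sketch of a C-abs style bridge: convolutional blocks
    over the 2-D grid of visual tokens, then adaptive pooling to a fixed
    token count, then a projection into the LLM embedding space."""

    def __init__(self, vis_dim=1024, llm_dim=4096, out_tokens=576):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1),
        )
        side = int(out_tokens ** 0.5)
        self.pool = nn.AdaptiveAvgPool2d(side)  # fixes the output token count
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x):
        # x: (B, L, vis_dim), where L is a square H*W grid of visual tokens
        B, L, D = x.shape
        side_in = int(L ** 0.5)
        x = x.transpose(1, 2).reshape(B, D, side_in, side_in)
        x = self.pool(self.conv(x))               # conv features, pooled grid
        x = x.flatten(2).transpose(1, 2)          # (B, out_tokens, vis_dim)
        return self.proj(x)                       # (B, out_tokens, llm_dim)
```

The key design point the article highlights is visible here: unlike a plain linear projector, the convolutions mix neighboring tokens before pooling, so local spatial structure survives the reduction in token count.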
For high‑resolution inputs, each image is split into four 336×336 patches plus a resized copy of the whole image. The visual encoder produces features of size 4·L·Dim for the patches and 1·L·Dim for the whole image; the patch features are dimension‑converted so that spatial information is preserved, concatenated with the whole‑image features, and fed to the LLM. Inside C‑abs, an adaptive interpolation encoding expands the positional encodings to the larger token grid, and AdaptivePooling reduces the concatenated 4·L·Dim features back to L·Dim.
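The two operations in this step can be sketched as follows. This is a simplified reading, not the released implementation: it assumes the four patches form a 2×2 spatial mosaic whose tokens are pooled back to one L‑token grid, and that "adaptive interpolation encoding" means linearly interpolating the positional table to the new token count.

```python
import torch
import torch.nn.functional as F

def pool_patch_grid(patch_feats, grid):
    """Reduce concatenated patch tokens (B, 4, L, D) back to (B, L, D).
    The four patches are arranged as a 2x2 mosaic so adaptive average
    pooling preserves spatial layout. L = grid*grid tokens per patch."""
    B, P, L, D = patch_feats.shape
    assert P == 4 and L == grid * grid
    x = patch_feats.reshape(B, 2, 2, grid, grid, D)
    x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, D, 2 * grid, 2 * grid)
    x = F.adaptive_avg_pool2d(x, grid)        # (B, D, grid, grid)
    return x.flatten(2).transpose(1, 2)       # (B, grid*grid, D)

def interpolate_pos_embed(pos, new_len):
    """Expand a (1, L, D) positional-encoding table to new_len tokens by
    linear interpolation -- one simple reading of the adaptive
    interpolation encoding described in the article."""
    pos = pos.transpose(1, 2)                 # (1, D, L)
    pos = F.interpolate(pos, size=new_len, mode="linear",
                        align_corners=False)
    return pos.transpose(1, 2)                # (1, new_len, D)
```

Pooling over the 2×2 mosaic rather than over a flat token sequence is what keeps the 4·L→L reduction spatially meaningful: each output token averages a local neighborhood of the high‑resolution grid.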
The training follows a two‑stage strategy. Pre‑training mainly uses the llava‑1.5 dataset with a small amount of additional data. For SFT, the open‑source llava_v1_5_mix665k dataset and extra bilingual multimodal data are used, improving instruction following especially for Chinese tasks.
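The article does not spell out which components train in each stage, but LLaVA‑1.5‑style recipes typically freeze the vision encoder throughout, train only the bridge during pre‑training, and unfreeze the LLM for SFT. Assuming 360VL follows that convention, the freeze/unfreeze scheduling might look like this (all names hypothetical):

```python
import torch

def set_trainable(module, flag):
    """Toggle gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    """Hypothetical stage scheduling for a LLaVA-1.5-style recipe.
    Stage 1 (pre-training): align only the bridge on captioning data.
    Stage 2 (SFT): also unfreeze the LLM for instruction tuning."""
    set_trainable(model["vision_encoder"], False)  # encoder stays frozen
    set_trainable(model["bridge"], True)           # bridge trains in both stages
    set_trainable(model["llm"], stage == 2)        # LLM trains only during SFT
```

This split explains the data choices above: stage 1 needs only image–text alignment data (llava‑1.5), while stage 2 uses instruction‑formatted data (llava_v1_5_mix665k plus bilingual multimodal data) to shape instruction following.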
Experimental Results
360VL was evaluated on zero‑shot benchmarks and achieved competitive scores across multiple metrics. Comparisons across different vision backbones (CLIP‑ViT‑L, siglip‑so400m, DFN5B‑CLIP‑ViT) and language models (Vicuna‑1.5‑7B/13B, Llama‑3‑8B/70B) show that siglip‑so400m and CLIP‑ViT‑L perform best among the backbones, and that Llama‑3 models consistently outperform Vicuna‑1.5, with Llama‑3‑70B delivering the highest overall performance.
Additional experiments expanding the visual token dimension to 6·L·Dim improved accuracy but reduced efficiency, so the final model kept the original configuration.
Specific Capabilities
Chinese language ability, OCR, image understanding, and localization performance are illustrated in the following figures.
Outlook
In summary, the researchers explored Llama‑3’s multimodal capabilities and released the 360VL model based on Llama‑3‑70B. As a general‑purpose multimodal model, 360VL demonstrates strong performance on visual question answering and content creation tasks, and the team plans to continue open‑sourcing future versions.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.