360VL: An Open‑Source Multimodal Large Language Model Based on Llama‑3‑70B
The article introduces 360VL, an open‑source multimodal large language model built on Llama‑3‑70B, describes its C‑abs bridge architecture for high‑resolution visual understanding, outlines the two‑stage training with bilingual data, and presents benchmark results showing strong performance relative to prior LMMs.
Introduction
With the release of Llama‑3 on April 19, the multimodal team at the 360 AI Research Institute quickly began investigating its potential for general‑purpose multimodal large models (LMMs), and recently open‑sourced 360VL, a multimodal model built on Llama‑3‑70B and the next generation of SEEChat. 360VL is the first open‑source multimodal model based on Llama‑3‑70B. It incorporates a global‑perception multi‑branch projector architecture that enhances image understanding, and achieves strong results on major benchmarks compared with LLaVA‑1.6, MM1, Yi‑VL, and other LMMs.
GitHub: https://github.com/360CVGroup/360VL
HuggingFace: https://huggingface.co/qihoo360/360VL-70B
Model Architecture
360VL is a multimodal large language model that supports multiple visual tasks. It follows the visual encoder–bridge–LLM design, in which a bridge layer aligns visual tokens with the language model's embedding space. The bridge layer is crucial: unlike the simple linear projections or resampler layers used by many LMMs, 360VL adopts a C‑abs structure that adds convolutional modules to extract visual features more effectively while keeping the token count manageable for high‑resolution images.
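To make the bridge design concrete, the following is a minimal sketch of a C‑abs‑style module: convolutions over the 2‑D grid of visual tokens, adaptive pooling to a fixed token budget, and a linear projection into the LLM embedding space. All module names, layer counts, and sizes here are illustrative assumptions, not the released 360VL weights or code.

```python
import torch
import torch.nn as nn

class CAbsBridge(nn.Module):
    """Hypothetical sketch of a C-abs style bridge: convolutional blocks
    over the 2-D grid of visual tokens, then adaptive pooling to a fixed
    token count, then a projection into the LLM embedding space."""

    def __init__(self, vis_dim=1024, llm_dim=4096, out_tokens=576):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1),
        )
        side = int(out_tokens ** 0.5)
        self.pool = nn.AdaptiveAvgPool2d(side)  # fixes the output token count
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x):
        # x: (B, L, vis_dim), where L is a square H*W grid of visual tokens
        B, L, D = x.shape
        side_in = int(L ** 0.5)
        x = x.transpose(1, 2).reshape(B, D, side_in, side_in)
        x = self.pool(self.conv(x))               # conv features, pooled grid
        x = x.flatten(2).transpose(1, 2)          # (B, out_tokens, vis_dim)
        return self.proj(x)                       # (B, out_tokens, llm_dim)
```

The key design point the article highlights is visible here: unlike a plain linear projector, the convolutions mix neighboring tokens before pooling, so local spatial structure survives the reduction in token count.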
For high‑resolution inputs, each image is split into four 336×336 patches plus a resized copy of the whole image. The visual encoder produces features of size 4·L·Dim for the patches and 1·L·Dim for the whole image; the patch features are dimension‑converted so that spatial information is preserved, concatenated with the whole‑image features, and fed to the LLM. Inside C‑abs, an adaptive interpolation encoding expands the positional encodings to the larger token grid, and AdaptivePooling reduces the concatenated 4·L·Dim features back to L·Dim.
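The two operations in this step can be sketched as follows. This is a simplified reading, not the released implementation: it assumes the four patches form a 2×2 spatial mosaic whose tokens are pooled back to one L‑token grid, and that "adaptive interpolation encoding" means linearly interpolating the positional table to the new token count.

```python
import torch
import torch.nn.functional as F

def pool_patch_grid(patch_feats, grid):
    """Reduce concatenated patch tokens (B, 4, L, D) back to (B, L, D).
    The four patches are arranged as a 2x2 mosaic so adaptive average
    pooling preserves spatial layout. L = grid*grid tokens per patch."""
    B, P, L, D = patch_feats.shape
    assert P == 4 and L == grid * grid
    x = patch_feats.reshape(B, 2, 2, grid, grid, D)
    x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, D, 2 * grid, 2 * grid)
    x = F.adaptive_avg_pool2d(x, grid)        # (B, D, grid, grid)
    return x.flatten(2).transpose(1, 2)       # (B, grid*grid, D)

def interpolate_pos_embed(pos, new_len):
    """Expand a (1, L, D) positional-encoding table to new_len tokens by
    linear interpolation -- one simple reading of the adaptive
    interpolation encoding described in the article."""
    pos = pos.transpose(1, 2)                 # (1, D, L)
    pos = F.interpolate(pos, size=new_len, mode="linear",
                        align_corners=False)
    return pos.transpose(1, 2)                # (1, new_len, D)
```

Pooling over the 2×2 mosaic rather than over a flat token sequence is what keeps the 4·L→L reduction spatially meaningful: each output token averages a local neighborhood of the high‑resolution grid.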
The training follows a two‑stage strategy. Pre‑training mainly uses the llava‑1.5 dataset with a small amount of additional data. For SFT, the open‑source llava_v1_5_mix665k dataset and extra bilingual multimodal data are used, improving instruction following especially for Chinese tasks.
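The article does not spell out which components train in each stage, but LLaVA‑1.5‑style recipes typically freeze the vision encoder throughout, train only the bridge during pre‑training, and unfreeze the LLM for SFT. Assuming 360VL follows that convention, the freeze/unfreeze scheduling might look like this (all names hypothetical):

```python
import torch

def set_trainable(module, flag):
    """Toggle gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    """Hypothetical stage scheduling for a LLaVA-1.5-style recipe.
    Stage 1 (pre-training): align only the bridge on captioning data.
    Stage 2 (SFT): also unfreeze the LLM for instruction tuning."""
    set_trainable(model["vision_encoder"], False)  # encoder stays frozen
    set_trainable(model["bridge"], True)           # bridge trains in both stages
    set_trainable(model["llm"], stage == 2)        # LLM trains only during SFT
```

This split explains the data choices above: stage 1 needs only image–text alignment data (llava‑1.5), while stage 2 uses instruction‑formatted data (llava_v1_5_mix665k plus bilingual multimodal data) to shape instruction following.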
Experimental Results
360VL was evaluated on zero‑shot benchmarks and achieved competitive scores across multiple metrics. Comparisons across different vision backbones (CLIP‑ViT‑L, siglip‑so400m, DFN5B‑CLIP‑ViT) and language models (Vicuna‑1.5‑7B/13B, Llama‑3‑8B/70B) show that siglip‑so400m and CLIP‑ViT‑L perform best among the backbones, and that Llama‑3 models consistently outperform Vicuna‑1.5, with Llama‑3‑70B delivering the highest overall performance.
Additional experiments expanding the visual token dimension to 6·L·Dim improved accuracy but reduced efficiency, so the final model kept the original configuration.
Specific Capabilities
Chinese language ability, OCR, image understanding, and localization performance are illustrated in the following figures.
Outlook
In summary, the researchers explored Llama‑3’s multimodal capabilities and released the 360VL model based on Llama‑3‑70B. As a general‑purpose multimodal model, 360VL demonstrates strong performance on visual question answering and content creation tasks, and the team plans to continue open‑sourcing future versions.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.