
DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture. It combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention. Trained in three stages on diverse visual‑language and text data, it achieves strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

Baidu Geek Talk

DeepSeek-VL2 is a state‑of‑the‑art multimodal large model based on a Mixture‑of‑Experts (MoE) architecture. It processes images and text for tasks such as image‑text understanding, visual QA, document comprehension and scene description.

The model consists of three core modules: a Vision Encoder (SigLIP‑L with dynamic tiling), a VL Adaptor (two‑layer MLP with 2×2 pixel‑shuffle), and a DeepSeek‑MoE language model that uses Multi‑head Latent Attention (MLA) and sparse activation to improve inference efficiency.
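The effect of the 2×2 pixel‑shuffle in the adaptor can be illustrated with a small NumPy sketch (a simplified stand‑in for the Paddle implementation; the function name and shapes below are illustrative): each 2×2 neighbourhood of vision tokens is folded into one token with 4× the channels, cutting the token count sent to the LLM by a factor of four.

```python
import numpy as np

def pixel_shuffle_2x2(tokens, h, w):
    """Merge each 2x2 neighbourhood of an h-by-w token grid into one token.

    tokens: [h*w, c] vision-encoder outputs; returns [(h//2)*(w//2), 4*c].
    """
    c = tokens.shape[-1]
    x = tokens.reshape(h, w, c)
    # pair up rows and columns, then fold each 2x2 block into the channel dim
    x = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return x.reshape((h // 2) * (w // 2), 4 * c)

# e.g. a 24x24 grid of 1024-dim tokens becomes 144 tokens of 4096 dims
out = pixel_shuffle_2x2(np.zeros((24 * 24, 1024)), 24, 24)
print(out.shape)  # (144, 4096)
```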

Dynamic image‑tiling selects the best resolution from a candidate set CR = {(m·384, n·384) …} by minimizing wasted pixels, enabling high‑resolution inputs to be split into 384×384 tiles. The selected resolution is used to generate a global view and multiple local views, which are tokenised and fed to the LLM.

def select_best_resolution(image_size, candidate_resolutions):
    """Pick the candidate (width, height) that preserves the most image pixels
    after aspect-ratio-preserving downscaling, breaking ties by least padding."""
    original_width, original_height = image_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float("inf")
    for width, height in candidate_resolutions:
        # scale so the image fits inside the candidate without cropping
        scale = min(width / original_width, height / original_height)
        downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
        # pixels of the original actually represented at this candidate
        effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
        # padding pixels that carry no image content
        wasted_resolution = width * height - effective_resolution
        if (effective_resolution > max_effective_resolution or
                (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution)):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (width, height)
    return best_fit
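To make the selection criterion concrete, here is a small standalone helper (`wasted_pixels` is a hypothetical name that mirrors the loop body above) applied to a 1536×768 image against two candidate resolutions:

```python
def wasted_pixels(target, image):
    """Padding pixels left over when `image` is fit inside `target`,
    plus the effective pixel count; mirrors one loop iteration above."""
    tw, th = target
    iw, ih = image
    scale = min(tw / iw, th / ih)
    effective = min(int(iw * scale) * int(ih * scale), iw * ih)
    return tw * th - effective, effective

# a 1536x768 image fits a 768x384 tile grid with zero padding ...
print(wasted_pixels((768, 384), (1536, 768)))  # (0, 294912)
# ... while 768x768 pads an entire empty tile row
print(wasted_pixels((768, 768), (1536, 768)))  # (294912, 294912)
```

Both candidates preserve the same 294,912 effective pixels, so the tie-break on wasted pixels selects 768×384.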

The training pipeline is divided into three stages: (1) initial alignment where the vision encoder and VL adaptor are trained while the LLM is frozen, (2) pre‑training on a mixed visual‑language and pure‑text corpus (≈70 % VL data, 30 % text), and (3) supervised fine‑tuning on high‑quality multimodal instruction data. Various data sources are used, including ShareGPT‑4V, WIT, WikiHow, OBELICS, OCR datasets, VQA, table‑chart, and internal Chinese QA collections.
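The ≈70/30 pre‑training mix can be reproduced with a simple weighted sampler. This is only a sketch of the idea (the actual data loader is not public; names and weights here are illustrative):

```python
import random

def pick_corpus(rng, mix=(("vision-language", 0.7), ("text-only", 0.3))):
    """Choose which corpus the next pre-training batch is drawn from."""
    names, weights = zip(*mix)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [pick_corpus(rng) for _ in range(10_000)]
vl_fraction = draws.count("vision-language") / len(draws)
# vl_fraction lands close to 0.70
```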

Evaluation on benchmarks such as DocVQA, ChartQA, InfoVQA and TextVQA shows strong performance in visual grounding, multi‑image dialogue, and visual story generation.

Implementation details are provided in PaddleMIX. Key code excerpts include the tokeniser that interleaves image tokens with text, the Multi‑head Latent Attention layer, the MoE routing gate, and the expert network.

def tokenize_with_images(self, conversation: str, images: List[Image.Image], bos: bool=True, eos: bool=True, cropping: bool=True):
    assert conversation.count(self.image_token) == len(images)
    text_splits = conversation.split(self.image_token)
    images_list, images_seq_mask, images_spatial_crop = [], [], []
    num_image_tokens = []
    tokenized_str = []
    for text_sep, image in zip(text_splits, images):
        tokenized_sep = self.encode(text_sep, bos=False, eos=False)
        tokenized_str += tokenized_sep
        images_seq_mask += [False] * len(tokenized_sep)
        if cropping:
            best_width, best_height = select_best_resolution(image.size, self.candidate_resolutions)
        else:
            best_width, best_height = self.image_size, self.image_size
        # global view
        global_view = ImageOps.pad(image, (self.image_size, self.image_size),
                                   color=tuple(int(x*255) for x in self.image_transform.mean))
        images_list.append(self.image_transform(global_view))
        # local views
        local_view = ImageOps.pad(image, (best_width, best_height),
                                   color=tuple(int(x*255) for x in self.image_transform.mean))
        for i in range(0, best_height, self.image_size):
            for j in range(0, best_width, self.image_size):
                images_list.append(self.image_transform(local_view.crop((j, i, j+self.image_size, i+self.image_size))))
        # record crop numbers and add image tokens (omitted for brevity)
    # process last text split, add BOS/EOS, etc.
    return (tokenized_str, images_list, images_seq_mask, images_spatial_crop, num_image_tokens)

class DeepseekV2Attention(paddle.nn.Layer):
    """Multi-head Latent Attention (MLA): keys and values are reconstructed
    from a compressed per-token latent, shrinking the KV cache."""
    def __init__(self, config, layer_idx=None):
        super().__init__()
        # initialise q_proj, kv_a_proj_with_mqa, o_proj, RoPE, etc.
    def forward(self, hidden_states, attention_mask=None, position_ids=None,
                past_key_value=None, output_attentions=False, use_cache=False, **kwargs):
        # compute queries, keys, values, apply RoPE, calculate attention scores,
        # apply softmax, dropout, output projection, handle caching
        return attn_output, attn_weights, past_key_value
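The core idea behind MLA, compressing keys and values into a shared low‑rank latent so only the latent needs caching, can be sketched in NumPy (shapes and weight names below are illustrative, not the exact PaddleMIX code):

```python
import numpy as np

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64
rng = np.random.default_rng(0)

# down-projection: the only per-token tensor that needs caching
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02
# up-projections reconstruct per-head keys and values from the latent
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

x = rng.standard_normal((10, d_model))            # hidden states for 10 tokens
c_kv = x @ W_dkv                                  # cached latent: [10, 64]
k = (c_kv @ W_uk).reshape(10, n_heads, d_head)    # [10, 8, 64]
v = (c_kv @ W_uv).reshape(10, n_heads, d_head)
# KV cache per token: 64 floats instead of 2 * 8 * 64 = 1024, a 16x reduction
```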

class DeepseekV2MoE(paddle.nn.Layer):
    def __init__(self, config):
        super().__init__()
        self.experts = paddle.nn.LayerList([DeepseekV2MLP(config, intermediate_size=config.moe_intermediate_size)
                                            for _ in range(config.n_routed_experts)])
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            self.shared_experts = DeepseekV2MLP(config, intermediate_size=config.moe_intermediate_size * config.n_shared_experts)
    def forward(self, hidden_states):
        identity = hidden_states
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        # route tokens to selected experts, combine outputs, add shared experts if present
        return y
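The omitted routing step amounts to: for each token, run its top‑k experts and sum their outputs weighted by the gate. A dense NumPy reference (illustrative only; real implementations dispatch tokens to experts in grouped batches rather than looping):

```python
import numpy as np

def moe_combine(x, experts, topk_idx, topk_weight):
    """Dense reference for MoE routing: each token's output is the
    gate-weighted sum of its top-k experts' outputs.

    x: [n_tokens, d]; experts: list of callables; topk_*: [n_tokens, k].
    """
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e, w in zip(topk_idx[t], topk_weight[t]):
            y[t] += w * experts[e](x[t])
    return y

experts = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]  # toy "experts"
x = np.ones((2, 3))
y = moe_combine(x, experts, np.array([[0, 1], [2, 3]]),
                np.array([[0.5, 0.5], [0.25, 0.75]]))
# token 0: 0.5*1 + 0.5*2 = 1.5; token 1: 0.25*3 + 0.75*4 = 3.75
```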

class MoEGate(paddle.nn.Layer):
    def __init__(self, config):
        super().__init__()
        self.weight = paddle.create_parameter(shape=[config.hidden_size, config.n_routed_experts], dtype='float32')
        self.top_k = config.num_experts_per_tok
        self.scoring_func = config.scoring_func
        self.routed_scaling_factor = config.routed_scaling_factor
        self.norm_topk_prob = config.norm_topk_prob
        self.alpha = config.aux_loss_alpha
    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        logits = paddle.nn.functional.linear(hidden_states.reshape([-1, h]).astype('float32'),
                                              self.weight.astype('float32'))
        if self.scoring_func == "softmax":
            scores = paddle.nn.functional.softmax(logits, axis=-1)
        elif self.scoring_func == "sigmoid":
            scores = logits.sigmoid()
        else:
            raise NotImplementedError(f"Unsupported scoring function: {self.scoring_func}")
        topk_weight, topk_idx = paddle.topk(scores, k=self.top_k, sorted=False, axis=-1)
        if self.top_k > 1 and self.norm_topk_prob:
            denominator = topk_weight.sum(axis=-1, keepdim=True) + 1e-20
            topk_weight = topk_weight / denominator * self.routed_scaling_factor
        else:
            topk_weight = topk_weight * self.routed_scaling_factor
        aux_loss = None
        if self.training and self.alpha > 0.0:
            # compute auxiliary loss for balanced expert usage (omitted for brevity)
            aux_loss = ...
        return topk_idx, topk_weight, aux_loss
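The elided auxiliary loss is typically the standard load‑balancing term from switch‑style MoE: the dot product of each expert's share of routed tokens and its mean gate probability, scaled by the expert count and alpha. A hedged NumPy version (this mirrors the common open‑source DeepSeek‑V2 formulation, not necessarily the exact training code):

```python
import numpy as np

def load_balance_aux_loss(scores, topk_idx, n_experts, alpha):
    """Load-balancing loss: sum over experts of f_i * p_i, where f_i is the
    expert's share of routing slots and p_i its mean gate probability.

    scores: [n_tokens, n_experts] gate probabilities; topk_idx: [n_tokens, k].
    """
    counts = np.bincount(topk_idx.reshape(-1), minlength=n_experts)
    f = counts / topk_idx.size           # fraction of routing slots per expert
    p = scores.mean(axis=0)              # mean gate probability per expert
    return alpha * n_experts * float((f * p).sum())

# perfectly balanced routing with uniform gates yields exactly alpha
scores = np.full((4, 4), 0.25)
idx = np.array([[0, 1], [2, 3], [0, 1], [2, 3]])
print(load_balance_aux_loss(scores, idx, 4, 0.001))  # 0.001
```

Any imbalance raises the product sum above 1/n_experts, so minimising this term pushes the gate toward even expert usage.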

To try the model, users can clone the PaddleMIX repository, install PaddlePaddle (GPU version), install dependencies, and run the provided inference script. An example command runs DeepSeek‑VL2‑tiny on three images and a question, producing a natural‑language answer.
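A minimal setup sketch follows. The repository URL and package names are the public ones, but the inference script path and flags vary by PaddleMIX version, so the last lines are illustrative placeholders; check the repo's DeepSeek‑VL2 example docs for the actual entry point.

```shell
git clone https://github.com/PaddlePaddle/PaddleMIX.git
cd PaddleMIX
pip install paddlepaddle-gpu   # CPU-only machines: pip install paddlepaddle
pip install -r requirements.txt
# illustrative -- locate the real DeepSeek-VL2 inference example in the repo,
# then run it with the deepseek-vl2-tiny weights, your images, and a question
```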

Links to the papers, code repository, and AI Studio tutorial are provided for further exploration.
