BARD-VL Achieves New SOTA for Multimodal Diffusion Models via Autoregressive‑Diffusion Bridge
The BARD-VL framework bridges pretrained autoregressive vision-language models to diffusion-based VLMs through progressive block merging, stage-wise diffusion distillation, and targeted engineering optimizations. Validated on multiple benchmarks, it preserves or surpasses the original models' performance while boosting decoding throughput by up to three times.
AR VLM bottlenecks and diffusion challenges
Autoregressive (AR) vision‑language models (VLMs) achieve strong performance on visual QA, document understanding, and multimodal agents, but token‑by‑token decoding causes computational cost and latency to grow with output length. Directly converting a state‑of‑the‑art AR VLM into a diffusion VLM leads to a "supervision mismatch": AR models predict the next token under clean causal prefixes, whereas diffusion models denoise perturbed tokens, resulting in notable capability degradation.
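To make the mismatch concrete, the toy sketch below contrasts the two training signals (illustrative only; `model` is any callable returning per-position logits, and the masking rate is a placeholder, not BARD-VL code): AR training predicts the next token from a clean causal prefix, while mask-diffusion training recovers original tokens from a perturbed sequence in parallel.

```python
import torch
import torch.nn.functional as F

def ar_loss(model, tokens):
    """Next-token prediction from a clean causal prefix."""
    logits = model(tokens[:, :-1])                     # (batch, seq-1, vocab)
    return F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])

def mask_diffusion_loss(model, tokens, mask_id, mask_rate=0.5):
    """Denoising: recover original tokens at perturbed positions, in parallel."""
    masked = torch.rand(tokens.shape, device=tokens.device) < mask_rate
    noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                              # (batch, seq, vocab)
    return F.cross_entropy(logits[masked], tokens[masked])
```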
BARD core mechanisms
Progressive Supervised Block Merging
BARD starts from a pretrained AR model and builds a small‑block diffusion anchor. It follows a block‑size schedule of 4 → 8 → 16 → 32, progressively enlarging the parallel decoding granularity. At each stage the model only learns to merge adjacent prediction blocks, avoiding a sudden jump to large‑scale parallel decoding and reducing learning difficulty.
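As a rough illustration of the schedule, the sketch below expresses the parallel decoding granularity as an attention mask whose block size doubles at each stage. This is a minimal sketch under the assumption that block diffusion uses a block-causal attention pattern; the function name and training placeholder are hypothetical, not BARD-VL's actual API.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """True where attention is allowed: a token attends to every token in its
    own block and in all earlier blocks (assumed block-causal pattern)."""
    block_id = torch.arange(seq_len) // block_size
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

# Progressive schedule: each stage doubles the block size, so the model only has
# to learn to merge two adjacent blocks it can already decode.
for block_size in (4, 8, 16, 32):
    mask = block_causal_mask(seq_len=64, block_size=block_size)
    # ...train at this granularity, then keep the checkpoint as the anchor
    # (and teacher) for the next stage...
```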
Stage‑wise Diffusion Distillation
Instead of using the original AR model as a teacher, BARD employs the diffusion anchor from the previous stage for supervision. Because both teacher and student operate under diffusion mechanisms, the supervision signal aligns better. Experiments show that with block size 32, diffusion distillation improves MMMU, RealWorldQA, and MMMU‑Pro scores far beyond traditional AR‑based distillation.
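A minimal sketch of what such a distillation objective could look like (an assumed form; BARD-VL's exact loss, weighting, and temperature are not specified here): the frozen previous-stage anchor and the current student score the same noised response, and the student matches the teacher's per-token distribution on the corrupted positions.

```python
import torch
import torch.nn.functional as F

def stagewise_distill_loss(student_logits, teacher_logits, noise_mask, temperature=1.0):
    """KL(teacher || student) over corrupted positions only.

    student_logits / teacher_logits: (batch, seq_len, vocab) from the current model
    and the frozen previous-stage diffusion anchor, both fed the same noised tokens.
    noise_mask: (batch, seq_len), 1 where the token was masked/corrupted.
    """
    noise_mask = noise_mask.float()
    log_s = F.log_softmax(student_logits / temperature, dim=-1)
    log_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_s, log_t, log_target=True, reduction="none").sum(-1)
    return (kl * noise_mask).sum() / noise_mask.sum().clamp(min=1.0)
```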
Engineering optimizations
Mixed-noise scheduler: extends mask diffusion by adding uniform corruption to visible tokens, so the model learns completion and correction simultaneously, improving robustness on complex scenes (a minimal sketch follows after these two items).
Memory-friendly training layout: packs clean and noisy responses into a single sequence with a custom attention mask, drastically reducing GPU memory consumption for long multimodal sequences.
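The sketch below illustrates the mixed-noise idea (an assumed reading of the mechanism; mask_rate, corrupt_rate, and the sampling scheme are placeholders rather than BARD-VL's actual scheduler): some response tokens are replaced by a mask id, and a fraction of the remaining visible tokens is swapped for uniformly random ids, so the loss covers both completion and correction.

```python
import torch

def mixed_noise(tokens, mask_id, vocab_size, mask_rate=0.5, corrupt_rate=0.1):
    """Return a noised copy of `tokens` plus the positions that carry loss.

    tokens: (batch, seq_len) long tensor of response token ids.
    """
    noisy = tokens.clone()
    masked = torch.rand(tokens.shape, device=tokens.device) < mask_rate
    noisy[masked] = mask_id                                   # completion targets
    corrupt = (~masked) & (torch.rand(tokens.shape, device=tokens.device) < corrupt_rate)
    noisy[corrupt] = torch.randint(                           # correction targets
        vocab_size, (int(corrupt.sum()),), device=tokens.device
    )
    return noisy, masked | corrupt
```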
Experimental results
Capability comparison
The team cleaned and combined 4.4M high‑quality samples from LLaVA‑OneVision‑1.5 and FineVision. On seven core benchmarks, BARD‑VL 4B outperformed Qwen3‑VL 4B on five metrics (MMMU +5.1, MME +8, RealWorldQA +1.4, MMStar +6.7, AI2D +1.8) and matched it on the remaining two. At 8B, BARD‑VL improved six metrics (MMMU +1.6, MMMU‑Pro +1.6, MME +14, RealWorldQA +1.2, MMStar +5.1, ChartQA +0.6). Compared with open‑source diffusion VLMs, BARD‑VL 8B surpassed LLaDA‑V 8B on all seven benchmarks, and the 4B version beat Dimple‑VL on all metrics.
Inference efficiency analysis
OCRBench curves show that BARD‑VL 4B maintains higher accuracy across a wide range of decoding throughputs. In an example of extracting structured information from a document, BARD‑VL needed only six diffusion refinement steps versus 35 autoregressive decoding steps for the original Qwen3‑VL, demonstrating a practical speed‑up on long‑output tasks.
Paper: https://arxiv.org/pdf/2604.16514
Code repository: https://github.com/fudan-generative-vision/Bard-VL.git