BARD-VL Achieves New SOTA for Multimodal Diffusion Models via Autoregressive‑Diffusion Bridge
The BARD-VL framework bridges pretrained autoregressive vision-language models to diffusion-based VLMs through progressive block merging, stage-wise diffusion distillation, and targeted engineering optimizations. Validated on multiple benchmarks, it preserves or surpasses the original models' performance while boosting decoding throughput by up to three times.
AR VLM bottlenecks and diffusion challenges
Autoregressive (AR) vision‑language models (VLMs) achieve strong performance on visual QA, document understanding, and multimodal agents, but token‑by‑token decoding causes computational cost and latency to grow with output length. Directly converting a state‑of‑the‑art AR VLM into a diffusion VLM leads to a "supervision mismatch": AR models predict the next token under clean causal prefixes, whereas diffusion models denoise perturbed tokens, resulting in notable capability degradation.
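To make the mismatch concrete, the toy sketch below contrasts the two training signals (illustrative only; `model` is any callable returning per-position logits, and the masking rate is a placeholder, not BARD-VL code): AR training predicts the next token from a clean causal prefix, while mask-diffusion training recovers original tokens from a perturbed sequence in parallel.

```python
import torch
import torch.nn.functional as F

def ar_loss(model, tokens):
    """Next-token prediction from a clean causal prefix."""
    logits = model(tokens[:, :-1])                     # (batch, seq-1, vocab)
    return F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])

def mask_diffusion_loss(model, tokens, mask_id, mask_rate=0.5):
    """Denoising: recover original tokens at perturbed positions, in parallel."""
    masked = torch.rand(tokens.shape, device=tokens.device) < mask_rate
    noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                              # (batch, seq, vocab)
    return F.cross_entropy(logits[masked], tokens[masked])
```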
BARD core mechanisms
Progressive Supervised Block Merging
BARD starts from a pretrained AR model and builds a small‑block diffusion anchor. It follows a block‑size schedule of 4 → 8 → 16 → 32, progressively enlarging the parallel decoding granularity. At each stage the model only learns to merge adjacent prediction blocks, avoiding a sudden jump to large‑scale parallel decoding and reducing learning difficulty.
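As a rough illustration of the schedule, the sketch below expresses the parallel decoding granularity as an attention mask whose block size doubles at each stage. This is a minimal sketch under the assumption that block diffusion uses a block-causal attention pattern; the function name and training placeholder are hypothetical, not BARD-VL's actual API.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """True where attention is allowed: a token attends to every token in its
    own block and in all earlier blocks (assumed block-causal pattern)."""
    block_id = torch.arange(seq_len) // block_size
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

# Progressive schedule: each stage doubles the block size, so the model only has
# to learn to merge two adjacent blocks it can already decode.
for block_size in (4, 8, 16, 32):
    mask = block_causal_mask(seq_len=64, block_size=block_size)
    # ...train at this granularity, then keep the checkpoint as the anchor
    # (and teacher) for the next stage...
```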
Stage‑wise Diffusion Distillation
Instead of using the original AR model as a teacher, BARD employs the diffusion anchor from the previous stage for supervision. Because both teacher and student operate under diffusion mechanisms, the supervision signal aligns better. Experiments show that with block size 32, diffusion distillation improves MMMU, RealWorldQA, and MMMU‑Pro scores far beyond traditional AR‑based distillation.
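A minimal sketch of what such a distillation objective could look like (an assumed form; BARD-VL's exact loss, weighting, and temperature are not specified here): the frozen previous-stage anchor and the current student score the same noised response, and the student matches the teacher's per-token distribution on the corrupted positions.

```python
import torch
import torch.nn.functional as F

def stagewise_distill_loss(student_logits, teacher_logits, noise_mask, temperature=1.0):
    """KL(teacher || student) over corrupted positions only.

    student_logits / teacher_logits: (batch, seq_len, vocab) from the current model
    and the frozen previous-stage diffusion anchor, both fed the same noised tokens.
    noise_mask: (batch, seq_len), 1 where the token was masked/corrupted.
    """
    noise_mask = noise_mask.float()
    log_s = F.log_softmax(student_logits / temperature, dim=-1)
    log_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_s, log_t, log_target=True, reduction="none").sum(-1)
    return (kl * noise_mask).sum() / noise_mask.sum().clamp(min=1.0)
```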
Engineering optimizations
Mixed-noise scheduler: extends mask diffusion by adding uniform corruption to visible tokens, so the model learns completion and correction simultaneously, improving robustness on complex scenes (a minimal sketch follows after these two items).
Memory-friendly training layout: packs clean and noisy responses into a single sequence with a custom attention mask, drastically reducing GPU memory consumption for long multimodal sequences.
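The sketch below illustrates the mixed-noise idea (an assumed reading of the mechanism; mask_rate, corrupt_rate, and the sampling scheme are placeholders rather than BARD-VL's actual scheduler): some response tokens are replaced by a mask id, and a fraction of the remaining visible tokens is swapped for uniformly random ids, so the loss covers both completion and correction.

```python
import torch

def mixed_noise(tokens, mask_id, vocab_size, mask_rate=0.5, corrupt_rate=0.1):
    """Return a noised copy of `tokens` plus the positions that carry loss.

    tokens: (batch, seq_len) long tensor of response token ids.
    """
    noisy = tokens.clone()
    masked = torch.rand(tokens.shape, device=tokens.device) < mask_rate
    noisy[masked] = mask_id                                   # completion targets
    corrupt = (~masked) & (torch.rand(tokens.shape, device=tokens.device) < corrupt_rate)
    noisy[corrupt] = torch.randint(                           # correction targets
        vocab_size, (int(corrupt.sum()),), device=tokens.device
    )
    return noisy, masked | corrupt
```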
Experimental results
Capability comparison
The team cleaned and combined 4.4M high‑quality samples from LLaVA‑OneVision‑1.5 and FineVision. On seven core benchmarks, BARD‑VL 4B outperformed Qwen3‑VL 4B on five metrics (MMMU +5.1, MME +8, RealWorldQA +1.4, MMStar +6.7, AI2D +1.8) and matched it on the remaining two. At 8B, BARD‑VL improved six metrics (MMMU +1.6, MMMU‑Pro +1.6, MME +14, RealWorldQA +1.2, MMStar +5.1, ChartQA +0.6). Compared with open‑source diffusion VLMs, BARD‑VL 8B surpassed LLaDA‑V 8B on all seven benchmarks, and the 4B version beat Dimple‑VL on all metrics.
Inference efficiency analysis
OCRBench curves show that BARD‑VL 4B maintains higher accuracy across a wide range of decoding throughputs. In an example of extracting structured information from a document, BARD‑VL needed only six diffusion refinement steps versus 35 autoregressive decoding steps for the original Qwen3‑VL, demonstrating a practical speed‑up on long‑output tasks.
Paper: https://arxiv.org/pdf/2604.16514
Code repository: https://github.com/fudan-generative-vision/Bard-VL.git