
Controllable Mind Visual Diffusion Model (CMVDM) for Reconstructing Visual Stimuli from fMRI Signals

The Controllable Mind Visual Diffusion Model (CMVDM) decodes fMRI signals into semantic vectors and silhouette maps, feeds them into a latent diffusion framework with a ControlNet‑style encoder, and reconstructs high‑fidelity images that surpass existing baselines in both structural similarity and semantic accuracy across multiple brain‑imaging datasets.


Recent research has focused on decoding visual information from brain activity and reconstructing the original images. A CVPR‑accepted paper demonstrated that diffusion models can generate images directly from fMRI signals, effectively “reading the mind”.

The paper presented at AAAI 2024 by the Xiaohongshu multimodal team introduces the Controllable Mind Visual Diffusion Model (CMVDM). CMVDM combines fMRI‑derived semantic and contour information with a latent diffusion model (LDM) to produce high‑quality images that align with the semantics and spatial structure of the original visual stimulus.
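At a high level, the inference flow this describes can be sketched as follows. This is a minimal sketch under assumed module names (the semantic decoder, silhouette decoder, controlled LDM, and VAE decoder are placeholders, not the authors' actual classes):

```python
import torch

@torch.no_grad()
def reconstruct(fmri, semantic_decoder, silhouette_decoder, controlled_ldm, vae_decoder):
    """Hypothetical CMVDM-style inference: fMRI -> conditions -> image."""
    semantic = semantic_decoder(fmri)        # semantic vector aligned with CLIP space
    silhouette = silhouette_decoder(fmri)    # coarse contour/silhouette map
    # Sample latents from the latent diffusion model, steered by both signals
    # through a ControlNet-style control branch.
    latents = controlled_ldm.sample(semantic_cond=semantic, spatial_control=silhouette)
    return vae_decoder(latents)              # decode latents back to pixel space
```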

CMVDM works in three stages; a short code sketch of each follows the list:

Semantic extraction: an fMRI feature extractor pretrained on the large-scale HCP dataset is fine-tuned with a semantic alignment loss that pulls its extracted features toward CLIP image embeddings of the viewed stimuli.

Silhouette extraction: a symmetric encoder‑decoder network predicts contour (silhouette) maps from fMRI signals, supervised by SSIM and MAE losses.

Control network: inspired by ControlNet, a trainable clone of the U‑Net encoder receives noisy latents, semantic vectors, and silhouette maps; a residual block injects additional fMRI information not captured by the semantics or contours.
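For the semantic stage, the exact alignment loss is not reproduced here; a cosine-distance formulation is one plausible form, sketched below (the shapes and the choice of cosine distance are assumptions):

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(fmri_features: torch.Tensor,
                            clip_image_embeddings: torch.Tensor) -> torch.Tensor:
    """Pull fMRI-derived features toward CLIP embeddings of the seen images.

    A cosine-distance formulation is assumed; the paper may weight or
    formulate this term differently.
    """
    fmri_features = F.normalize(fmri_features, dim=-1)
    clip_image_embeddings = F.normalize(clip_image_embeddings, dim=-1)
    cosine_sim = (fmri_features * clip_image_embeddings).sum(dim=-1)
    return (1.0 - cosine_sim).mean()
```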
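For the silhouette stage, a combined SSIM-plus-MAE objective could look like the following; kornia's SSIM loss stands in for whatever implementation the authors use, and the term weights are assumptions:

```python
import torch.nn.functional as F
from kornia.losses import ssim_loss  # (1 - SSIM)-style loss; a stand-in implementation

def silhouette_loss(pred, target, ssim_weight=1.0, mae_weight=1.0):
    """Supervise predicted contour maps with SSIM + MAE terms.

    `pred` and `target` are assumed to be (B, 1, H, W) maps in [0, 1].
    """
    structural = ssim_loss(pred, target, window_size=11, max_val=1.0)
    pixelwise = F.l1_loss(pred, target)
    return ssim_weight * structural + mae_weight * pixelwise
```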
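For the control network, the ControlNet-style idea (a trainable clone of the encoder plus a zero-initialized convolution, with an extra residual path for raw fMRI information) can be sketched roughly as below; the latent resolution, channel counts, and the encoder's call signature are all assumptions, not the paper's exact design:

```python
import copy
import torch
import torch.nn as nn

class ControlBranch(nn.Module):
    """ControlNet-style control branch, illustrative only.

    Assumes 64x64 latents with `latent_channels` channels and an encoder
    that takes (latents, timestep, context), like an LDM U-Net encoder.
    """

    def __init__(self, unet_encoder, fmri_dim=4096, latent_channels=4, feature_channels=320):
        super().__init__()
        self.encoder_copy = copy.deepcopy(unet_encoder)  # trainable clone; the original LDM stays frozen
        self.silhouette_proj = nn.Conv2d(1, latent_channels, 3, padding=1)
        # Residual path injecting fMRI details not covered by semantics or contours.
        self.fmri_residual = nn.Sequential(
            nn.Linear(fmri_dim, latent_channels * 64 * 64),
            nn.SiLU(),
        )
        # Zero-initialized conv so the branch has no effect at the start of training.
        self.zero_conv = nn.Conv2d(feature_channels, feature_channels, 1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, noisy_latents, timestep, semantic, silhouette, fmri):
        b, c, h, w = noisy_latents.shape                 # assumes 64x64 latents, matching the Linear above
        extra = self.fmri_residual(fmri).view(b, c, h, w)
        control_in = noisy_latents + self.silhouette_proj(silhouette) + extra
        features = self.encoder_copy(control_in, timestep, context=semantic)
        return self.zero_conv(features)                  # added back into the frozen U-Net's features
```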

The model is evaluated on two datasets—Generic Objects Dataset (GOD) and BOLD5000—against four state‑of‑the‑art baselines (Beliy, Gaziv, IC‑GAN, MinD‑Vis). Metrics include classification accuracy (Acc), Pearson correlation coefficient (PCC), and structural similarity index (SSIM). CMVDM consistently outperforms baselines, especially in SSIM, indicating superior preservation of object outlines.
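As a rough guide to what PCC and SSIM measure here, the per-pair computation might look like the sketch below; classification accuracy additionally requires a pretrained n-way classifier and is omitted, and the array shapes and value ranges are assumptions:

```python
import numpy as np
from scipy.stats import pearsonr
from skimage.metrics import structural_similarity

def evaluate_pair(reconstruction: np.ndarray, ground_truth: np.ndarray) -> dict:
    """PCC and SSIM for one reconstructed / ground-truth image pair.

    Both images are assumed to be float arrays in [0, 1] with shape (H, W, 3).
    """
    pcc, _ = pearsonr(reconstruction.ravel(), ground_truth.ravel())
    ssim = structural_similarity(reconstruction, ground_truth,
                                 channel_axis=-1, data_range=1.0)
    return {"PCC": float(pcc), "SSIM": float(ssim)}
```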

Ablation studies confirm that the semantic alignment loss, silhouette module, and residual block each contribute to improved semantic accuracy and structural similarity. Further analysis shows that low‑level visual cortex (LVC) signals yield higher SSIM, while high‑level visual cortex (HVC) signals provide better semantic accuracy, demonstrating distinct roles of visual areas in the reconstruction task.

The authors conclude that CMVDM effectively decomposes brain‑signal‑to‑image reconstruction into feature extraction and image synthesis, achieving SOTA performance on multiple benchmarks and offering a versatile diffusion‑based framework for controllable generation from neural data.

Paper: https://arxiv.org/pdf/2305.10135.pdf

Tags: AI, diffusion model, brain decoding, fMRI, neuroscience, visual reconstruction
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
