Exploring Multimodal Generative AI: A Tsinghua Tutorial at IJCAI 2025
This article introduces a 1.5‑hour tutorial presented by Tsinghua researchers at IJCAI 2025, covering the latest advances in multimodal generative AI, including multimodal large language models, diffusion models, post‑training generalization techniques, and unified understanding‑generation frameworks.
Overview
The tutorial presents recent research progress in multimodal generative artificial intelligence, concentrating on two dominant technology streams: (1) multimodal large language models (MLLMs) for multimodal understanding and (2) diffusion models for visual generation. It systematically covers probabilistic modeling methods, model architectures, and multimodal interaction mechanisms.
Delivered at IJCAI 2025 (Montreal, 16‑22 August). Tutorial page: https://mn.cs.tsinghua.edu.cn/ijcai25-aigc/
Tutorial Outline (1.5 hours)
Part 1 – Introduction to Generative Models (5 min)
New paradigm of large models
Application domains of multimodal generative AI
Two model families: multimodal LLMs and diffusion models
Part 2 – Multimodal Large Language Models (10 min)
Autoregressive modeling
Vision‑language pre‑training
Visual tokenizers
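The autoregressive modeling named above can be illustrated with a minimal sketch: a toy bigram "language model" over a hypothetical four-token vocabulary, decoded greedily one token at a time. The vocabulary and probability table are invented for illustration and are not from the tutorial.

```python
import numpy as np

# Autoregressive modeling: p(x) = prod_t p(x_t | x_{<t}).
# Toy sketch: a bigram model over a hypothetical 4-token vocabulary,
# decoded greedily token by token.
VOCAB = ["<bos>", "a", "cat", "<eos>"]

# Hypothetical conditional table P[prev][next]; each row sums to 1.
P = np.array([
    [0.0, 0.9, 0.1, 0.0],   # after <bos>
    [0.0, 0.0, 0.8, 0.2],   # after "a"
    [0.0, 0.1, 0.0, 0.9],   # after "cat"
    [0.0, 0.0, 0.0, 1.0],   # after <eos> (absorbing)
])

def generate(max_len=8):
    seq = [0]  # start from <bos>
    for _ in range(max_len):
        nxt = int(np.argmax(P[seq[-1]]))  # greedy decoding step
        seq.append(nxt)
        if VOCAB[nxt] == "<eos>":
            break
    return [VOCAB[i] for i in seq]

print(generate())  # -> ['<bos>', 'a', 'cat', '<eos>']
```

MLLMs apply the same factorization, but over interleaved text and visual tokens produced by a visual tokenizer, with a learned transformer in place of the fixed table.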
Part 3 – Diffusion Models (10 min)
Denoising diffusion probabilistic models
Latent‑space diffusion
Flow matching
Text‑to‑image and text‑to‑video applications
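The denoising diffusion probabilistic model listed above rests on a closed-form forward (noising) process: q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I). A minimal sketch, assuming a simple linear beta schedule and a toy 4-dimensional "image" (both illustrative choices, not the tutorial's exact setup):

```python
import numpy as np

# DDPM forward process with a linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t directly from x_0 in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
x0 = np.ones(4)                    # toy "image": a 4-d vector
x_early = q_sample(x0, 10, rng)    # barely noised, still close to x0
x_late = q_sample(x0, T - 1, rng)  # nearly pure Gaussian noise

# By t = T-1 the signal coefficient sqrt(alpha_bar_t) is tiny.
print(alpha_bars[0], alpha_bars[-1])
```

Training then learns the reverse denoising direction; latent-space diffusion applies the same process in a compressed latent space, and flow matching replaces the stochastic noising chain with a learned deterministic velocity field.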
Part 4 – Post‑Training for New‑Concept Generalization (35 min)
Addresses challenges in dynamic, open environments such as shifting data distributions, emerging concepts, and complex scenarios. Proposes post‑training techniques to improve model adaptability.
Spatial‑decoupled post‑training
Spatio‑temporal decoupled post‑training
Part 5 – Unified Understanding‑Generation Models (15 min)
Probabilistic modeling process for joint understanding and generation
Unified model architecture supporting both tasks
Part 6 – Future Directions (10 min)
Physics‑aware generative AI
Integrated benchmarks for understanding and generation
Multimodal image‑generation AI
Embodied generative AI
Part 7 – Open Discussion (5 min)
Target audience: AI researchers interested in multimodal generative models, multimodal LLMs, and diffusion models. Participants will gain a solid grasp of recent probabilistic modeling methods, architectural designs, and emerging applications.
Source: Zhuanzhi (专知). This article is about 1,000 words; suggested reading time: 5 minutes.
It presents a tutorial from Tsinghua University researchers, "Multimodal Generative AI in Dynamic Open Environments", which is well worth attention.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
