Exploring Multimodal Generative AI: A Tsinghua Tutorial at IJCAI 2025
This article introduces a 1.5‑hour tutorial presented by Tsinghua researchers at IJCAI 2025, covering the latest advances in multimodal generative AI, including multimodal large language models, diffusion models, post‑training generalization techniques, and unified understanding‑generation frameworks.
Overview
The tutorial presents recent research progress in multimodal generative artificial intelligence, concentrating on two dominant technology streams: (1) multimodal large language models (MLLMs) for multimodal understanding and (2) diffusion models for visual generation. It systematically covers probabilistic modeling methods, model architectures, and multimodal interaction mechanisms.
Delivered at IJCAI 2025 (Montreal, 16‑22 August). Tutorial page: https://mn.cs.tsinghua.edu.cn/ijcai25-aigc/
Tutorial Outline (1.5 hours)
Part 1 – Introduction to Generative Models (5 min)
New paradigm of large models
Application domains of multimodal generative AI
Two model families: multimodal LLMs and diffusion models
Part 2 – Multimodal Large Language Models (10 min)
Autoregressive modeling
Vision‑language pre‑training
Visual tokenizers
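The autoregressive modeling named above can be illustrated with a minimal sketch: a toy bigram "language model" over a hypothetical four-token vocabulary, decoded greedily one token at a time. The vocabulary and probability table are invented for illustration and are not from the tutorial.

```python
import numpy as np

# Autoregressive modeling: p(x) = prod_t p(x_t | x_{<t}).
# Toy sketch: a bigram model over a hypothetical 4-token vocabulary,
# decoded greedily token by token.
VOCAB = ["<bos>", "a", "cat", "<eos>"]

# Hypothetical conditional table P[prev][next]; each row sums to 1.
P = np.array([
    [0.0, 0.9, 0.1, 0.0],   # after <bos>
    [0.0, 0.0, 0.8, 0.2],   # after "a"
    [0.0, 0.1, 0.0, 0.9],   # after "cat"
    [0.0, 0.0, 0.0, 1.0],   # after <eos> (absorbing)
])

def generate(max_len=8):
    seq = [0]  # start from <bos>
    for _ in range(max_len):
        nxt = int(np.argmax(P[seq[-1]]))  # greedy decoding step
        seq.append(nxt)
        if VOCAB[nxt] == "<eos>":
            break
    return [VOCAB[i] for i in seq]

print(generate())  # -> ['<bos>', 'a', 'cat', '<eos>']
```

MLLMs apply the same factorization, but over interleaved text and visual tokens produced by a visual tokenizer, with a learned transformer in place of the fixed table.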
Part 3 – Diffusion Models (10 min)
Denoising diffusion probabilistic models
Latent‑space diffusion
Flow matching
Text‑to‑image and text‑to‑video applications
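The denoising diffusion probabilistic model listed above rests on a closed-form forward (noising) process: q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I). A minimal sketch, assuming a simple linear beta schedule and a toy 4-dimensional "image" (both illustrative choices, not the tutorial's exact setup):

```python
import numpy as np

# DDPM forward process with a linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t directly from x_0 in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
x0 = np.ones(4)                    # toy "image": a 4-d vector
x_early = q_sample(x0, 10, rng)    # barely noised, still close to x0
x_late = q_sample(x0, T - 1, rng)  # nearly pure Gaussian noise

# By t = T-1 the signal coefficient sqrt(alpha_bar_t) is tiny.
print(alpha_bars[0], alpha_bars[-1])
```

Training then learns the reverse denoising direction; latent-space diffusion applies the same process in a compressed latent space, and flow matching replaces the stochastic noising chain with a learned deterministic velocity field.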
Part 4 – Post‑Training for New‑Concept Generalization (35 min)
Addresses challenges in dynamic, open environments such as shifting data distributions, emerging concepts, and complex scenarios. Proposes post‑training techniques to improve model adaptability.
Spatial‑decoupled post‑training
Spatio‑temporal decoupled post‑training
Part 5 – Unified Understanding‑Generation Models (15 min)
Probabilistic modeling process for joint understanding and generation
Unified model architecture supporting both tasks
Part 6 – Future Directions (10 min)
Physics‑aware generative AI
Integrated benchmarks for understanding and generation
Multimodal image‑generation AI
Embodied generative AI
Part 7 – Open Discussion (5 min)
Target audience: AI researchers interested in multimodal generative models, multimodal LLMs, and diffusion models. Participants will gain a solid grasp of recent probabilistic modeling methods, architectural designs, and emerging applications.
Source: Zhuanzhi (专知). This article is about 1,000 words; suggested reading time: 5 minutes.
It presents a tutorial from Tsinghua University researchers, "Multimodal Generative AI in Dynamic Open Environments", which is well worth attention.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
