Exploring Multimodal Generative AI: A Tsinghua Tutorial at IJCAI 2025

This article introduces a 1.5‑hour tutorial presented by Tsinghua researchers at IJCAI 2025, covering the latest advances in multimodal generative AI, including multimodal large language models, diffusion models, post‑training generalization techniques, and unified understanding‑generation frameworks.


Overview

The tutorial presents recent research progress in multimodal generative artificial intelligence, concentrating on two dominant technology streams: (1) multimodal large language models (MLLMs) for multimodal understanding and (2) diffusion models for visual generation. It systematically covers probabilistic modeling methods, model architectures, and multimodal interaction mechanisms.

Delivered at IJCAI 2025 (Montreal, 16‑22 August). Tutorial page: https://mn.cs.tsinghua.edu.cn/ijcai25-aigc/

Tutorial Outline (1.5 hours)

Part 1 – Introduction to Generative Models (5 min)

New paradigm of large models

Application domains of multimodal generative AI

Two model families: multimodal LLMs and diffusion models

Part 2 – Multimodal Large Language Models (10 min)

Autoregressive modeling (a minimal sketch follows this list)

Vision‑language pre‑training

Visual tokenizers
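
The bullets above revolve around one idea: a visual tokenizer maps images to discrete codes, which are interleaved with text tokens and modeled by a single causal transformer under a next-token objective. The sketch below is a minimal illustration of that factorization; the class name `ToyMLLM`, the vocabulary sizes, and the random tokens are assumptions for demonstration, not details from the tutorial.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, D_MODEL = 1000, 512, 64  # toy sizes, not from the tutorial

class ToyMLLM(nn.Module):
    """One causal transformer over a shared text + image token vocabulary."""
    def __init__(self):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB           # image codes offset past text ids
        self.embed = nn.Embedding(vocab, D_MODEL)  # positional encoding omitted for brevity
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, vocab)

    def forward(self, tokens):                     # tokens: (batch, seq)
        seq = tokens.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.lm_head(h)                     # next-token logits

# A visual tokenizer would map image patches to discrete codes shifted into
# [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB); random ids stand in for such a mix.
tokens = torch.randint(0, TEXT_VOCAB + IMAGE_VOCAB, (2, 16))
logits = ToyMLLM()(tokens)
loss = nn.functional.cross_entropy(                # predict token i+1 from the prefix
    logits[:, :-1].reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
print(loss.item())
```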

Part 3 – Diffusion Models (10 min)

Denoising diffusion probabilistic models (a toy training step is sketched after this list)

Latent‑space diffusion

Flow matching

Text‑to‑image and text‑to‑video applications
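
For orientation on the first bullet, here is a toy DDPM training step under the standard linear noise schedule; `eps_model`, the schedule constants, and the tensor shapes are illustrative assumptions rather than the tutorial's settings. The closing comment notes how flow matching changes only the regression target.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # standard linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative \bar{alpha}_t

def ddpm_loss(eps_model, x0):
    """One DDPM training step: corrupt x0 in closed form, regress the noise."""
    t = torch.randint(0, T, (x0.size(0),))        # random timestep per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward process q(x_t | x_0)
    return torch.nn.functional.mse_loss(eps_model(x_t, t), eps)

# `eps_model` stands in for a U-Net or DiT noise predictor; a constant-zero
# predictor keeps the sketch self-contained and runnable.
x0 = torch.randn(4, 3, 32, 32)                    # toy image batch
print(ddpm_loss(lambda x_t, t: torch.zeros_like(x_t), x0).item())

# Flow matching keeps the recipe but swaps the target: along the straight
# path x_s = (1 - s) * x0 + s * eps, the model regresses the velocity
# (eps - x0) for s ~ U[0, 1]. Latent-space diffusion runs the same loop on
# autoencoder latents instead of pixels.
```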

Part 4 – Post‑Training for New‑Concept Generalization (35 min)

This part addresses the challenges posed by dynamic, open environments, such as shifting data distributions, emerging concepts, and complex scenarios, and proposes post-training techniques that improve model adaptability (a generic stand-in is sketched after the list below).

Spatial‑decoupled post‑training

Spatio‑temporal decoupled post‑training
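
The spatial-decoupled and spatio-temporal-decoupled methods are the presenters' own contributions and are not reproduced here. As a generic stand-in for the broader idea of post-training, the sketch below freezes a pretrained backbone and tunes only a small adapter on new-concept data; every module, shape, and batch in it is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

# Generic parameter-efficient post-training: freeze the pretrained backbone,
# train only a small adapter on new-concept data. Illustrative stand-in only;
# NOT the tutorial's spatial-decoupled or spatio-temporal-decoupled method.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
for p in backbone.parameters():
    p.requires_grad = False                        # keep pretrained weights fixed

adapter = nn.Linear(64, 64)                        # the only trainable module
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

def post_training_step(x, target):
    feats = backbone(x)                            # frozen features
    loss = nn.functional.mse_loss(adapter(feats), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x, target = torch.randn(8, 64), torch.randn(8, 64) # stand-in new-concept batch
print(post_training_step(x, target))
```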

Part 5 – Unified Understanding‑Generation Models (15 min)

Probabilistic modeling process for joint understanding and generation (one common factorization is sketched below)

Unified model architecture supporting both tasks
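
One common way to read "unified understanding and generation" is as a single likelihood over interleaved multimodal tokens, so the same model can condition on an image to answer questions and condition on text to emit image tokens. The factorization below is a hedged sketch of that reading, not necessarily the tutorial's exact formulation.

```latex
% One autoregressive likelihood over interleaved text and visual tokens:
% conditioning on image tokens yields understanding, conditioning on text
% tokens yields generation.
p_\theta(z_1, \dots, z_N) = \prod_{i=1}^{N} p_\theta\left(z_i \mid z_{<i}\right),
\qquad z_i \in \mathcal{V}_{\text{text}} \cup \mathcal{V}_{\text{image}}
```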

Part 6 – Future Directions (10 min)

Physics‑aware generative AI

Integrated benchmarks for understanding and generation

Multimodal image‑generation AI

Embodied generative AI

Part 7 – Open Discussion (5 min)

Target audience: AI researchers interested in multimodal generative models, multimodal LLMs, and diffusion models. Participants will gain a solid grasp of recent probabilistic modeling methods, architectural designs, and emerging applications.

Source: Zhuanzhi (专知). This article is about 1,000 words; suggested reading time: 5 minutes.
Tsinghua researchers present the tutorial "Multimodal Generative AI in Dynamic Open Environments", which is well worth attention.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: multimodal AI, large language models, tutorial, diffusion models, generative models, IJCAI 2025
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
