Comprehensive Survey of AIGC Research: Papers, Resources, and Technical Overview
This survey serves as a comprehensive "portal" for organizing and navigating AIGC (AI-Generated Content) research. It spans seven areas of multimodal AI—text, image, and audio generation; cross-modal association; text-guided image and audio synthesis; and supporting resources—and details seminal work such as GPT, diffusion models, CLIP, DALL·E, Stable Diffusion, and MusicLM, along with the key papers that shaped each field:
1. Single Modality: Text Recognition and Generation
Focuses primarily on the GPT family of models, including GPT-1, GPT-2, GPT-3, and InstructGPT. Key papers covered include Efficient Training of Language Models to Fill in the Middle, Text and Code Embeddings by Contrastive Pre-Training, WebGPT, Training Verifiers to Solve Math Word Problems, Codex (Evaluating Large Language Models Trained on Code), and Learning to Summarize from Human Feedback.
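The common thread across the GPT family is autoregressive next-token prediction: the model outputs a distribution over the vocabulary and a decoding loop samples (or greedily picks) one token at a time. A minimal toy sketch, with a hypothetical stand-in for the model:

```python
import math

# Toy illustration (not OpenAI's code): GPT-style models generate text
# autoregressively, one token at a time, from a softmax over logits.
def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_fn, prompt, max_new_tokens):
    """Repeatedly pick the most likely next token (the temperature -> 0 limit)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(step_fn(tokens))
        tokens.append(max(range(len(probs)), key=probs.__getitem__))
    return tokens

# Hypothetical model over a 3-token vocabulary that always prefers token 2.
toy_model = lambda tokens: [0.1, 0.2, 3.0]
print(greedy_decode(toy_model, [0], 3))  # -> [0, 2, 2, 2]
```

In a real GPT, `step_fn` is a Transformer forward pass over the token prefix, and sampling with temperature or nucleus filtering replaces the greedy argmax.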
2. Single Modality: Image Recognition and Generation
Covers the transition from GANs to diffusion models. Key architectures include ResNet, Sparse Transformers, MoCo V1/V2/V3, ViT (Vision Transformer), MAE (Masked Autoencoders), VAE, VQ-VAE, VQ-VAE-2, VideoGPT, U-Net, DDPM, Improved DDPM, and GLIDE. The article explains that image generation models follow an "image feature extractor + generator" paradigm.
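The key property behind DDPM training is that the forward (noising) process has a closed form, so any training image can be jumped to an arbitrary timestep in one step. A minimal sketch with an illustrative (not the paper's exact) linear beta schedule:

```python
import numpy as np

# DDPM forward process, closed form:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
# Schedule values below are illustrative defaults, not tuned settings.
def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, alpha_bar, rng):
    """Noise a clean sample x0 directly to timestep t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))  # stand-in for a training image
x_t, eps = q_sample(x0, t=999, alpha_bar=alpha_bar, rng=rng)
# By the final step almost all signal is gone: x_t is nearly pure Gaussian noise.
print(alpha_bar[0], alpha_bar[-1])
```

The denoising network is then trained to predict `eps` from `(x_t, t)`; the generator runs this process in reverse, starting from pure noise.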
3. Single Modality: Audio Recognition and Generation
Highlights Whisper for speech recognition (trained on 680K hours of speech-text pairs with zero-shot capability) and Jukebox for music generation using VQ-VAE. Other papers include Conformer, wav2vec, wav2vec 2.0, and SingSong.
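At the heart of Jukebox's VQ-VAE (stacked hierarchically to compress raw audio) is a simple vector-quantization step: each encoder output vector is snapped to its nearest entry in a learned codebook. A toy sketch with a hypothetical two-entry codebook:

```python
import numpy as np

# Vector quantization as in VQ-VAE: replace each continuous encoder output
# with its nearest codebook embedding, and keep the discrete index.
def quantize(z, codebook):
    # z: (N, D) encoder outputs; codebook: (K, D) learned embeddings
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    idx = d.argmin(axis=1)          # index of nearest codebook entry
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.2], [0.9, 1.2]])
zq, idx = quantize(z, codebook)
print(idx)  # first vector snaps to code 0, second to code 1
```

The discrete indices are what make autoregressive modeling over audio tractable: a Transformer prior is trained over code sequences rather than raw waveforms.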
4. Cross-modal Association
Centers on CLIP's approach (image-text pairing + contrastive learning). Papers include CLAP (audio-text), ViLT, L-Seg, GroupViT, ViLD, GLIP, CLIPasso, CLIP4Clip, ActionCLIP, AudioCLIP, PointCLIP, and research on multimodal neurons.
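CLIP's training signal is a symmetric contrastive loss: in a batch of image-text pairs, matched pairs sit on the diagonal of the similarity matrix, and cross-entropy in both directions pulls them together. A simplified numpy sketch (real CLIP uses learned encoders and a learnable temperature):

```python
import numpy as np

# Simplified CLIP-style contrastive objective over precomputed embeddings.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(logits))             # matched pairs on the diagonal

    def xent(l):                                # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetric: image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 16))
# Perfectly aligned pairs give a near-zero loss; mismatched pairs do not.
print(clip_loss(emb, emb), clip_loss(emb, rng.standard_normal((4, 16))))
```

This pairing-plus-contrastive recipe is what the CLAP, AudioCLIP, and PointCLIP lines of work port to audio and point-cloud modalities.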
5. Cross-modal: Text-guided Image Generation
Covers the evolution from DALL·E (a VQ-VAE-style discrete VAE plus a GPT-style autoregressive Transformer) to DALL·E 2 (a CLIP embedding prior with a GLIDE-style diffusion decoder) to Stable Diffusion/Latent Diffusion. Other models include NÜWA, ERNIE-ViLG, CogView, CogView2, CogVideo, Imagen, and Imagen Video.
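A trick shared by GLIDE, Imagen, and Stable Diffusion for strengthening text conditioning is classifier-free guidance: at each denoising step, the model's conditional and unconditional noise predictions are extrapolated. A minimal sketch of the combination rule:

```python
import numpy as np

# Classifier-free guidance:
#   eps_cfg = eps_uncond + w * (eps_cond - eps_uncond)
# w = 1 recovers the plain conditional prediction; w > 1 pushes the sample
# further toward the text condition at the cost of sample diversity.
def cfg(eps_uncond, eps_cond, guidance_scale):
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # stand-in unconditional noise prediction
eps_c = np.array([1.0, -1.0])  # stand-in text-conditional noise prediction
print(cfg(eps_u, eps_c, 1.0))  # w = 1: the conditional prediction itself
print(cfg(eps_u, eps_c, 7.5))  # w > 1: amplified conditioning direction
```

In practice both predictions come from one network, trained with the text prompt randomly dropped so it learns the unconditional case too.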
6. Cross-modal: Text-guided Audio Generation
Features MusicLM, along with AudioLDM, Moûsai, and neural codec language models for text-to-speech.
7. Additional Resources
Mentions OpenAI Microscope for visualizing model internals and lucidrains' GitHub repositories for quality implementations.
Tencent Cloud Developer