Must‑Read AI Agent and LLM Research Papers for Deep Understanding
This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.
Interesting AI Agents
Generative Agents: Interactive Simulacra of Human Behavior – https://arxiv.org/abs/2304.03442
RoleLLM: Benchmarking, Eliciting, and Enhancing Role‑Playing Abilities of Large Language Models – https://arxiv.org/abs/2310.00746
Role play with large language models – https://www.nature.com/articles/s41586-023-06647-8
Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf – https://arxiv.org/abs/2309.04658
MemGPT: Towards LLMs as Operating Systems – https://arxiv.org/abs/2310.08560
Augmenting Language Models with Long‑Term Memory – https://arxiv.org/abs/2306.07174
Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models – https://arxiv.org/abs/2307.16180
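Several of these papers hinge on agent memory. The Generative Agents paper, for instance, ranks stored memories by a weighted sum of recency, importance, and relevance at retrieval time. Below is a minimal sketch of that scoring; the decay rate, weights, and memory layout are illustrative assumptions, not the paper's exact values.

```python
import math
import time

def retrieval_score(memory, query_embedding, now,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0,
                    decay=0.995):
    # Recency: exponential decay per hour since last access (illustrative rate).
    hours = (now - memory["last_accessed"]) / 3600
    recency = decay ** hours
    # Importance: a 0..1 score assigned when the memory was stored.
    importance = memory["importance"]
    # Relevance: cosine similarity between query and memory embeddings.
    dot = sum(a * b for a, b in zip(query_embedding, memory["embedding"]))
    na = math.sqrt(sum(a * a for a in query_embedding))
    nb = math.sqrt(sum(b * b for b in memory["embedding"]))
    relevance = dot / (na * nb) if na and nb else 0.0
    return w_recency * recency + w_importance * importance + w_relevance * relevance

def retrieve(memories, query_embedding, k=3):
    now = time.time()
    return sorted(memories, key=lambda m: retrieval_score(m, query_embedding, now),
                  reverse=True)[:k]

mems = [
    {"last_accessed": time.time() - 3600, "importance": 0.9, "embedding": [1.0, 0.0]},
    {"last_accessed": time.time() - 86400, "importance": 0.2, "embedding": [0.0, 1.0]},
]
print(retrieve(mems, query_embedding=[1.0, 0.0], k=1))
```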
Useful AI Agents
The Rise and Potential of Large Language Model‑Based Agents: A Survey – https://arxiv.org/abs/2309.07864
MetaGPT: Meta Programming for A Multi‑Agent Collaborative Framework – https://arxiv.org/abs/2308.00352
Communicative Agents for Software Development – https://arxiv.org/abs/2307.07924
Large Language Models Can Self‑Improve – https://arxiv.org/abs/2210.11610
Evaluating Human‑Language Model Interaction – https://arxiv.org/abs/2212.09746
Large Language Models can Learn Rules – https://arxiv.org/abs/2310.07064
AgentBench: Evaluating LLMs as Agents – https://arxiv.org/abs/2308.03688
WebArena: A Realistic Web Environment for Building Autonomous Agents – https://arxiv.org/abs/2307.13854
TableGPT: Towards Unifying Tables, Natural Language and Commands into One GPT – https://arxiv.org/abs/2307.08674
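MetaGPT and Communicative Agents for Software Development both wire role-prompted LLM instances into a structured exchange. A toy sketch of that pattern follows; `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, not an API from either paper.

```python
def call_llm(system_prompt: str, messages: list[dict]) -> str:
    """Placeholder for any chat-completion client (hosted API or local model)."""
    raise NotImplementedError

def two_role_exchange(task: str, rounds: int = 2) -> list[dict]:
    # Each role gets its own system prompt; both see the shared transcript.
    roles = {
        "engineer": "You are a software engineer. Propose or revise code for the task.",
        "reviewer": "You are a code reviewer. Point out concrete defects and risks.",
    }
    transcript = [{"role": "user", "content": f"Task: {task}"}]
    for _ in range(rounds):
        for name, system in roles.items():
            reply = call_llm(system, transcript)
            transcript.append({"role": "assistant", "content": f"[{name}] {reply}"})
    return transcript
```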
Task Planning and Decomposition
Chain‑of‑Thought Prompting Elicits Reasoning in Large Language Models – https://arxiv.org/abs/2201.11903
Tree of Thoughts: Deliberate Problem Solving with Large Language Models – https://arxiv.org/abs/2305.10601
Implicit Chain of Thought Reasoning via Knowledge Distillation – https://arxiv.org/abs/2311.01460
ReAct: Synergizing Reasoning and Acting in Language Models – https://arxiv.org/abs/2210.03629
ART: Automatic Multi‑step Reasoning and Tool‑use for Large Language Models – https://arxiv.org/abs/2303.09014
Branch‑Solve‑Merge Improves Large Language Model Evaluation and Generation – https://arxiv.org/abs/2310.15123
WizardLM: Empowering Large Language Models to Follow Complex Instructions – https://arxiv.org/abs/2304.12244
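Chain-of-thought prompting, the technique behind the first paper above, simply includes worked examples whose answers spell out intermediate reasoning. A minimal prompt builder; the exemplar is adapted from the paper, and the trailing "think step by step" cue comes from the related zero-shot CoT work.

```python
# Few-shot exemplars whose answers include the reasoning, not just the result.
FEW_SHOT = [
    ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?",
     "Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11."),
]

def cot_prompt(question: str) -> str:
    parts = []
    for q, reasoned_answer in FEW_SHOT:
        parts.append(f"Q: {q}\nA: {reasoned_answer}")
    # Leave the answer open so the model continues with its own reasoning chain.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

print(cot_prompt("A bakery sells 12 muffins per tray. How many muffins in 4 trays?"))
```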
Hallucination
Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models – https://arxiv.org/abs/2309.01219
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback – https://arxiv.org/abs/2302.12813
SelfCheckGPT: Zero‑Resource Black‑Box Hallucination Detection for Generative Large Language Models – https://arxiv.org/abs/2303.08896
WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus – https://arxiv.org/abs/2304.04358
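SelfCheckGPT's key observation is that hallucinated claims are unstable under resampling: ask the model the same question several times and flag claims the samples disagree on. A crude sketch using token overlap as the agreement measure; the paper itself uses stronger scorers (BERTScore, QA, and similar), and the threshold here is an illustrative assumption.

```python
def consistency_score(claim: str, samples: list[str]) -> float:
    """Fraction of resampled answers that share most of the claim's tokens."""
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return 0.0
    hits = 0
    for s in samples:
        overlap = len(claim_tokens & set(s.lower().split())) / len(claim_tokens)
        if overlap >= 0.7:  # illustrative threshold
            hits += 1
    return hits / len(samples)

# Three resampled answers agree with each other but not with the claim below,
# so the claim gets a low consistency score and would be flagged.
samples = [
    "Marie Curie won Nobel Prizes in Physics and Chemistry.",
    "Curie received the Nobel Prize in Physics and later in Chemistry.",
    "Marie Curie was awarded two Nobel Prizes, in Physics and Chemistry.",
]
print(consistency_score("Marie Curie won a Nobel Prize in Literature", samples))
```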
Multimodal Learning
Learning Transferable Visual Models From Natural Language Supervision (CLIP) – https://arxiv.org/abs/2103.00020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) – https://arxiv.org/abs/2010.11929
MiniGPT‑v2: Large Language Model as a Unified Interface for Vision‑Language Multi‑task Learning – https://arxiv.org/abs/2310.09478
MiniGPT‑4: Enhancing Vision‑Language Understanding with Advanced Large Language Models – https://arxiv.org/abs/2304.10592
NExT‑GPT: Any‑to‑Any Multimodal LLM – https://arxiv.org/abs/2309.05519
Visual Instruction Tuning (LLaVA) – https://arxiv.org/abs/2304.08485
LLaVA‑1.5: Improved Baselines with Visual Instruction Tuning – https://arxiv.org/abs/2310.03744
Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM) – https://arxiv.org/abs/2312.00785
CoDi‑2: In‑Context, Interleaved, and Interactive Any‑to‑Any Generation – https://arxiv.org/abs/2311.18775
Neural Discrete Representation Learning (VQ‑VAE) – https://arxiv.org/abs/1711.00937
Taming Transformers for High‑Resolution Image Synthesis (VQ‑GAN) – https://arxiv.org/abs/2012.09841
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows – https://arxiv.org/abs/2103.14030
BLIP‑2: Bootstrapping Language‑Image Pre‑training with Frozen Image Encoders and Large Language Models – https://arxiv.org/abs/2301.12597
InstructBLIP: Towards General‑purpose Vision‑Language Models with Instruction Tuning – https://arxiv.org/abs/2305.06500
ImageBind: One Embedding Space To Bind Them All – https://arxiv.org/abs/2305.05665
Meta‑Transformer: A Unified Framework for Multimodal Learning – https://arxiv.org/abs/2307.10802
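CLIP's zero-shot recipe embeds an image and a set of candidate captions into a shared space, then softmaxes their cosine similarities. A NumPy sketch of just that scoring step, assuming the encoder outputs are already computed; the embedding size and temperature value are illustrative.

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray,
                     temperature: float = 0.01) -> np.ndarray:
    """Softmax over cosine similarities between one image and N caption embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature          # cosine similarity, sharpened
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
image = rng.normal(size=512)             # stand-in for an image encoder output
captions = rng.normal(size=(3, 512))     # stand-ins for "a photo of a {cat,dog,car}"
print(zero_shot_scores(image, captions))
```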
Image/Video Generation
High‑Resolution Image Synthesis with Latent Diffusion Models – https://arxiv.org/abs/2112.10752
Structure and Content‑Guided Video Synthesis with Diffusion Models (Runway Gen‑1) – https://arxiv.org/abs/2302.03011
Hierarchical Text‑Conditional Image Generation with CLIP Latents (DALL·E 2) – https://arxiv.org/abs/2204.06125
AnimateDiff: Animate Your Personalized Text‑to‑Image Diffusion Models without Specific Tuning – https://arxiv.org/abs/2307.04725
ControlNet: Adding Conditional Control to Text‑to‑Image Diffusion Models – https://arxiv.org/abs/2302.05543
SDXL: Improving Latent Diffusion Models for High‑Resolution Image Synthesis – https://arxiv.org/abs/2307.01952
Zero‑1‑to‑3: Zero‑shot One Image to 3D Object – https://arxiv.org/abs/2303.11328
Scaling Vision Transformers to 22 Billion Parameters – https://arxiv.org/abs/2302.05442
Glow: Generative Flow with Invertible 1×1 Convolutions – https://arxiv.org/abs/1807.03039
Language Model Beats Diffusion – Tokenizer is Key to Visual Generation – https://arxiv.org/abs/2310.05737
InstaFlow: One Step is Enough for High‑Quality Diffusion‑Based Text‑to‑Image Generation – https://arxiv.org/abs/2309.06380
Perceptual Losses for Real‑Time Style Transfer and Super‑Resolution – https://arxiv.org/abs/1603.08155
CogView: Mastering Text‑to‑Image Generation via Transformers – https://arxiv.org/abs/2105.13290
Diffusion Models for Video Prediction and Infilling – https://arxiv.org/abs/2206.07696
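The diffusion papers above all share the same sampling skeleton: start from Gaussian noise and repeatedly subtract a network's noise prediction. A stripped-down DDPM-style loop in NumPy; `predict_noise` stands in for the trained denoiser (a U‑Net in latent diffusion), and the schedule values are illustrative.

```python
import numpy as np

def predict_noise(x: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for the trained network that predicts the noise added at step t."""
    return np.zeros_like(x)

def ddpm_sample(shape=(4, 4), steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # illustrative linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)               # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t)
        # Posterior mean step from the DDPM paper: strip predicted noise, rescale.
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=shape)  # re-inject sampling noise
    return x

print(ddpm_sample().round(2))
```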
Foundational Large‑Model Papers
Attention Is All You Need – https://arxiv.org/abs/1706.03762
Sequence to Sequence Learning with Neural Networks – https://arxiv.org/abs/1409.3215
Neural Machine Translation by Jointly Learning to Align and Translate – https://arxiv.org/abs/1409.0473
BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding – https://arxiv.org/abs/1810.04805
Scaling Laws for Neural Language Models – https://arxiv.org/abs/2001.08361
Emergent Abilities of Large Language Models – https://openreview.net/pdf?id=yzkSU5zdwD
Training Compute‑Optimal Large Language Models (Chinchilla scaling law) – https://arxiv.org/abs/2203.15556
Scaling Instruction‑Finetuned Language Models – https://arxiv.org/abs/2210.11416
Direct Preference Optimization: Your Language Model is Secretly a Reward Model – https://arxiv.org/abs/2305.18290
Progress Measures for Grokking via Mechanistic Interpretability – https://arxiv.org/abs/2301.05217
Language Models Represent Space and Time – https://arxiv.org/abs/2310.02207
GLaM: Efficient Scaling of Language Models with Mixture‑of‑Experts – https://arxiv.org/abs/2112.06905
Adam: A Method for Stochastic Optimization – https://arxiv.org/abs/1412.6980
Efficient Estimation of Word Representations in Vector Space (Word2Vec) – https://arxiv.org/abs/1301.3781
Distributed Representations of Words and Phrases and Their Compositionality – https://arxiv.org/abs/1310.4546
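The core of "Attention Is All You Need" fits in a few lines: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A NumPy reference implementation of scaled dot-product attention:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 5, 8)) for _ in range(3))  # (batch, seq, d_k)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 5, 8)
```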
GPT Series
Language Models are Few‑Shot Learners (GPT‑3) – https://arxiv.org/abs/2005.14165
Language Models are Unsupervised Multitask Learners (GPT‑2) – https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Improving Language Understanding by Generative Pre‑Training (GPT‑1) – https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Training language models to follow instructions with human feedback (InstructGPT) – https://arxiv.org/abs/2203.02155
Evaluating Large Language Models Trained on Code – https://arxiv.org/abs/2107.03374
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond – https://arxiv.org/abs/2304.13712
Instruction Tuning with GPT‑4 – https://arxiv.org/abs/2304.03277
The Dawn of LMMs: Preliminary Explorations with GPT‑4V (Vision) – https://arxiv.org/abs/2309.17421
Sparks of Artificial General Intelligence: Early Experiments with GPT‑4 – https://arxiv.org/abs/2303.12712
Weak‑to‑Strong Generalization: Eliciting Strong Capabilities With Weak Supervision – https://arxiv.org/abs/2312.09390
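GPT‑3's central claim is in-context few-shot learning: the task is specified entirely by demonstrations in the prompt, with no gradient updates. A minimal template in the paper's English-to-French style; the demonstration pairs follow examples shown in the paper.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str,
                    instruction: str = "Translate English to French.") -> str:
    # Task description, then demonstrations, then the open-ended query.
    lines = [instruction]
    for src, tgt in examples:
        lines.append(f"English: {src}\nFrench: {tgt}")
    lines.append(f"English: {query}\nFrench:")
    return "\n\n".join(lines)

print(few_shot_prompt(
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
))
```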
Open‑Source Large Models
LLaMA: Open and Efficient Foundation Language Models – https://arxiv.org/abs/2302.13971
Llama 2: Open Foundation and Fine‑Tuned Chat Models – https://arxiv.org/abs/2307.09288
Vicuna: An Open‑Source Chatbot Impressing GPT‑4 with 90% ChatGPT Quality – https://lmsys.org/blog/2023-03-30-vicuna/
LMSYS‑Chat‑1M: A Large‑Scale Real‑World LLM Conversation Dataset – https://arxiv.org/abs/2309.11998
Judging LLM‑as‑a‑Judge with MT‑Bench and Chatbot Arena – https://arxiv.org/abs/2306.05685
How Long Can Open‑Source LLMs Truly Promise on Context Length? – https://lmsys.org/blog/2023-06-29-longchat/
Mixtral of Experts – https://mistral.ai/news/mixtral-of-experts/
OpenChat: Advancing Open‑source Language Models with Mixed‑Quality Data – https://arxiv.org/abs/2309.11235
RWKV: Reinventing RNNs for the Transformer Era – https://arxiv.org/abs/2305.13048
Mamba: Linear‑Time Sequence Modeling with Selective State Spaces – https://arxiv.org/abs/2312.00752
Retentive Network: A Successor to Transformer for Large Language Models – https://arxiv.org/abs/2307.08621
Baichuan 2: Open Large‑scale Language Models – https://arxiv.org/abs/2309.10305
GLM‑130B: An Open Bilingual Pre‑trained Model – https://arxiv.org/abs/2210.02414
Qwen Technical Report – https://arxiv.org/abs/2309.16609
Skywork: A More Open Bilingual Foundation Model – https://arxiv.org/abs/2310.19341
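Mixtral replaces each feed-forward block with eight experts, routes every token to the top two by gate logits, and mixes their outputs with softmax weights taken over just those selected logits. A NumPy sketch of that routing for a single token; the expert count, dimensions, and random "experts" here are illustrative stand-ins.

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    """Route token x to the top-k experts by gate logits; mix with softmax weights."""
    logits = gate_W @ x                    # one routing logit per expert
    top = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                           # softmax over the selected experts only
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_W = rng.normal(size=(n_experts, d))
# Each "expert" is a random linear map standing in for a feed-forward block.
expert_mats = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [(lambda M: (lambda v: M @ v))(M) for M in expert_mats]
print(moe_forward(rng.normal(size=d), gate_W, experts).shape)  # (16,)
```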
Fine‑Tuning Techniques
Learning to Summarize from Human Feedback – https://arxiv.org/abs/2009.01325
Self‑Instruct: Aligning Language Models with Self‑Generated Instructions – https://arxiv.org/abs/2212.10560
Scaling Down to Scale Up: A Guide to Parameter‑Efficient Fine‑Tuning – https://arxiv.org/abs/2303.15647
LoRA: Low‑Rank Adaptation of Large Language Models – https://arxiv.org/abs/2106.09685
VeRA: Vector‑based Random Matrix Adaptation – https://arxiv.org/abs/2310.11454
QLoRA: Efficient Finetuning of Quantized LLMs – https://arxiv.org/abs/2305.14314
Chain of Hindsight Aligns Language Models with Feedback – https://arxiv.org/abs/2302.02676
Beyond Human Data: Scaling Self‑Training for Problem‑Solving with Language Models – https://arxiv.org/abs/2312.06585
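LoRA freezes the pretrained weight W and learns a low-rank update ΔW = BA scaled by α/r, so the adapted forward pass is Wx + (α/r)BAx. A NumPy sketch matching the paper's initialization (A random, B zero, so training starts exactly at the frozen model):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
        self.B = np.zeros((d_out, r))                # trainable, zero init => delta = 0
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(np.eye(64))
# Identical to the frozen layer until B receives gradient updates.
print(layer(np.ones(64)).shape)  # (64,)
```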
Performance Optimization
Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) – https://arxiv.org/abs/2309.06180
FlashAttention: Fast and Memory‑Efficient Exact Attention with IO‑Awareness – https://arxiv.org/abs/2205.14135
S‑LoRA: Serving Thousands of Concurrent LoRA Adapters – https://arxiv.org/abs/2311.03285
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism – https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism – https://arxiv.org/abs/1909.08053
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models – https://arxiv.org/abs/1910.02054
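PagedAttention, the idea behind vLLM, manages the KV cache the way an OS manages virtual memory: fixed-size blocks in a shared pool, with a per-sequence block table mapping logical token positions to physical blocks. A toy allocator capturing just that bookkeeping; block and pool sizes are illustrative, and real systems store actual key/value tensors in each block.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id: int):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full: grab a new one
            if not self.free:
                raise MemoryError("KV pool exhausted; evict or preempt a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id: int, pos: int):
        block = self.tables[seq_id][pos // BLOCK_SIZE]
        return block, pos % BLOCK_SIZE        # (physical block id, offset in block)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):
    cache.append_token(seq_id=0)
# Two blocks in use; token 17 lives at offset 1 of the sequence's second block.
print(cache.tables[0], cache.physical_slot(0, 17))
```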
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
