
Large Language Model Upgrade Paths and Architecture Selection

This article analyzes upgrade paths of major LLMs—ChatGLM, LLaMA, Baichuan—detailing performance, context length, and architectural changes, then examines essential capabilities, data cleaning, tokenizer and attention design, and offers practical guidance for balanced scaling and efficient model construction.

DaTaobao Tech
This article provides a comprehensive analysis of the upgrade paths of large language models (LLMs) including ChatGLM, LLaMA, and Baichuan, along with an in-depth exploration of LLM architecture selection. It systematically examines the key elements of large-scale pre-trained models, offering valuable insights for building more powerful, flexible, and efficient models in practical engineering applications.

The article begins by analyzing the upgrade trajectories of three major models. For ChatGLM, it details the evolution from ChatGLM-6B to ChatGLM2-6B, highlighting improvements in performance (20-30% gains across benchmarks), longer context lengths (2K to 32K), enhanced inference efficiency through Multi-Query Attention, and more open licensing. The technical upgrades include structural changes from Prefix-LM to pure Decoder-Only architecture, sequence length extensions, and operator optimizations like Flash Attention.
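The Multi-Query Attention mentioned here is the main lever behind ChatGLM2's inference gains: all query heads share a single key/value head, so the KV cache shrinks by a factor of the head count. A minimal sketch (function name, shapes, and the omission of causal masking are my own simplifications, not from the article):

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-Query Attention sketch: n_heads query heads share ONE key/value
    head, so only a single K and V per token must be cached at inference.
    Shapes: x (seq, d_model), Wq (d_model, n_heads*d_head), Wk/Wv (d_model, d_head)."""
    seq, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(seq, n_heads, d_head)  # per-head queries
    k = x @ Wk                                  # single shared key head
    v = x @ Wv                                  # single shared value head
    out = np.empty((seq, n_heads, d_head))
    for h in range(n_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)      # row-wise softmax
        out[:, h, :] = w @ v                    # every head reads the same K/V
    return out.reshape(seq, n_heads * d_head)
```

Compared with standard multi-head attention, only the projections for K and V change size; the number of query heads, and hence model expressivity on the query side, is untouched.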

For LLaMA, the analysis covers the transition from LLaMA to LLaMA2, noting improvements in training data (1.4T to 2T tokens), sequence length (2K to 4K), and the introduction of Grouped-Query Attention for better scalability. The article details the three-stage training process: pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), with particular emphasis on the quality of SFT data collection and the dual-reward model approach in RLHF.
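Grouped-Query Attention interpolates between the two extremes: with `n_kv_heads` equal to the head count it reduces to standard multi-head attention, and with one KV head it reduces to Multi-Query Attention. A hedged sketch under my own naming and shape conventions (causal masking again omitted for brevity):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    """GQA sketch: query heads are split into n_kv_heads groups, each group
    sharing one K/V head. q: (seq, n_heads, d_head); k, v: (seq, n_kv_heads, d_head)."""
    seq, _, d_head = q.shape
    group = n_heads // n_kv_heads                # query heads per shared K/V head
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                          # shared K/V head for this query head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax
        out[:, h] = w @ v[:, kv]
    return out.reshape(seq, n_heads * d_head)
```

The KV cache scales with `n_kv_heads` rather than `n_heads`, which is why LLaMA2's larger variants can serve long contexts with far less memory per token.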

The Baichuan analysis focuses on the progression from Baichuan-7B to Baichuan-13B, emphasizing parameter scaling (doubling), increased training data (1.2T to 1.4T tokens), and architectural improvements like switching from RoPE to ALiBi positional encoding. The article highlights Baichuan's focus on efficient tokenization for Chinese text and the release of both pre-trained and aligned models.
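The switch from RoPE to ALiBi replaces rotated embeddings with a head-specific linear penalty added directly to attention scores, which tends to extrapolate to longer sequences than seen in training. A minimal sketch of the bias matrix, assuming the standard geometric slope schedule for a power-of-two head count (the function name is mine):

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """ALiBi sketch: score(i, j) gets a penalty -m_h * (i - j) for attending
    to a token j positions in the past; slopes m_h form a geometric series."""
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]           # i - j: distance into the past
    return -slopes[:, None, None] * dist[None]   # (n_heads, seq_len, seq_len)
```

Because the penalty grows linearly with distance and no positional information is baked into the embeddings, the same bias formula applies unchanged at any sequence length.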

The article then transitions to discussing the essential capabilities required for large models, including foundational knowledge enhancement through parameter scaling and data quality improvement, sequence length extension via position encoding design, and model structure optimization. It provides detailed guidance on data cleaning techniques, including filtering invalid data, document length filtering, machine-generated content removal, deduplication, contamination control, toxicity and bias management, and personal information protection.
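Of the cleaning steps listed, deduplication is the most mechanical. As a toy illustration of the idea (exact hashing after whitespace normalization; a hypothetical stand-in for the fuzzier MinHash/LSH pipelines used at corpus scale, and not code from the article):

```python
import hashlib

def dedup_documents(docs):
    """Keep the first occurrence of each document, treating texts that differ
    only in case or whitespace as duplicates. Exact-match only: a sketch, not
    the near-duplicate detection a production pipeline would use."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

The other steps (length filtering, contamination control, PII removal) follow the same pattern: a cheap per-document predicate applied in a streaming pass over the corpus.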

Key architectural considerations are thoroughly examined, including tokenizer design (with emphasis on multilingual and domain-specific optimization), LayerNorm variants (Pre-LN vs Post-LN), activation functions (ReLU, GELU, SwiGLU, GeGLU), attention mechanisms (Flash Attention, Multi-Query Attention), and positional encoding methods (RoPE, ALiBi). The article provides mathematical formulations and implementation details for these components.
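Among the activation variants listed, SwiGLU is the one adopted by LLaMA-style feed-forward blocks, and it is compact enough to show in full. A hedged sketch with my own function and weight names: FFN(x) = (Swish(xW) * xV) W2, where Swish(z) = z * sigmoid(z):

```python
import numpy as np

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward sketch: a gated unit where the Swish (SiLU)
    branch multiplicatively gates a linear branch before the down-projection.
    x: (batch, d_model), W/V: (d_model, d_ff), W2: (d_ff, d_model)."""
    gate = x @ W
    swish = gate / (1.0 + np.exp(-gate))  # Swish(z) = z * sigmoid(z)
    return (swish * (x @ V)) @ W2         # elementwise gate, then project down
```

The gating doubles the up-projection parameter count versus a plain ReLU FFN, which is why implementations typically shrink `d_ff` by 2/3 to keep parameters constant.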

The article concludes with practical recommendations for building effective base models, emphasizing the importance of balanced scaling between model parameters, training data, and computational resources. It provides a comprehensive framework for understanding and implementing large-scale pre-trained models, making it an invaluable resource for researchers and practitioners in the field of artificial intelligence.
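The balanced-scaling recommendation can be made concrete with the Chinchilla-style rule of thumb (my assumption here, not a formula from the article): training compute is roughly C ≈ 6·N·D FLOPs, and compute-optimal training uses roughly D ≈ 20·N tokens, so both follow from a FLOP budget:

```python
def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Chinchilla-style sketch: with C ~= 6*N*D and D ~= 20*N, solving for N
    gives N = sqrt(C / (6 * 20)). Returns (params, tokens) for a FLOP budget."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params
```

Under this heuristic, a 7B-parameter model "wants" roughly 140B tokens at the compute optimum; that LLaMA2-7B trains on 2T tokens reflects a deliberate trade of extra training compute for cheaper inference, not a violation of the rule.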

Tags: model optimization, large language models, LLaMA, attention mechanisms, Baichuan, ChatGLM, data preprocessing, LLM architecture, model scaling, positional encoding
Written by DaTaobao Tech, the official account of DaTaobao Technology.
