How HiFT Slashes GPU Memory for LLM Fine‑Tuning with Hierarchical Optimization

HiFT introduces a layer‑wise hierarchical fine‑tuning strategy that freezes most parameters per step, reduces optimizer state memory, and adapts mixed‑precision training, enabling 7B and 13B models to be fine‑tuned on 16‑31 GB GPUs while maintaining competitive performance.

NewBeeNLP

Feb 5, 2024

Background

Before large language models (LLMs), full‑parameter fine‑tuning was the default way to adapt a language model to downstream tasks. At LLM scale, full‑parameter fine‑tuning became prohibitively memory‑intensive, which drove the rise of Parameter‑Efficient Fine‑Tuning (PEFT) methods such as LoRA. These methods need far less GPU memory and come close to full‑parameter fine‑tuning, but their results still lag behind it.

Recent work has explored memory‑efficient full‑parameter fine‑tuning with optimizers that keep no optimizer state, such as the zeroth‑order optimizer MeZO, and LOMO. MeZO can fine‑tune a 30B model on a single 80 GB device, but it is often less stable than AdamW, and LOMO requires two forward passes.

Where Memory Is Consumed

During fine‑tuning, GPU memory is dominated by four components: model parameters, gradients, optimizer states, and the remaining activation/intermediate buffers. Model parameters must reside on the GPU for forward passes. Optimizer state size depends on the optimizer: AdamW keeps first‑ and second‑moment estimates for every parameter, so its state is roughly twice the parameter size, while plain SGD without momentum keeps none. Activation memory grows with sequence length and batch size.
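As a rough back‑of‑the‑envelope illustration of the static components listed above, the sketch below estimates parameter, gradient, and optimizer‑state memory for a model held in 32‑bit precision. The function name, the 7‑billion parameter count, and the convention 1 GB = 10^9 bytes are assumptions made for illustration; activations are left out because they depend on batch size and sequence length.

```python
def static_memory_gb(n_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Estimate static training memory in GB (assumes 1 GB = 1e9 bytes)."""
    params = n_params * bytes_per_value / 1e9
    grads = n_params * bytes_per_value / 1e9
    # AdamW keeps two moment estimates per parameter; plain SGD keeps none.
    optim = n_params * bytes_per_value * optimizer_states_per_param / 1e9
    return {
        "params": params,
        "grads": grads,
        "optimizer": optim,
        "total": params + grads + optim,
    }


# Hypothetical 7B-parameter model, everything stored in 32-bit floats.
adamw = static_memory_gb(7_000_000_000)
sgd = static_memory_gb(7_000_000_000, optimizer_states_per_param=0)
print("AdamW:", adamw)
print("SGD:  ", sgd)
```

Under these assumptions the AdamW optimizer state alone is as large as the parameters and gradients combined, which is why it is a natural target for memory savings.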

Figure 1 illustrates the HiFT strategy: layers are divided into K groups, and one of three training orders (bottom‑to‑up, top‑to‑down, or random) determines which group is active at each step. Frozen groups compute no gradients, and their optimizer states are kept off the GPU.
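The grouping and the three orders can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the helper names, the contiguous near‑equal partitioning, the strategy labels, and the choice to shuffle once with a fixed seed are all assumptions.

```python
import random


def partition_layers(num_layers, k):
    """Split layer indices 0..num_layers-1 into k contiguous, near-equal groups."""
    if not 1 <= k <= num_layers:
        raise ValueError("k must be between 1 and num_layers")
    base, extra = divmod(num_layers, k)
    groups, start = [], 0
    for g in range(k):
        size = base + (1 if g < extra else 0)  # first `extra` groups get one more layer
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def group_order(k, strategy, seed=0):
    """Return the order in which the k groups are visited in one pass."""
    order = list(range(k))  # group 0 is assumed to be closest to the input
    if strategy == "bottom_up":
        return order
    if strategy == "top_down":
        return order[::-1]
    if strategy == "random":
        random.Random(seed).shuffle(order)
        return order
    raise ValueError(f"unknown strategy: {strategy}")


groups = partition_layers(num_layers=12, k=4)
for strategy in ("bottom_up", "top_down", "random"):
    print(strategy, [groups[g] for g in group_order(4, strategy)])
```

Each entry in the printed order is the set of layers that would be trainable at that step; every other layer would be frozen.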

How HiFT Reduces Memory

HiFT partitions the model into K groups (K ≤ number of layers). In each training step, only one group is updated while all other groups remain frozen. Gradients are therefore computed for a single group, and only that group’s optimizer states need to be on the GPU; the rest stay in CPU memory. Consequently, on top of the model parameters and activations, peak memory is set by the gradients and optimizer states of the largest group rather than of the whole model.
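To make the saving concrete, here is a minimal estimate of static GPU memory when only one group is trainable at a time. It assumes all parameters stay on the GPU for the forward pass, that gradients and AdamW states exist only for the active group, and that 1 GB = 10^9 bytes. The function name and the model split (7B parameters in 32 equal groups) are hypothetical, and activations are ignored.

```python
def hift_static_memory_gb(group_param_counts, bytes_per_value=4,
                          optimizer_states_per_param=2):
    """Static GPU memory in GB when one group at a time is trainable."""
    total_params = sum(group_param_counts)
    largest_group = max(group_param_counts)  # worst case over all steps
    params = total_params * bytes_per_value / 1e9
    grads = largest_group * bytes_per_value / 1e9
    optim = largest_group * bytes_per_value * optimizer_states_per_param / 1e9
    return {
        "params": params,
        "grads": grads,
        "optimizer": optim,
        "total": params + grads + optim,
    }


# Hypothetical split: 7B parameters divided evenly into 32 groups.
group_sizes = [7_000_000_000 // 32] * 32
estimate = hift_static_memory_gb(group_sizes)
print(estimate)
```

In this simplified model the parameters themselves become the dominant static cost, while gradients and optimizer states shrink by roughly a factor of K.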

Learning‑rate updates are delayed until all groups have been updated once, preventing large learning‑rate jumps that could destabilize training.
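One way to picture the delayed update is a schedule whose value changes only once every K steps. The sketch below is an assumption‑laden illustration: the function name and the linear decay are hypothetical choices, and the only property it is meant to show is that all K groups in a pass see the same learning rate.

```python
def delayed_lr(step, k, base_lr, total_steps):
    """Learning rate at a 0-based step, refreshed only after every k steps.

    Linear decay is an assumed schedule, chosen only for illustration.
    """
    cycle = step // k                        # completed passes over all groups
    total_cycles = max(1, total_steps // k)
    return base_lr * max(0.0, 1.0 - cycle / total_cycles)


k = 4  # number of groups
lrs = [delayed_lr(s, k, base_lr=1e-4, total_steps=12) for s in range(12)]
for step, lr in enumerate(lrs):
    print(step, lr)
```

With an ordinary per‑step schedule, the first and last groups in a pass would be trained with different learning rates; stepping the schedule once per pass avoids that.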

Figure 2 shows GPU memory usage when fine‑tuning LLaMA‑2‑7B on the E2E dataset (batch = 1, seq‑len = 512). Mixed‑precision results without hierarchical adaptation are compared to HiFT’s hierarchical mixed‑precision, which fits a 7B model into ~16.9 GB and a 13B model into ~31 GB.

Mixed‑Precision Challenges for Large Models

Mixed‑precision training reduces dynamic (activation) memory by running forward passes in half precision, but weight updates still require a 32‑bit copy of the weights to avoid underflow. For very large models, the dynamic memory saved by half precision can be outweighed by the extra 16 GB of static memory needed to store the 32‑bit weights, especially when batch sizes are small.

Experiments show that for models around 3 B parameters, mixed‑precision training offers little memory benefit at small batch sizes, while for smaller models (e.g., GPT‑large) the benefit persists.
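The static cost of keeping two copies of the weights is easy to estimate. The sketch below assumes a standard mixed‑precision layout (a 32‑bit copy for updates plus a 16‑bit copy for the forward pass), a hypothetical 3B‑parameter model, and 1 GB = 10^9 bytes; it counts weights only and ignores gradients, optimizer states, and activations.

```python
def weight_memory_gb(n_params, mixed_precision):
    """Static weight memory in GB (assumes 1 GB = 1e9 bytes)."""
    fp32_copy = n_params * 4 / 1e9  # 32-bit weights used for the update
    # Half-precision copy used for the forward pass, only in mixed precision.
    fp16_copy = n_params * 2 / 1e9 if mixed_precision else 0.0
    return fp32_copy + fp16_copy


n = 3_000_000_000  # hypothetical 3B-parameter model
standard = weight_memory_gb(n, mixed_precision=False)
mixed = weight_memory_gb(n, mixed_precision=True)
print("standard:", standard, "GB")
print("mixed:   ", mixed, "GB")
print("extra:   ", mixed - standard, "GB")
```

The extra static memory is fixed by the parameter count, whereas the activation savings scale with batch size and sequence length, so at small batch sizes the extra copy can dominate.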

Hierarchical Mixed‑Precision Adaptation

The latest results extend HiFT with hierarchical mixed‑precision, bringing GPU memory down to 16.87 GB for a 7B model and about 31 GB for a 13B model (batch = 1, seq‑len = 512). The implementation builds on Hugging Face libraries, will be open‑sourced, and is compatible with LoRA and other PEFT techniques.

Figure 3(a) compares HiFT strategies (bottom‑to‑up, top‑to‑down, random) on RoBERTa‑base; (b) shows the effect of different group sizes. The performance differences are negligible, indicating that fine‑tuning order has little impact on final accuracy.

Comparison with MeZO and LOMO

Against MeZO, HiFT demonstrates a clear performance advantage on downstream tasks (see the original paper for detailed numbers).

Against LOMO on LLaMA‑2‑7B (batch = 1, seq‑len = 512) using the E2E dataset, HiFT’s peak memory is 16.87 GB (mixed‑precision) versus LOMO’s 21.57 GB; in single‑precision, HiFT uses 29.73 GB versus LOMO’s 60.06 GB.
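The relative savings implied by these figures can be checked with a few lines of arithmetic. The helper name is hypothetical, and the inputs are the peak‑memory numbers quoted above.

```python
def reduction_pct(baseline_gb, new_gb):
    """Percentage by which new_gb is smaller than baseline_gb."""
    return 100.0 * (baseline_gb - new_gb) / baseline_gb


mixed = reduction_pct(baseline_gb=21.57, new_gb=16.87)    # LOMO vs HiFT, mixed precision
single = reduction_pct(baseline_gb=60.06, new_gb=29.73)   # LOMO vs HiFT, single precision
print(f"mixed precision:  {mixed:.1f}% less peak memory than LOMO")
print(f"single precision: {single:.1f}% less peak memory than LOMO")
```

In relative terms, the gap is much larger in single precision than in mixed precision.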
