Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device
This article explains how to overcome GPU memory limits with the meta device introduced in PyTorch 1.9: create an empty model, load large‑scale model weights from disk layer by layer, move each layer to a 16 GB GPU just long enough to run it, and then release its memory, so that a 70B FP16 model can run on a single consumer‑grade GPU.
Background
Running a 70‑billion‑parameter large language model (LLM) in fp16 normally requires about 140 GB of GPU memory, which exceeds the capacity of a typical 16 GB consumer GPU. The traditional loading pipeline creates the model, loads the full state_dict into CPU RAM, copies the weights to the GPU, and then runs inference. For the largest models the second step alone can need more than 1 TB of RAM, making the process impractical on limited hardware.
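The 140 GB figure is simple arithmetic: parameter count times bytes per parameter. A quick sanity check in plain Python (the helper name is illustrative, not from the original code):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9, 2))  # fp16 is 2 bytes per parameter -> 140.0
print(weight_memory_gb(70e9, 4))  # fp32 would double that        -> 280.0
```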
Why a 16 GB GPU Is Insufficient
Without special techniques, a 70B model would need at least four 40 GB A100 GPUs, plus ample CPU RAM to hold the parameters and the intermediate copies made during transfer.
Solution Overview
The key idea is to avoid loading the entire model into memory at once. PyTorch 1.9 introduced a meta device that allows tensors to be instantiated with only a shape, consuming no actual memory. By creating an empty model on the meta device, loading individual layer weights from disk to CPU, moving each layer to the GPU just before it is needed, and then freeing the memory, the whole 70 B model can be inferred on a single 16 GB GPU.
Step‑by‑Step Workflow
Create an empty (weight‑less) model on the meta device.
Determine the target device (CPU or GPU) for each layer.
Load a portion of the weights into CPU RAM.
Insert the loaded weights into the empty model.
Transfer the layer to the GPU for inference.
Repeat steps 3‑5 until all layers have been processed.
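The workflow above can be sketched framework-agnostically. In this toy stand-in, "layers" are just numbers and applying one is an addition; `layers_on_disk` and `load_from_disk` are hypothetical stand-ins for the real shard store and weights loader:

```python
# Toy model of streaming inference: process one layer at a time, never all at once
layers_on_disk = {"embed": 1, "layer_0": 2, "layer_1": 3, "head": 4}

def load_from_disk(name):
    # Step 3: bring one layer's weights into CPU memory
    return layers_on_disk[name]

def run_streaming(x, layer_names):
    for name in layer_names:
        weights = load_from_disk(name)  # steps 3-4: load and insert into the model
        x = x + weights                 # step 5: run this layer (on the GPU in the real code)
        del weights                     # free the layer before loading the next one
    return x

print(run_streaming(0, ["embed", "layer_0", "layer_1", "head"]))  # 10
```

Peak memory is a single layer plus activations rather than the whole model, which is what makes the 16 GB budget workable.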
Meta Device Basics
A meta tensor stores only its shape; no data is allocated on CPU or GPU. This enables the creation of arbitrarily large tensors without exhausting physical memory. Example:
import torch

large_tensor = torch.randn(100000, 100000, device="meta")  # ~40 GB if materialized in fp32, but allocates nothing
print(large_tensor.is_meta)  # True: only the shape is stored

Key Helper Functions
def load_layer_to_cpu(self, layer_name):
    # Ask the weights loader to read this layer's shard from disk into CPU RAM
    self.weights_loader.set_state_dict(layer_name, self.device)
    state_dict = self.weights_loader.get_state_dict(self.device)
    # The checkpoint stores the head as "value_head"; remap it to the model's lm_head
    if "value_head.weight" in state_dict:
        state_dict = {"lm_head.weight": state_dict["value_head.weight"]}
    return state_dict
def move_layer_to_device(self, state_dict):
    # Materialize each meta parameter on the target device with the loaded values
    for param_name, param in state_dict.items():
        assert param.dtype != torch.int8, "int8 not supported (need to add fp16_statistics)"
        set_module_tensor_to_device(self.model, param_name, self.device, value=param, dtype=self.dtype)
def clean_memory():
    gc.collect()
    ctypes.CDLL("libc.so.6").malloc_trim(0)  # return freed heap pages to the OS (glibc/Linux only)
    torch.cuda.empty_cache()

ShardedLlama Class
import ctypes
import gc
from functools import partial
from pathlib import Path

import numpy as np
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights
from accelerate.utils.modeling import set_module_tensor_to_device
from safetensors.torch import load_file
from optimum.bettertransformer import BetterTransformer
class ShardedLlama:
    def __init__(self, checkpoint_path, weights_loader, device="cuda:0", dtype=torch.float16):
        self.checkpoint_path = Path(checkpoint_path)
        self.weights_loader = weights_loader
        self.device = device
        self.dtype = dtype
        self.config = AutoConfig.from_pretrained(self.checkpoint_path)
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"
        self.init_model()
        # Processing order: embedding, every decoder layer, final norm, head
        self.layer_names = ["model.embed_tokens"] + [f"model.layers.{i}" for i in range(len(self.model.model.layers))] + ["model.norm", "value_head"]

    def init_model(self):
        # Build the model skeleton on the meta device: shapes only, no weight memory
        with init_empty_weights():
            self.model = AutoModelForCausalLM.from_config(self.config)
            self.model.lm_head = torch.nn.Linear(8192, 8, bias=False)  # task-specific head
            self.model.eval()
            self.model = BetterTransformer.transform(self.model)  # fused attention kernels
            self.model.tie_weights()
        self.layers = [self.model.model.embed_tokens] + list(self.model.model.layers) + [self.model.model.norm, self.model.lm_head]
        # Buffers are not stored in the checkpoint, so materialize them on the device now
        for buffer_name, buffer in self.model.named_buffers():
            set_module_tensor_to_device(self.model, buffer_name, self.device, value=buffer, dtype=self.dtype)

    # load_layer_to_cpu, move_layer_to_device, and __call__ are defined as shown above

Inference Loop
def run_model(device, df, weights_loader):
    # checkpoint_path, N_BATCHES, and get_tokens are defined elsewhere in the notebook
    model = ShardedLlama(checkpoint_path, weights_loader, device=device)
    f = partial(get_tokens, tokenizer=model.tokenizer)
    inputs = df.apply(f, axis=1).values
    batches = np.array_split(inputs, N_BATCHES)
    outputs = []
    for batch in batches:
        outputs += model(batch)
    return outputs

Memory Management
After each layer is processed, the layer tensor is moved back to the meta device and clean_memory() is called to free both CPU and GPU memory, preventing accumulation of intermediate buffers.
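That release step can be sketched with a plain module: calling `.to("meta")` discards a layer's storage while keeping its shapes, after which `clean_memory()` hands the freed pages back. The `Linear` layer here is a stand-in for an actual decoder layer:

```python
import torch

layer = torch.nn.Linear(1024, 1024)  # a stand-in layer holding real CPU memory
assert not layer.weight.is_meta

layer.to("meta")  # drop the storage: only shapes remain, as at initialization
assert layer.weight.is_meta
# In the real loop, clean_memory() runs here before the next layer is loaded
```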
Reference
The full implementation can be viewed at https://www.kaggle.com/code/simjeg/platypus2-70b-without-wikipedia-rag
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.