Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device

This article explains how to overcome GPU memory limits by using the meta device introduced in PyTorch 1.9: create an empty (weight-less) model, load the weights of a large model layer by layer, move each layer to a 16 GB GPU only for its forward pass, and then release the memory, enabling a 70B FP16 model to run on a single consumer-grade GPU.


Background

Running a 70‑billion‑parameter large language model (LLM) in fp16 normally requires about 140 GB of GPU memory, which exceeds the capacity of a typical 16 GB consumer GPU. The traditional loading pipeline creates the model, loads the full state_dict into CPU RAM, copies the weights to the GPU, and then runs inference. For the largest models the second step alone can need more than 1 TB of RAM, making the process impractical on limited hardware.
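For contrast, the conventional pipeline looks roughly like the sketch below (the checkpoint name is a placeholder): every weight is first materialized in CPU RAM and then copied to the GPU, which is exactly where a 16 GB card fails.

import torch
from transformers import AutoModelForCausalLM

# Conventional loading: the full fp16 state_dict is materialized in CPU RAM,
# then copied to the GPU -- roughly 140 GB of weights for a 70B model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # placeholder checkpoint name
    torch_dtype=torch.float16,
)
model = model.to("cuda:0")  # raises an out-of-memory error on a 16 GB GPU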

Why a 16 GB GPU Is Insufficient

Without special techniques a 70 B model would need at least four 40 GB A100 GPUs plus ample CPU RAM to hold parameters and intermediate copies during transfer.
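A quick back-of-the-envelope check of the weight memory alone (activations and the KV cache come on top of this):

# fp16 stores each parameter in 2 bytes, so the weights alone need ~140 GB.
n_params = 70e9         # 70 billion parameters
bytes_per_param = 2     # fp16
print(f"{n_params * bytes_per_param / 1e9:.0f} GB")  # -> 140 GB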

Solution Overview

The key idea is to avoid loading the entire model into memory at once. PyTorch 1.9 introduced a meta device that allows tensors to be instantiated with only a shape, consuming no actual memory. By creating an empty model on the meta device, loading individual layer weights from disk to CPU, moving each layer to the GPU just before it is needed, and then freeing the memory, the whole 70 B model can be inferred on a single 16 GB GPU.

Step‑by‑Step Workflow

1. Create an empty (weight-less) model on the meta device.
2. Determine the target device (CPU or GPU) for each layer.
3. Load a portion of the weights into CPU RAM.
4. Insert the loaded weights into the empty model.
5. Transfer the layer to the GPU for inference.
6. Repeat steps 3-5 until all layers have been processed.

Meta Device Basics

A meta tensor stores only its shape; no data is allocated on CPU or GPU. This enables the creation of arbitrarily large tensors without exhausting physical memory. Example:

import torch

# A 100000 x 100000 fp32 tensor would need ~40 GB if materialized; on the meta
# device only the shape is recorded and no memory is allocated.
large_tensor = torch.randn(100000, 100000, device="meta")
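The same idea extends from a single tensor to an entire model: the init_empty_weights context manager from accelerate (used by ShardedLlama below) creates every parameter on the meta device, so the full 70B skeleton costs essentially no memory. A minimal illustration, assuming checkpoint_path points at the model directory as in the class below:

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the whole model skeleton without allocating any weight memory.
config = AutoConfig.from_pretrained(checkpoint_path)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)
print(next(empty_model.parameters()).device)  # meta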

Key Helper Functions

# load_layer_to_cpu and move_layer_to_device are methods of the ShardedLlama
# class defined below; clean_memory is a module-level helper.
import ctypes
import gc

def load_layer_to_cpu(self, layer_name):
    # Read one layer's weights from disk into CPU RAM through the weights loader.
    self.weights_loader.set_state_dict(layer_name, self.device)
    state_dict = self.weights_loader.get_state_dict(self.device)
    # If the checkpoint stores the head as "value_head", rename the weight so it
    # can be loaded into the model's lm_head module.
    if "value_head.weight" in state_dict:
        state_dict = {"lm_head.weight": state_dict["value_head.weight"]}
    return state_dict

def move_layer_to_device(self, state_dict):
    # Attach the loaded weights to the meta-device model and place them on the GPU.
    for param_name, param in state_dict.items():
        assert param.dtype != torch.int8, "int8 not supported (need to add fp16_statistics)"
        set_module_tensor_to_device(self.model, param_name, self.device, value=param, dtype=self.dtype)

def clean_memory():
    # Release Python objects, return freed heap pages to the OS, and empty the CUDA cache.
    gc.collect()
    ctypes.CDLL("libc.so.6").malloc_trim(0)
    torch.cuda.empty_cache()

ShardedLlama Class

from pathlib import Path

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights
from accelerate.utils.modeling import set_module_tensor_to_device
from safetensors.torch import load_file
from optimum.bettertransformer import BetterTransformer

class ShardedLlama:
    def __init__(self, checkpoint_path, weights_loader, device="cuda:0", dtype=torch.float16):
        self.checkpoint_path = Path(checkpoint_path)
        self.weights_loader = weights_loader
        self.device = device
        self.dtype = dtype
        self.config = AutoConfig.from_pretrained(self.checkpoint_path)
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"
        self.init_model()
        self.layer_names = ["model.embed_tokens"] + [f"model.layers.{i}" for i in range(len(self.model.model.layers))] + ["model.norm", "value_head"]

    def init_model(self):
        # Build the model skeleton on the meta device: no weights are materialized.
        with init_empty_weights():
            self.model = AutoModelForCausalLM.from_config(self.config)
            # Task-specific head: hidden size 8192 mapped to 8 outputs (as in the referenced notebook).
            self.model.lm_head = torch.nn.Linear(8192, 8, bias=False)
            self.model.eval()
            self.model = BetterTransformer.transform(self.model)
            self.model.tie_weights()
        # Flat list of layers in execution order: embeddings, decoder blocks, final norm, head.
        self.layers = [self.model.model.embed_tokens] + list(self.model.model.layers) + [self.model.model.norm, self.model.lm_head]
        # Buffers are small, so move them to the target device once and keep them there.
        for buffer_name, buffer in self.model.named_buffers():
            set_module_tensor_to_device(self.model, buffer_name, self.device, value=buffer, dtype=self.dtype)

    # load_layer_to_cpu and move_layer_to_device are the methods shown in the
    # helper section above; __call__ runs the per-layer loop sketched below.
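The __call__ method is not reproduced in this article. Below is a minimal sketch of what the per-layer loop could look like, following the workflow above; the full implementation in the referenced Kaggle notebook additionally builds attention masks and position ids, prefetches the next layer on a background thread, and keeps only the hidden state the head actually needs. Treat the batch handling here as illustrative, not as the notebook's exact code.

# Sketch of ShardedLlama.__call__: stream the model through the GPU one layer
# at a time, so only a single layer's weights are ever resident in GPU memory.
def __call__(self, batch):
    # batch: list of tokenized input_ids tensors (see get_tokens in run_model)
    batch = [x.to(self.device) for x in batch]
    with torch.inference_mode():
        for layer_name, layer in zip(self.layer_names, self.layers):
            # Steps 3-5: disk -> CPU RAM -> GPU, then run this layer only.
            state_dict = self.load_layer_to_cpu(layer_name)
            self.move_layer_to_device(state_dict)
            if layer_name in ("model.embed_tokens", "model.norm", "value_head"):
                batch = [layer(x) for x in batch]
            else:
                # Decoder layers return a tuple; keep the hidden states. The real
                # code also passes an attention mask and position ids here.
                batch = [layer(x)[0] for x in batch]
            # Free the layer: send its parameters back to the meta device and
            # release CPU/GPU memory before loading the next layer.
            for param_name in state_dict:
                set_module_tensor_to_device(self.model, param_name, "meta")
            clean_memory()
    return [x.cpu() for x in batch]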

Inference Loop

from functools import partial

import numpy as np

def run_model(device, df, weights_loader):
    # checkpoint_path, get_tokens, and N_BATCHES are defined in the full notebook.
    model = ShardedLlama(checkpoint_path, weights_loader, device=device)
    f = partial(get_tokens, tokenizer=model.tokenizer)
    inputs = df.apply(f, axis=1).values
    # Split the prompts into batches; each batch makes one full pass through all layers.
    batches = np.array_split(inputs, N_BATCHES)
    outputs = []
    for batch in batches:
        outputs += model(batch)
    return outputs
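A hypothetical invocation is shown below: df holds one prompt per row and weights_loader is the disk-backed shard reader defined in the full notebook. With two GPUs available, the same function can be mapped over both halves of the data with a thread pool (a sketch, not necessarily the notebook's exact parallelization).

# Hypothetical single-GPU invocation.
outputs = run_model("cuda:0", df, weights_loader)

# Sketch: split the data across two GPUs with one thread per device
# (a shared weights_loader may need its own locking).
from concurrent.futures import ThreadPoolExecutor

df_halves = np.array_split(df, 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    outputs = sum(pool.map(run_model, ["cuda:0", "cuda:1"], df_halves, [weights_loader] * 2), [])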

Memory Management

After each layer is processed, its parameters are moved back to the meta device and clean_memory() is called, freeing both CPU and GPU memory and preventing intermediate buffers from accumulating.

Reference

The full implementation is available at https://www.kaggle.com/code/simjeg/platypus2-70b-without-wikipedia-rag

Tags: PyTorch, GPU memory optimization, meta device, model sharding